Cache Coherence Part II Scalable Approaches Hierarchical Cache Coherence CS 740 November 3 2003 P P P C1 C1 C1 C2 C2 a Topics b Hierarchies arise in different ways Hierarchies Directory Protocols a A processor with an on chip and external cache single cache hierarchy b Large scale multiprocessor using a hierarchy of buses multicache hierarchy Slides based on Todd C Mowry Culler Singh 2 Single Cache Hierarchies CS 740 F 03 Violations of Inclusion The two caches L1 L2 may choose to replace different block P C1 C2 Inclusion property Everything in L1 cache is also present in L2 cache L2 must also be owner of block if L1 has the block dirty Snoop of L2 takes responsibility for recalling or invalidating data due to remote requests Differences in reference history set associative first level cache with LRU replacement example blocks m1 m2 m3 fall in same set of L1 cache Split higher level caches instruction data blocks go in different caches at L1 but may collide in L2 what if L2 is set associative Differences in block size But a common case works automatically L1 direct mapped fewer sets than in L2 and block size same 3 CS 740 F 03 4 CS 740 F 03 Preserving Inclusion Explicitly Hierarchical Snoopy Cache Coherence Simplest way to build large scale cache coherent MPs is to use a hierarchy of buses and use snoopy coherence at each level Two possible ways to build such a machine Propagate lower level L2 replacements to higher level L1 Invalidate or flush if dirty messages Propagate bus transactions from L2 to L1 a All main memory at the global B2 bus b Main memory distributed among the clusters Propagate all transactions or use inclusion bits Propagate modified state from L1 to L2 on writes Write through L1 or modified but stale bit per block in L2 cache Correctness issues altered Not really if all propagation occurs correctly and is waited for Writes commit when they reach the bus acknowledged immediately But performance problems so want to not wait for propagation P P P P P L1 L1 L1 L1 L1 B1 CS 740 F 03 Hierarchies with Global Memory P P P P L1 L1 L1 L1 B1 P L1 L1 L2 L2 B2 B1 B1 Memory L2 L2 Main Memory Mp 6 b CS 740 F 03 Hierarchies w Global Mem Cont Misses to main memory just require single traversal to the root of the hierarchy Placement of shared data is not an issue L2 B2 Disadvantages Main Memory Mp Misses to local data structures e g stack also have to traverse the hierarchy resulting in higher traffic and latency Memory at the global bus must be highly interleaved Otherwise bandwidth to it will not scale First level caches Highest performance SRAM caches B1 follows standard snoopy protocol Second level caches Much larger than L1 caches set assoc Must maintain inclusion L2 cache acts as filter for B1 bus and L1 caches L2 cache can be DRAM based since fewer references get to it 7 Memory B2 Advantages B1 L2 P L1 B1 a 5 P CS 740 F 03 8 CS 740 F 03 Cluster Based Hierarchies Encore Gigamax Motorola 88K processors P P P P L1 L1 L1 L1 B1 B1 Memory L2 L2 Memory P P C C 8 way interleaved memory P P C C Local 64 bit data 32 bit address Nano Bus split transaction 80ns cycles Tag and Data RAMS for remote data cached locally Two 16MB banks 4 way associative UCC UIC Tag RAM only for local data cached remotely UCC UIC Fiber optic link B2 Bit serial 4 bytes every 80ns Key idea Main memory is distributed among clusters reduces global bus traffic local data suitably placed shared data reduces latency less contention and local accesses are faster example machine Encore Gigamax Tag RAM only for remote data cached locally UIC UIC Global Nano Bus 64 bit data 32 bit address split transaction 80ns cycles L2 cache can be replaced by a tag only routercoherence switch 9 CS 740 F 03 10 Cache Coherence in Gigamax Write to local bus is passed to global bus if data allocated in remote Mp allocated local but present in some remote cache Read to local bus passed to global bus if allocated in remote Mp and not in cluster cache allocated local but dirty in a remote cache Write on global bus passed to local bus if allocated in to local Mp allocated remote but dirty in local cache CS 740 F 03 Hierarchies of Rings e g KSR Hierarchical ring network not bus Snoop on requests passing by on ring Point to point structure of ring implies potentially higher bandwidth than buses higher latency Many race conditions possible e g write back going out as request coming in 11 CS 740 F 03 12 CS 740 F 03 Hierarchies Summary Advantages Conceptually simple to build apply snooping recursively Can get merging and combining of requests in hardware Directory Based Cache Coherence Disadvantages Physical hierarchies do not provide enough bisection bandwidth the root becomes a bottleneck e g 2 d 3 d grid problems patch solution multiple buses rings at higher levels Latencies often larger than in direct networks 13 CS 740 F 03 Motivation for Directory Schemes Snoopy schemes do not scale because they rely on broadcast Directory based schemes allow scaling they avoid broadcasts by keeping track of all PEs caching a memory block and then using point to point messages to maintain coherence they allow the flexibility to use any scalable point to point network What Must a Coherent System Do Provide set of states state transition diagram actions Manage coherence protocol 0 Determine when to invoke coherence protocol a Find source of info about state of line in other caches whether need to communicate with other cached copies b Find out where the other copies are c Communicate with those copies inval update 0 is done the same way on all systems state of the line is maintained in the cache protocol is invoked if an access fault occurs on the line Different approaches distinguished by a to c 15 CS 740 F 03 16 CS 740 F 03 Basic Directory Scheme Bus based Coherence P P Cache Cache All of a b c done through broadcast on bus faulting processor sends out a search others respond to the search probe and take necessary action Memory Conceptually simple but broadcast doesn t scale with p on bus bus bandwidth doesn t scale on scalable network every fault leads to at least p network transactions Scalable coherence can have same cache states and state transition diagram different mechanisms to manage protocol 17 Cache Cache Censier Feautrier Assume k processors With each cache block in memory k presence bits and 1 dirty bit Interconnection Network Memory presence bits Directory With each cache block in cache 1 valid bit and 1 dirty owner bit 19 With each cache
View Full Document
Unlocking...