Hierarchical Cache Coherence Cache Coherence for Large Scale Machines Todd C Mowry CS 740 September 27 2007 P P P C1 C1 C1 C2 C2 a b Hierarchies arise in different ways a A processor with an on chip and external cache single cache hierarchy b Large scale multiprocessor using a hierarchy of buses multicache hierarchy Topics Hierarchies Directory Protocols 2 Single Cache Hierarchies CS 740 F 07 Hierarchical Snoopy Cache Coherence Simplest way to build large scale cache coherent MPs is to use a hierarchy of buses and use snoopy coherence at each level Two possible ways to build such a machine P C1 C2 a All main memory at the global B2 bus b Main memory distributed among the clusters Inclusion property Everything in L1 cache is also present in L2 cache L2 must also be owner of block if L1 has the block dirty Snoop of L2 takes responsibility for recalling or invalidating data due to remote requests It often helps if the block size in L1 is smaller or the same size as that in L2 cache P P P P P P P P L1 L1 L1 L1 L1 L1 L1 L1 B1 L2 B2 B1 B1 B1 L2 Memory L2 L2 B2 Main Memory Mp a 3 CS 740 F 07 4 Page 1 b CS 740 F 07 Memory Hierarchies with Global Memory P P P P L1 L1 L1 L1 B1 Hierarchies w Global Mem Cont Advantages Misses to main memory just require single traversal to the root of the hierarchy Placement of shared data is not an issue B1 L2 L2 Disadvantages B2 Main Memory Mp Misses to local data structures e g stack also have to traverse the hierarchy resulting in higher traffic and latency Memory at the global bus must be highly interleaved Otherwise bandwidth to it will not scale First level caches Highest performance SRAM caches B1 follows standard snoopy protocol Second level caches Much larger than L1 caches set assoc Must maintain inclusion L2 cache acts as filter for B1 bus and L1 caches L2 cache can be DRAM based since fewer references get to it 5 CS 740 F 07 6 CS 740 F 07 Cluster Based Hierarchies Encore Gigamax Motorola 88K processors P L1 P P L1 L1 L2 L2 C C 8 way interleaved memory Local Nano Bus Tag and Data RAMS for remote data cached locally Two 16MB banks 4 way associative Memory UCC UIC P P C C 64 bit data 32 bit address split transaction 80ns cycles Tag RAM only for local data cached remotely UCC Fiber optic link B2 Key idea Main memory is distributed among clusters Bit serial 4 bytes every 80ns reduces global bus traffic local data suitably placed shared data reduces latency less contention and local accesses are faster example machine Encore Gigamax Tag RAM only for remote data cached locally UIC UIC Global Nano Bus L2 cache can be replaced by a tag only routercoherence switch 7 P L1 B1 B1 Memory P P CS 740 F 07 8 Page 2 64 bit data 32 bit address split transaction 80ns cycles CS 740 F 07 UIC Cache Coherence in Gigamax Hierarchies of Rings e g KSR Write to local bus is passed to global bus if data allocated in remote Mp Hierarchical ring network not bus allocated local but present in some remote cache Read to local bus passed to global bus if Snoop on requests passing by on ring allocated in remote Mp and not in cluster cache allocated local but dirty in a remote cache Point to point structure of ring implies Write on global bus passed to local bus if potentially higher bandwidth than buses allocated in to local Mp higher latency allocated remote but dirty in local cache Many race conditions possible e g write back going out as request coming in 9 CS 740 F 07 10 CS 740 F 07 Hierarchies Summary Advantages Conceptually simple to build apply snooping recursively Can get merging and combining of requests in hardware Directory Based Cache Coherence Disadvantages Physical hierarchies do not provide enough bisection bandwidth the root becomes a bottleneck e g 2 d 3 d grid problems patch solution multiple buses rings at higher levels Latencies often larger than in direct networks 11 CS 740 F 07 Page 3 Motivation for Directory Schemes Basic Scheme Censier Feautrier Snoopy schemes do not scale because they rely on broadcast P P Cache Cache Assume k processors With each cache block in memory k presence bits and 1 dirty bit Interconnection Network Directory based schemes allow scaling 13 presence bits Requestor P A A 3 Read req to owner M D 3b Inval req to sharer 3a Inval req to sharer M D 4b Revision message to directory M D Sharer A traffic of network transactions each time protocol is invoked latency of network transactions in critical path each time Scaling of directory storage requirements P C A Node with dirty copy 4b Inval ack P M D Scaling of performance characteristics M D Directory node 4a Inval ack C A A C A C C P 4a Data Reply P P 2 Reply with sharers identity 2 Reply with owner identity Centralized directory is bandwidth bottleneck just like centralized memory How to maintain directory information in distributed way RdEx request to directory C Directory node for block M D CS 740 F 07 Scaling of memory and directory bandwidth provided 1 P Read request to directory C dirty bit With each cache block in cache 1valid bit and 1 dirty owner bit Scaling with Number of Processors b Write miss to a block with two sharers 1 Directory 14 Directory Protocol Examples Requestor Read from main memory by PE i If dirty bit is OFF then read from main memory turn p i ON if dirty bit is ON then recall line from dirty PE cache state to shared update memory turn dirty bit OFF turn p i ON supply recalled data to PE i Write to main memory If dirty bit OFF then supply data to PE i send invalidations to all PEs caching that block turn dirty bit ON turn P i ON CS 740 F 07 a Read miss to a block in dirty state Memory they avoid broadcasts by keeping track of all PEs caching a memory block and then using point to point messages to maintain coherence they allow the flexibility to use any scalable point to point network Number of presence bits needed grows as the number of processors How directory is organized affects all these performance at a target scale as well as coherence management issues M D Sharer Many alternative for organizing directory information 15 CS 740 F 07 16 Page 4 CS 740 F 07 Insights into Directories Cache Invalidation Patterns LU Invalidation Patterns 100 91 22 90 80 Inherent program characteristics 70 60 determine whether directories provide big advantages over broadcast provide insights into how to organize and store directory information 50 40 30 20 0 0 0 5 6 0 0 7 0 8 to 11 0 4 8 75 3 10 2 …
View Full Document
Unlocking...