Cache Coherence for Large-Scale Machines
Todd C. Mowry
CS 740
September 27, 2007

Topics
• Hierarchies
• Directory Protocols

CS 740 F’07

Hierarchical Cache Coherence
• Hierarchies arise in different ways:
  (a) A processor with an on-chip and external cache (single cache hierarchy)
  (b) Large-scale multiprocessor using a hierarchy of buses (multi-cache hierarchy)
[Figure: (a) P, C1, C2; (b) two P/C1 pairs and a C2]

Single Cache Hierarchies
• Inclusion property: Everything in L1 cache is also present in L2 cache.
• L2 must also be owner of block if L1 has the block dirty
• Snoop of L2 takes responsibility for recalling or invalidating data due to remote requests
• It often helps if the block size in L1 is smaller or the same size as that in L2 cache
[Figure: P, C1, C2]

Hierarchical Snoopy Cache Coherence
• Simplest way to build large-scale cache-coherent MPs is to use a hierarchy of buses and use snoopy coherence at each level.
• Two possible ways to build such a machine:
  (a) All main memory at the global (B2) bus
  (b) Main memory distributed among the clusters
[Figure: (a) clusters of processors (P) with L1 caches on a B1 bus and an L2 cache, joined by the B2 bus to Main Memory (Mp); (b) the same clusters, each with its own Memory on B1, joined by B2]

Hierarchies with Global Memory
• First-level caches:
  • Highest performance SRAM caches.
  • B1 follows standard snoopy protocol
• Second-level caches:
  • Much larger than L1 caches (set assoc). Must maintain inclusion.
  • L2 cache acts as filter for B1-bus and L1-caches.
  • L2 cache can be DRAM based, since fewer references get to it.
[Figure: clusters of P, L1, L2 on B1 buses, joined by B2 to Main Memory (Mp)]

Hierarchies w/ Global Mem (Cont)
Advantages:
• Misses to main memory just require single traversal to the root of the hierarchy.
• Placement of shared data is not an issue.
Disadvantages:
• Misses to local data structures (e.g., stack) also have to traverse the hierarchy, resulting in higher traffic and latency.
• Memory at the global bus must be highly interleaved. Otherwise bandwidth to it will not scale.

Cluster Based Hierarchies
Key idea: Main memory is distributed among clusters.
• reduces global bus traffic (local data & suitably placed shared data)
• reduces latency (less contention and local accesses are faster)
• example machine: Encore Gigamax
• L2 cache can be replaced by a tag-only router-coherence switch.
[Figure: clusters of P, L1, L2 and Memory on B1 buses, joined by B2]

Encore Gigamax
[Figure: Encore Gigamax block diagram. Labels: Motorola 88K processors (P) with caches (C); 8-way interleaved memory; Local Nano Bus (64-bit data, 32-bit address, split-transaction, 80ns cycles); Global Nano Bus (64-bit data, 32-bit address, split-transaction, 80ns cycles); UCC and UIC units; fiber-optic link (bit serial, 4 bytes every 80ns); Tag RAM only for remote data cached locally; Tag RAM only for local data cached remotely; Tag and Data RAMs for remote data cached locally (two 16MB banks, 4-way associative)]

Cache Coherence in Gigamax
• Write to local-bus is passed to global-bus if:
  • data allocated in remote Mp
  • allocated local but present in some remote cache
• Read to local-bus passed to global-bus if:
  • allocated in remote Mp, and not in cluster cache
  • allocated local but dirty in a remote cache
• Write on global-bus passed to local-bus if:
  • allocated in local Mp
  • allocated remote, but dirty in local cache
• ...
• Many race conditions possible (e.g., write-back going out as request coming in)

Hierarchies of Rings (e.g. KSR)
• Hierarchical ring network, not bus
• Snoop on requests passing by on ring
• Point-to-point structure of ring implies:
  • potentially higher bandwidth than buses
  • higher latency

Hierarchies: Summary
Advantages:
• Conceptually simple to build (apply snooping recursively)
• Can get merging and combining of requests in hardware
Disadvantages:
• Physical hierarchies do not provide enough bisection bandwidth (the root becomes a bottleneck, e.g., 2-d, 3-d grid problems)
  – patch solution: multiple buses/rings at higher levels
• Latencies often larger than in direct networks

Directory Based Cache Coherence

Motivation for Directory Schemes
Snoopy schemes do not scale because they rely on broadcast
Directory-based schemes allow scaling.
• they avoid broadcasts by keeping track of all PEs caching a memory block, and then using point-to-point messages to maintain coherence
• they allow the flexibility to use any scalable point-to-point network

Basic Scheme (Censier & Feautrier)
• Assume "k" processors.
• With each cache-block in memory: k presence-bits, and 1 dirty-bit
• With each cache-block in cache: 1 valid bit, and 1 dirty (owner) bit
[Figure: processors (P) with caches connected through an Interconnection Network to Memory and a Directory holding presence bits and a dirty bit]
• Read from main memory by PE-i:
  – If dirty-bit is OFF then { read from main memory; turn p[i] ON; }
  – If dirty-bit is ON then { recall line from dirty PE (cache state to shared); update memory; turn dirty-bit OFF; turn p[i] ON; supply recalled data to PE-i; }
• Write to main memory:
  – If dirty-bit OFF then { supply data to PE-i; send invalidations to all PEs caching that block; turn dirty-bit ON; turn p[i] ON; ... }
  – ...

Directory Protocol Examples
[Figure: two message diagrams; each node is labeled P, A, M/D, C]
(a) Read miss to a block in dirty state
Nodes: Requestor; Directory node for block; Node with dirty copy
  1. Read request to directory
  2. Reply with owner identity
  3. Read req. to owner
  4a. Data Reply
  4b. Revision message to directory
(b) Write miss to a block with two sharers
Nodes: Requestor; Directory node; Sharer; Sharer
  1. RdEx request to directory
  2. Reply with sharers identity
  3a. Inval. req. to sharer
  3b. Inval. req. to sharer
  4a. Inval. ack
  4b. Inval. ack
Many alternatives for organizing directory information

Scaling with Number of Processors
Scaling of memory and directory bandwidth provided
• Centralized directory is bandwidth bottleneck, just like centralized memory
• How to maintain directory information in distributed way?
Scaling of performance characteristics
• traffic: # of network transactions each time protocol is invoked
• latency: # of network transactions in critical path each time
Scaling of directory storage requirements
• Number of presence bits needed grows as the number of processors
How directory is organized affects all these, performance at a target scale, as well as coherence management issues

Insights into Directories
Inherent program characteristics:
• determine whether directories provide big
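
The read and write handling on the Basic Scheme (Censier & Feautrier) slide can be sketched in a few lines of Python. This is a minimal illustration, not the protocol as implemented in any real machine: the names `DirectoryEntry`, `read_miss`, `write_miss`, and the `log` list of messages are all hypothetical. The slide elides the write case where the dirty-bit is ON; the handling shown for it here (recall the line from the owner and invalidate the owner's copy) is an assumption made only so the sketch is complete.

```python
class DirectoryEntry:
    """Directory state for one memory block: k presence bits and 1 dirty bit."""

    def __init__(self, k):
        self.presence = [False] * k   # p[i] is ON if PE-i caches the block
        self.dirty = False            # ON if exactly one PE holds a modified copy

    def sharers(self):
        """Return the PEs whose presence bit is ON."""
        return [pe for pe, bit in enumerate(self.presence) if bit]


def read_miss(entry, i, log):
    """PE-i reads a block it does not hold (follows the slide's read rules)."""
    if entry.dirty:
        owner = entry.sharers()[0]        # only one presence bit is ON when dirty
        log.append(("recall", owner))     # owner's cache state goes to shared
        log.append(("update_memory",))
        entry.dirty = False
    entry.presence[i] = True
    log.append(("data", i))               # from memory, or the recalled line


def write_miss(entry, i, log):
    """PE-i writes a block it does not hold."""
    if entry.dirty:
        # ASSUMPTION: this case is elided on the slide. Here the line is
        # recalled from the current owner, whose copy is then invalidated.
        owner = entry.sharers()[0]
        log.append(("recall_and_invalidate", owner))
        entry.presence[owner] = False
        log.append(("data", i))
    else:
        log.append(("data", i))
        for pe in entry.sharers():
            if pe != i:
                log.append(("invalidate", pe))   # point-to-point, no broadcast
                entry.presence[pe] = False
    entry.dirty = True
    entry.presence[i] = True


if __name__ == "__main__":
    entry, log = DirectoryEntry(k=4), []
    read_miss(entry, 0, log)
    read_miss(entry, 1, log)
    write_miss(entry, 2, log)   # invalidates PE-0 and PE-1
    read_miss(entry, 3, log)    # recalls the dirty line from PE-2
    for message in log:
        print(message)
```

Invalidations go only to the PEs whose presence bits are ON, which is the point of keeping a directory instead of broadcasting. The entry also stores k + 1 bits per memory block, which is the storage cost that the Scaling with Number of Processors slide says grows with the number of processors.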