CMU CS 15740 - Lecture

Cache Coherence for Large-Scale Machines
Todd C. Mowry
CS 740, September 27, 2007

Topics:
• Hierarchies
• Directory Protocols

Hierarchical Cache Coherence

• Hierarchies arise in different ways:
(a) A processor with an on-chip and an external cache (single-cache hierarchy)
(b) A large-scale multiprocessor using a hierarchy of buses (multi-cache hierarchy)

Single Cache Hierarchies

• Inclusion property: everything in the L1 cache is also present in the L2 cache.
• The L2 must also be the owner of a block if the L1 has the block dirty.
• A snoop hit in the L2 takes responsibility for recalling or invalidating data in response to remote requests.
• It often helps if the block size in the L1 is smaller than or the same as that in the L2 cache.

Hierarchical Snoopy Cache Coherence

• The simplest way to build large-scale cache-coherent multiprocessors is to use a hierarchy of buses and apply snoopy coherence at each level.
• Two possible ways to build such a machine:
(a) All main memory (Mp) at the global (B2) bus
(b) Main memory distributed among the clusters

Hierarchies with Global Memory

• First-level caches:
– Highest-performance SRAM caches.
– The B1 bus follows a standard snoopy protocol.
• Second-level caches:
– Much larger than the L1 caches (set-associative); must maintain inclusion.
– The L2 cache acts as a filter for the B1 bus and the L1 caches.
– The L2 cache can be DRAM-based, since fewer references reach it.

Hierarchies with Global Memory (cont.)

Advantages:
• Misses to main memory require only a single traversal to the root of the hierarchy.
• Placement of shared data is not an issue.
Disadvantages:
• Misses to local data structures (e.g., the stack) also have to traverse the hierarchy, resulting in higher traffic and latency.
• Memory at the global bus must be highly interleaved; otherwise, bandwidth to it will not scale.

Cluster-Based Hierarchies

Key idea: main memory is distributed among the clusters.
• Reduces global-bus traffic (for local data and suitably placed shared data).
• Reduces latency (less contention, and local accesses are faster).
• Example machine: Encore Gigamax.
• The L2 cache can be replaced by a tag-only router-coherence switch.

Encore Gigamax

[Figure: Gigamax cluster organization]
• Motorola 88K processors; 8-way interleaved memory per cluster.
• Local Nano Bus and Global Nano Bus: 64-bit data, 32-bit address, split-transaction, 80 ns cycles.
• Clusters connected by a fiber-optic link (bit-serial, 4 bytes every 80 ns).
• Tag and data RAMs for remote data cached locally (two 16 MB banks, 4-way associative).
• Tag RAM only for remote data cached locally.
• Tag RAM only for local data cached remotely.

Cache Coherence in Gigamax

• A write on the local bus is passed to the global bus if:
– the data is allocated in a remote Mp, or
– it is allocated locally but present in some remote cache.
• A read on the local bus is passed to the global bus if:
– it is allocated in a remote Mp and not in the cluster cache, or
– it is allocated locally but dirty in a remote cache.
• A write on the global bus is passed to a local bus if:
– it is allocated in the local Mp, or
– it is allocated remotely but dirty in a local cache.
• ...
• Many race conditions are possible (e.g., a write-back going out as a request is coming in).

Hierarchies of Rings (e.g., KSR)

• Hierarchical ring network, not a bus.
• Snoop on requests passing by on the ring.
• The point-to-point structure of a ring implies:
– potentially higher bandwidth than buses
– higher latency

Hierarchies: Summary

Advantages:
• Conceptually simple to build (apply snooping recursively).
• Can get merging and combining of requests in hardware.
Disadvantages:
• Physical hierarchies do not provide enough bisection bandwidth (the root becomes a bottleneck, e.g., for 2-D and 3-D grid problems).
– Patch solution: multiple buses/rings at the higher levels.
• Latencies are often larger than in direct networks.

Directory-Based Cache Coherence

Motivation for Directory Schemes

Snoopy schemes do not scale because they rely on broadcast.
Directory-based schemes allow scaling:
• They avoid broadcasts by keeping track of all PEs caching a memory block, and then using point-to-point messages to maintain coherence.
• They allow the flexibility to use any scalable point-to-point network.

Basic Scheme (Censier & Feautrier)

• Assume k processors.
• With each cache block in memory: k presence bits and 1 dirty bit.
• With each cache block in a cache: 1 valid bit and 1 dirty (owner) bit.
• Read from main memory by PE i:
– If the dirty bit is OFF, then { read from main memory; turn p[i] ON; }
– If the dirty bit is ON, then { recall the line from the dirty PE (cache state goes to shared); update memory; turn the dirty bit OFF; turn p[i] ON; supply the recalled data to PE i; }
• Write to main memory by PE i:
– If the dirty bit is OFF, then { supply data to PE i; send invalidations to all PEs caching that block; turn the dirty bit ON; turn p[i] ON; ... }
– ...

Directory Protocol Examples

(a) Read miss to a block in dirty state:
1. Read request to the directory.
2. Reply with the owner's identity.
3. Read request to the owner.
4a. Data reply to the requestor.
4b. Revision message to the directory.
(b) Write miss to a block with two sharers:
1. RdEx request to the directory.
2. Reply with the sharers' identities.
3a/3b. Invalidation requests to the sharers.
4a/4b. Invalidation acks.
There are many alternatives for organizing directory information.

Scaling with Number of Processors

Scaling of memory and directory bandwidth provided:
• A centralized directory is a bandwidth bottleneck, just like centralized memory.
• How do we maintain directory information in a distributed way?
Scaling of performance characteristics:
• Traffic: the number of network transactions each time the protocol is invoked.
• Latency: the number of network transactions on the critical path each time.
Scaling of directory storage requirements:
• The number of presence bits needed grows with the number of processors.
How the directory is organized affects all of these, as well as performance at a target scale and coherence management issues.

Insights into Directories

Inherent program characteristics:
• determine whether directories provide big …
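The inclusion property from the "Single Cache Hierarchies" slides (the L2 as a superset of the L1, filtering bus snoops and back-invalidating the L1) can be sketched as follows. This is a deliberately simplified model with hypothetical class and method names: no associativity, FIFO replacement, and one block per address.

```python
# Sketch of multilevel inclusion: the L2 is a superset of the L1, so a bus
# snoop only needs to check L2 tags; on an L2 hit (eviction or remote
# invalidation) the L1 copy is back-invalidated to preserve L1 ⊆ L2.
# Hypothetical, simplified model: no associativity, FIFO replacement.

class InclusiveHierarchy:
    def __init__(self, l2_capacity):
        self.l1 = set()               # blocks present in L1
        self.l2 = {}                  # block -> insertion time (FIFO order)
        self.l2_capacity = l2_capacity
        self.clock = 0

    def access(self, block):
        """Processor reference: fill L1 and L2, evicting to keep inclusion."""
        if block not in self.l2:
            if len(self.l2) >= self.l2_capacity:
                victim = min(self.l2, key=self.l2.get)  # oldest block
                del self.l2[victim]
                self.l1.discard(victim)  # back-invalidate: keep L1 ⊆ L2
            self.l2[block] = self.clock
            self.clock += 1
        self.l1.add(block)

    def snoop_invalidate(self, block):
        """Remote write seen on the bus: L2 filters; L1 touched only on L2 hit."""
        if block in self.l2:          # inclusion guarantees L1 misses too
            del self.l2[block]
            self.l1.discard(block)

h = InclusiveHierarchy(l2_capacity=2)
h.access(0xA); h.access(0xB)
h.access(0xC)                         # L2 full: evicts 0xA, back-invalidates L1
assert 0xA not in h.l1 and h.l1 <= set(h.l2)
h.snoop_invalidate(0xB)
assert h.l1 <= set(h.l2)              # inclusion still holds after the snoop
```

The point of the sketch is the filtering benefit the slides describe: `snoop_invalidate` never has to probe the L1 on an L2 miss, because inclusion guarantees the L1 cannot hold a block the L2 lacks.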
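The Gigamax filtering rules above are essentially a predicate over where a block is allocated and cached. A minimal sketch, with hypothetical parameter names (not taken from the machine), for the local-to-global direction:

```python
# Sketch of the Gigamax rules for passing a local-bus transaction up to the
# global bus. All parameter names are hypothetical; each flag summarizes
# state the cluster's tag RAMs would provide.

def local_to_global(op, allocated_remote, in_cluster_cache,
                    present_in_remote_cache, dirty_in_remote_cache):
    """Return True if a local-bus read/write must also go to the global bus."""
    if op == "write":
        # data allocated in a remote Mp, or allocated locally but
        # present in some remote cache
        return allocated_remote or present_in_remote_cache
    if op == "read":
        # allocated in a remote Mp and missing in the cluster cache, or
        # allocated locally but dirty in a remote cache
        return (allocated_remote and not in_cluster_cache) \
            or dirty_in_remote_cache
    raise ValueError(op)

# A purely local access stays on the local bus; a read of remote data that
# hits in the cluster cache is also filtered; a read of local data that is
# dirty remotely must go up.
assert not local_to_global("write", False, False, False, False)
assert not local_to_global("read", True, True, False, False)
assert local_to_global("read", False, False, False, True)
```

Note how the second case captures the traffic reduction the slides claim for cluster-based hierarchies: the cluster cache absorbs reads of remote data, so only true misses and remote-dirty cases cross the global bus.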
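The Censier & Feautrier read and write actions above can be sketched directly in code. This is a toy model under stated assumptions: a single memory block, atomic directory operations (none of the race conditions a real protocol must handle), and messages recorded in a log list; the class and field names are mine, not from the paper.

```python
# Sketch of the Censier & Feautrier directory entry described above:
# per memory block, k presence bits plus one dirty bit at the home node.
# Simplified: one block, atomic transitions, messages appended to a log.

class Directory:
    def __init__(self, k):
        self.presence = [False] * k   # p[i]: PE i holds a copy
        self.dirty = False            # one PE holds the only (modified) copy

    def owner(self):
        return self.presence.index(True)   # with dirty set, exactly one bit is on

    def read(self, i, log):
        if self.dirty:                # recall from the owner; it goes to shared
            log.append(("recall", self.owner()))
            self.dirty = False        # memory updated by the owner's write-back
        self.presence[i] = True
        log.append(("data", i))       # supply the (possibly recalled) data

    def write(self, i, log):
        if self.dirty:
            log.append(("recall", self.owner()))
        else:                         # invalidate every current sharer
            for j, p in enumerate(self.presence):
                if p and j != i:
                    log.append(("invalidate", j))
        self.presence = [False] * len(self.presence)
        self.presence[i] = True       # requester becomes the sole owner
        self.dirty = True
        log.append(("data", i))

d = Directory(k=4)
log = []
d.read(0, log); d.read(1, log)        # two sharers, dirty bit off
d.write(2, log)                       # invalidates PEs 0 and 1
assert ("invalidate", 0) in log and ("invalidate", 1) in log
assert d.dirty and d.presence == [False, False, True, False]
d.read(3, log)                        # recall from dirty PE 2, back to shared
assert ("recall", 2) in log and not d.dirty
```

The log entries correspond to the point-to-point messages in the "Directory Protocol Examples" figure: the write miss generates one invalidation per sharer rather than a broadcast, which is exactly the scaling argument the slides make for directories over snooping.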

