Cache Coherence: Part II
Scalable Approaches

CS 740, November 3, 2003

Topics:
• Hierarchies
• Directory Protocols

Slides based on Todd C. Mowry & Culler/Singh


Hierarchical Cache Coherence

• Hierarchies arise in different ways:
(a) A processor with an on-chip and an external cache (single-cache hierarchy)
(b) A large-scale multiprocessor using a hierarchy of buses (multi-cache hierarchy)

[Figure: (a) a processor P with caches C1 and C2; (b) several P/C1 pairs sharing a C2]


Single Cache Hierarchies

• Inclusion property: everything in the L1 cache is also present in the L2 cache.
• L2 must also be owner of the block if L1 has the block dirty
• Snooping at L2 takes responsibility for recalling or invalidating data due to remote requests


Violations of Inclusion

The two caches (L1, L2) may choose to replace different blocks
• Differences in reference history
  – set-associative first-level cache with LRU replacement
  – example: blocks m1, m2, m3 fall in the same set of the L1 cache ...
• Split higher-level caches
  – instruction and data blocks go in different caches at L1, but may collide in L2
  – what if L2 is set-associative?
• Differences in block size
But a common case works automatically
• L1 direct-mapped, fewer sets than in L2, and the same block size


Preserving Inclusion Explicitly

Propagate lower-level (L2) replacements to the higher level (L1)
• Invalidate or flush (if dirty) messages
Propagate bus transactions from L2 to L1
• Propagate all transactions, or use inclusion bits
Propagate modified state from L1 to L2 on writes?
• Write-through L1, or a modified-but-stale bit per block in L2
Correctness issues altered?
• Not really, if all propagation occurs correctly and is waited for
• Writes commit when they reach the bus, acknowledged immediately
• But there are performance problems, so we want to avoid waiting for propagation


Hierarchical Snoopy Cache Coherence

• The simplest way to build large-scale cache-coherent MPs is to use a hierarchy of buses and apply snoopy coherence at each level.
• Two possible ways to build such a machine:
(a) All main memory at the global (B2) bus
(b) Main memory distributed among the clusters

[Figure: (a) clusters of processors with L1 caches on a B1 bus behind a shared L2, joined by a global B2 bus holding all main memory (Mp); (b) the same clusters, but with memory attached to each cluster's B1 bus]


Hierarchies with Global Memory

• First-level caches:
  • Highest-performance SRAM caches.
  • B1 follows the standard snoopy protocol
• Second-level caches:
  • Much larger than the L1 caches (set-associative). Must maintain inclusion.
  • The L2 cache acts as a filter for the B1 bus and the L1 caches (see the sketch below).
  • The L2 cache can be DRAM-based, since fewer references get to it.


Hierarchies w/ Global Mem (Cont)

Advantages:
• Misses to main memory require just a single traversal to the root of the hierarchy.
• Placement of shared data is not an issue.
Disadvantages:
• Misses to local data structures (e.g., the stack) also have to traverse the hierarchy, resulting in higher traffic and latency.
• Memory at the global bus must be highly interleaved; otherwise bandwidth to it will not scale.
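
To make the "L2 acts as a filter" idea concrete, here is a minimal C sketch of how a cluster-level L2 might decide whether a remote invalidation seen on the global (B2) bus needs to be forwarded down to the B1 bus and the L1 caches. The structures, state names, and the in_l1 inclusion bit are illustrative assumptions, not the design of any particular machine from the slides.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical sketch: the cluster L2 snoops the global (B2) bus and
 * forwards a remote invalidation down to the local (B1) bus and the L1
 * caches only when its own tags say the block may be cached below it. */

enum l2_state { L2_INVALID, L2_SHARED, L2_MODIFIED };

struct l2_line {
    uint64_t tag;
    enum l2_state state;
    bool in_l1;   /* inclusion bit: block may also be present in some L1 */
};

/* Handle a remote invalidation for 'tag' seen on B2.
 * Returns true if the invalidation must be propagated onto B1. */
static bool filter_remote_invalidate(struct l2_line *line, uint64_t tag)
{
    if (line->tag != tag || line->state == L2_INVALID)
        return false;              /* not cached in this cluster: absorb it */

    line->state = L2_INVALID;      /* invalidate the L2 copy */
    return line->in_l1;            /* inclusion bit set => L1 must see it too */
}

int main(void)
{
    struct l2_line line = { .tag = 0x42, .state = L2_SHARED, .in_l1 = true };
    printf("forward to B1? %s\n",
           filter_remote_invalidate(&line, 0x42) ? "yes" : "no");
    return 0;
}
```

With inclusion maintained, a clear inclusion bit lets the L2 absorb the snoop entirely, which is exactly what keeps global-bus traffic off the local bus.
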

Cluster-Based Hierarchies

Key idea: main memory is distributed among the clusters.
• reduces global bus traffic (local data & suitably placed shared data)
• reduces latency (less contention, and local accesses are faster)
• example machine: Encore Gigamax
• the L2 cache can be replaced by a tag-only router-coherence switch


Encore Gigamax

[Figure: Encore Gigamax organization. Motorola 88K processors and 8-way interleaved memory on each Local Nano Bus (64-bit data, 32-bit address, split-transaction, 80 ns cycles); per cluster, tag and data RAMs (two 16 MB banks, 4-way associative) for remote data cached locally, plus tag-only RAM for local data cached remotely; clusters joined by a Global Nano Bus via bit-serial fiber-optic links (4 bytes every 80 ns).]


Cache Coherence in Gigamax

• A write on the local bus is passed to the global bus if:
  • the data is allocated in a remote Mp
  • it is allocated locally but present in some remote cache
• A read on the local bus is passed to the global bus if:
  • it is allocated in a remote Mp and not in the cluster cache
  • it is allocated locally but dirty in a remote cache
• A write on the global bus is passed to the local bus if:
  • it is allocated in the local Mp
  • it is allocated remotely but dirty in a local cache
• ...
• Many race conditions are possible (e.g., a write-back going out as a request is coming in)


Hierarchies of Rings (e.g., KSR)

• Hierarchical ring network, not a bus
• Snoop on requests passing by on the ring
• The point-to-point structure of a ring implies:
  • potentially higher bandwidth than buses
  • higher latency


Hierarchies: Summary

Advantages:
• Conceptually simple to build (apply snooping recursively)
• Can get merging and combining of requests in hardware
Disadvantages:
• Physical hierarchies do not provide enough bisection bandwidth (the root becomes a bottleneck, e.g., for 2-d and 3-d grid problems)
  – patch solution: multiple buses/rings at higher levels
• Latencies are often larger than in direct networks


Directory-Based Cache Coherence


Motivation for Directory Schemes

Snoopy schemes do not scale because they rely on broadcast.
Directory-based schemes allow scaling:
• they avoid broadcasts by keeping track of all PEs caching a memory block, and then using point-to-point messages to maintain coherence (see the sketch below)
• they allow the flexibility to use any scalable point-to-point network


What Must a Coherent System Do?

Provide a set of states, a state transition diagram, and actions
Manage the coherence protocol
(0) Determine when to invoke the coherence protocol
(a) Find the source of information about the state of the line in other caches
  – i.e., whether we need to communicate with other cached copies at all
(b) Find out where the other copies are
(c) Communicate with those copies (invalidate/update)
(0) is done the same way on all systems
• the state of the line is maintained in the cache
• the protocol is invoked if an "access fault" occurs on the line
Different approaches are distinguished by how they do (a) to (c)


Bus-Based Coherence

All of (a), (b), (c) are done through broadcast on the bus
• the faulting processor sends out a "search"
• the others respond to the search probe and take the necessary action
Could do it in a scalable network too
• broadcast to all processors, and let them respond
Conceptually simple, but broadcast doesn't scale with p
• on a bus, bus bandwidth doesn't scale
• on a scalable network, every fault leads ...
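
As a concrete illustration of the directory idea above (tracking which processors cache a block and invalidating them point-to-point instead of broadcasting), here is a minimal hypothetical C sketch of a full-bit-vector directory entry handling a write miss. The names, the 64-processor size, and the message stub are assumptions for the sketch, not part of the lecture.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical full-bit-vector directory entry: one presence bit per
 * processor plus a dirty bit.  On a write miss the home node sends
 * point-to-point invalidations only to the current sharers rather than
 * broadcasting.  Message delivery is stubbed out with printf. */

#define NPROCS 64   /* assumed machine size; fits one 64-bit presence word */

struct dir_entry {
    uint64_t sharers;   /* bit p set => processor p may be caching the block */
    bool dirty;         /* set when exactly one processor holds it modified */
};

static void send_invalidate(int proc)
{
    printf("INVAL -> P%d\n", proc);   /* stand-in for a network message */
}

/* Handle a write miss by processor 'writer': invalidate every other
 * sharer, then record the writer as the sole, dirty owner.  Returns the
 * number of acknowledgements to wait for before granting ownership. */
static int handle_write_miss(struct dir_entry *e, int writer)
{
    int invals = 0;
    for (int p = 0; p < NPROCS; p++) {
        if (((e->sharers >> p) & 1) && p != writer) {
            send_invalidate(p);
            invals++;
        }
    }
    e->sharers = 1ULL << writer;
    e->dirty = true;
    return invals;
}

int main(void)
{
    struct dir_entry e = { .sharers = (1ULL << 3) | (1ULL << 7), .dirty = false };
    printf("acks to collect: %d\n", handle_write_miss(&e, 7));
    return 0;
}
```

The point of the sketch is the loop: coherence traffic is proportional to the number of actual sharers, not to p, which is what lets directory schemes run over any scalable point-to-point network.
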

