DOC PREVIEW
CMU CS 15740 - lect12.4up

This preview shows page 1-2-3-4 out of 13 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 13 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 13 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 13 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 13 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 13 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

Cache Coherence: Part 1CS 740October 27, 2003Topics• The Cache Coherence Problem• Snoopy ProtocolsSlides from Todd Mowry/Culler&SinghCS 740 F’03–2–The Cache Coherence ProblemI/O devicesMemoryP1$ $$P2P312345u = ?u = ?u:5u:5u:5u = 7CS 740 F’03–3–A Coherent Memory System: IntuitionReading a location should return latest value written (by any process)Easy in uniprocessors• Except for I/O: coherence between I/O devices and processors• But infrequent so software solutions work– uncacheable operations, flush pages, pass I/O data through caches Would like same to hold when processes run on different processors • E.g. as if the processes were interleaved on a uniprocessorThe coherence problem is more pervasive and performance-critical in multiprocessors• has a much larger impact on hardware designCS 740 F’03–4–Problems with the IntuitionRecall:• Value returned by read should be last value written But “last” is not well-defined!Even in sequential case:• “last” is defined in terms of program order, not time– Order of operations in the machine language presented to processor– “Subsequent” defined in analogous way, and well definedIn parallel case:• program order defined within a process, but need to make sense oforders across processesMust define a meaningful semantics• the answer involves both “cache coherence” and an appropriate “memory consistency model” (to be discussed in a later lecture)CS 740 F’03–5–Formal Definition of CoherenceResults of a program: values returned by its read operationsA memory system is coherentif the results of any execution of a program are such that for each location, it is possible to construct a hypothetical serial order of all operations to the location that is consistent with the results of the execution and in which:1. operations issued by any particular process occur in the order issued by that process, and2. the value returned by a read is the value written by the last write to that location in the serial orderTwo necessary features:•Write propagation: value written must become visible to others •Write serialization: writes to location seen in same order by all– if I see w1 after w2, you should not see w2 before w1– no need for analogous read serialization since reads not visible to othersCS 740 F’03–6–Cache Coherence SolutionsSoftware Based:• often used in clusters of workstations or PCs (e.g., “Treadmarks”)• extend virtual memory system to perform more work on page faults– send messages to remote machines if necessaryHardware Based:• two most common variations:– “snoopy” schemes» rely on broadcast to observe all coherence traffic» well suited for buses and small-scale systems» example: SGI Challenge– directory schemes» uses centralized information to avoid broadcast» scales well to large numbers of processors» example: SGI Origin 2000CS 740 F’03–7–Shared Caches• Processors share a single cache, essentially punting the problem.• Useful for very small machines. • E.g., DPC in the Encore, Alliant FX/8.• Problems are limited cache bandwidth and cache interference• Benefits are fine-grain sharing and prefetch effectsPPShd. CacheMemoryPMemoryPPCrossbar2-4 way interleaved cacheCS 740 F’03–8–Snoopy Cache Coherence SchemesBasic Idea:• all coherence-related activity is broadcast to all processors– e.g., on a global bus• each processor (or its representative) monitors (aka “snoops”) these actions and reacts to any which are relevant to the current contents of its cache–examples:» if another processor wishes to write to a line, you may need to “invalidate” (I.e. discard) the copy in your own cache» if another processor wishes to read a line for which you have a dirty copy, you may need to supplyMost common approach in commercial multiprocessors.Examples:• SGI Challenge, SUN Enterprise, multiprocessor PCs, etc.CS 740 F’03–9–Implementing a Snoopy ProtocolCache controller now receives inputs from both sides: • Requests from processor, bus requests/responses from snooperIn either case, takes zero or more actions• Updates state, responds with data, generates new bus transactionsProtocol is a distributed algorithm: cooperating state machines• Set of states, state transition diagram, actions Granularity of coherence is typically a cache block• Like that of allocation in cache and transfer to/from cacheCS 740 F’03–10–Coherence with Write-through Caches• Key extensions to uniprocessor: snooping, invalidating/updating caches– no new states or bus transactions in this case– invalidation- versus update-based protocols• Write propagation: even in inval case, later reads will see new value– inval causes miss on later access, and memory up-to-date via write-throughI/O devicesMemP1$Bus snoop$PnCache-memorytransactionCS 740 F’03–11–Write-through State Transition Diagram• Two states per block in each cache, as in uniprocessor– state of a block can be seen as p-vector• Hardware state bits associated with only blocks that are in the cache– other blocks can be seen as being in invalid (not-present) state in that cache• Write will invalidate all other caches (no local change of state)– can have multiple simultaneous readers of block,but write invalidates themIVPrRd/BusRdPrRd/—PrWr/BusWrBusWr/—Processor-initiated transactionsBus-snooper-initiated transactionsPrWr/BusWrPrRDPrWrBusWrBusRdObs/genCS 740 F’03–12–Write-through State Transition Diagram• Two states per block in each cache, as in uniprocessor– state of a block can be seen as p-vector• Hardware state bits associated with only blocks that are in the cache– other blocks can be seen as being in invalid (not-present) state in that cache• Write will invalidate all other caches (no local change of state)– can have multiple simultaneous readers of block,but write invalidates themIVPrRd/BusRdPrRd/—PrWr/BusWrBusWr/—Processor-initiated transactionsBus-snooper-initiated transactionsPrWr/BusWrCS 740 F’03–13–Is WT Coherent?• Assume:• A bus transaction completes before next one starts• Atomic bus transactions• Atomic memory transactions• Write propagation?• Write Serialization?• Key is the bus: writes serialized by busCS 740 F’03–14–Problem with Write-ThroughHigh bandwidth requirements• Every write from every processor goes to shared bus and memory• Consider a 500MHz, 1CPI processor,


View Full Document

CMU CS 15740 - lect12.4up

Documents in this Course
leecture

leecture

17 pages

Lecture

Lecture

9 pages

Lecture

Lecture

36 pages

Lecture

Lecture

9 pages

Lecture

Lecture

13 pages

lecture

lecture

25 pages

lect17

lect17

7 pages

Lecture

Lecture

65 pages

Lecture

Lecture

28 pages

lect07

lect07

24 pages

lect07

lect07

12 pages

lect03

lect03

3 pages

lecture

lecture

11 pages

lecture

lecture

20 pages

lecture

lecture

11 pages

Lecture

Lecture

9 pages

Lecture

Lecture

10 pages

Lecture

Lecture

22 pages

Lecture

Lecture

28 pages

Lecture

Lecture

18 pages

lecture

lecture

63 pages

lecture

lecture

13 pages

Lecture

Lecture

36 pages

Lecture

Lecture

18 pages

Lecture

Lecture

17 pages

Lecture

Lecture

12 pages

lecture

lecture

34 pages

lecture

lecture

47 pages

lecture

lecture

7 pages

Lecture

Lecture

18 pages

Lecture

Lecture

7 pages

Lecture

Lecture

21 pages

Lecture

Lecture

10 pages

Lecture

Lecture

39 pages

Lecture

Lecture

11 pages

lect04

lect04

40 pages

Load more
Download lect12.4up
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view lect12.4up and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view lect12.4up 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?