Cache Coherence: Part 1CS 740October 27, 2003Topics• The Cache Coherence Problem• Snoopy ProtocolsSlides from Todd Mowry/Culler&SinghCS 740 F’03–2–The Cache Coherence ProblemI/O devicesMemoryP1$ $$P2P312345u = ?u = ?u:5u:5u:5u = 7CS 740 F’03–3–A Coherent Memory System: IntuitionReading a location should return latest value written (by any process)Easy in uniprocessors• Except for I/O: coherence between I/O devices and processors• But infrequent so software solutions work– uncacheable operations, flush pages, pass I/O data through caches Would like same to hold when processes run on different processors • E.g. as if the processes were interleaved on a uniprocessorThe coherence problem is more pervasive and performance-critical in multiprocessors• has a much larger impact on hardware designCS 740 F’03–4–Problems with the IntuitionRecall:• Value returned by read should be last value written But “last” is not well-defined!Even in sequential case:• “last” is defined in terms of program order, not time– Order of operations in the machine language presented to processor– “Subsequent” defined in analogous way, and well definedIn parallel case:• program order defined within a process, but need to make sense oforders across processesMust define a meaningful semantics• the answer involves both “cache coherence” and an appropriate “memory consistency model” (to be discussed in a later lecture)CS 740 F’03–5–Formal Definition of CoherenceResults of a program: values returned by its read operationsA memory system is coherentif the results of any execution of a program are such that for each location, it is possible to construct a hypothetical serial order of all operations to the location that is consistent with the results of the execution and in which:1. operations issued by any particular process occur in the order issued by that process, and2. the value returned by a read is the value written by the last write to that location in the serial orderTwo necessary features:•Write propagation: value written must become visible to others •Write serialization: writes to location seen in same order by all– if I see w1 after w2, you should not see w2 before w1– no need for analogous read serialization since reads not visible to othersCS 740 F’03–6–Cache Coherence SolutionsSoftware Based:• often used in clusters of workstations or PCs (e.g., “Treadmarks”)• extend virtual memory system to perform more work on page faults– send messages to remote machines if necessaryHardware Based:• two most common variations:– “snoopy” schemes» rely on broadcast to observe all coherence traffic» well suited for buses and small-scale systems» example: SGI Challenge– directory schemes» uses centralized information to avoid broadcast» scales well to large numbers of processors» example: SGI Origin 2000CS 740 F’03–7–Shared Caches• Processors share a single cache, essentially punting the problem.• Useful for very small machines. • E.g., DPC in the Encore, Alliant FX/8.• Problems are limited cache bandwidth and cache interference• Benefits are fine-grain sharing and prefetch effectsPPShd. CacheMemoryPMemoryPPCrossbar2-4 way interleaved cacheCS 740 F’03–8–Snoopy Cache Coherence SchemesBasic Idea:• all coherence-related activity is broadcast to all processors– e.g., on a global bus• each processor (or its representative) monitors (aka “snoops”) these actions and reacts to any which are relevant to the current contents of its cache–examples:» if another processor wishes to write to a line, you may need to “invalidate” (I.e. discard) the copy in your own cache» if another processor wishes to read a line for which you have a dirty copy, you may need to supplyMost common approach in commercial multiprocessors.Examples:• SGI Challenge, SUN Enterprise, multiprocessor PCs, etc.CS 740 F’03–9–Implementing a Snoopy ProtocolCache controller now receives inputs from both sides: • Requests from processor, bus requests/responses from snooperIn either case, takes zero or more actions• Updates state, responds with data, generates new bus transactionsProtocol is a distributed algorithm: cooperating state machines• Set of states, state transition diagram, actions Granularity of coherence is typically a cache block• Like that of allocation in cache and transfer to/from cacheCS 740 F’03–10–Coherence with Write-through Caches• Key extensions to uniprocessor: snooping, invalidating/updating caches– no new states or bus transactions in this case– invalidation- versus update-based protocols• Write propagation: even in inval case, later reads will see new value– inval causes miss on later access, and memory up-to-date via write-throughI/O devicesMemP1$Bus snoop$PnCache-memorytransactionCS 740 F’03–11–Write-through State Transition Diagram• Two states per block in each cache, as in uniprocessor– state of a block can be seen as p-vector• Hardware state bits associated with only blocks that are in the cache– other blocks can be seen as being in invalid (not-present) state in that cache• Write will invalidate all other caches (no local change of state)– can have multiple simultaneous readers of block,but write invalidates themIVPrRd/BusRdPrRd/—PrWr/BusWrBusWr/—Processor-initiated transactionsBus-snooper-initiated transactionsPrWr/BusWrPrRDPrWrBusWrBusRdObs/genCS 740 F’03–12–Write-through State Transition Diagram• Two states per block in each cache, as in uniprocessor– state of a block can be seen as p-vector• Hardware state bits associated with only blocks that are in the cache– other blocks can be seen as being in invalid (not-present) state in that cache• Write will invalidate all other caches (no local change of state)– can have multiple simultaneous readers of block,but write invalidates themIVPrRd/BusRdPrRd/—PrWr/BusWrBusWr/—Processor-initiated transactionsBus-snooper-initiated transactionsPrWr/BusWrCS 740 F’03–13–Is WT Coherent?• Assume:• A bus transaction completes before next one starts• Atomic bus transactions• Atomic memory transactions• Write propagation?• Write Serialization?• Key is the bus: writes serialized by busCS 740 F’03–14–Problem with Write-ThroughHigh bandwidth requirements• Every write from every processor goes to shared bus and memory• Consider a 500MHz, 1CPI processor,
View Full Document