Cache Coherence Part 1 CS 740 November 1 2002 Topics The Cache Coherence Problem Snoopy Protocols Slides from Todd Mowry Culler Singh The Cache Coherence Problem P2 P1 u P3 u 4 u 7 5 u 5 u 5 1 I O devices 2 u 5 Memory 2 CS 740 F 02 Page 1 3 A Coherent Memory System Intuition Reading a location should return latest value written by any process Easy in uniprocessors Except for I O coherence between I O devices and processors But infrequent so software solutions work uncacheable operations flush pages pass I O data through caches Would like same to hold when processes run on different processors E g as if the processes were interleaved on a uniprocessor The coherence problem is more pervasive and performance critical in multiprocessors has a much larger impact on hardware design 3 CS 740 F 02 Problems with the Intuition Recall Value returned by read should be last value written But last is not well defined Even in sequential case last is defined in terms of program order not time Order of operations in the machine language presented to processor Subsequent defined in analogous way and well defined In parallel case program order defined within a process but need to make sense of orders across processes Must define a meaningful semantics the answer involves both cache coherence and an appropriate memory consistency model to be discussed in a later lecture 4 CS 740 F 02 Page 2 Formal Definition of Coherence Results of a program values returned by its read operations A memory system is coherent if the results of any execution of a program are such that for each location it is possible to construct a hypothetical serial order of all operations to the location that is consistent with the results of the execution and in which 1 operations issued by any particular process occur in the order issued by that process and 2 the value returned by a read is the value written by the last write to that location in the serial order Two necessary features Write propagation value written must become visible to others Write serialization writes to location seen in same order by all if I see w1 after w2 you should not see w2 before w1 no need for analogous read serialization since reads not visible to others 5 CS 740 F 02 Cache Coherence Solutions Software Based often used in clusters of workstations or PCs e g Treadmarks extend virtual memory system to perform more work on page faults send messages to remote machines if necessary Hardware Based two most common variations snoopy schemes rely on broadcast to observe all coherence traffic well suited for buses and small scale systems example SGI Challenge directory schemes uses centralized information to avoid broadcast scales well to large numbers of processors example SGI Origin 2000 6 CS 740 F 02 Page 3 Shared Caches Processors share a single cache essentially punting the problem Useful for very small machines E g DPC in the Encore Alliant FX 8 Problems are limited cache bandwidth and cache interference Benefits are fine grain sharing and prefetch effects P P P P P Crossbar Shd Cache 2 4 way interleaved cache Memory Memory 7 CS 740 F 02 Snoopy Cache Coherence Schemes Basic Idea all coherence related activity is broadcast to all processors e g on a global bus each processor or its representative monitors aka snoops these actions and reacts to any which are relevant to the current contents of its cache examples if another processor wishes to write to a line you may need to invalidate I e discard the copy in your own cache if another processor wishes to read a line for which you have a dirty copy you may need to supply Most common approach in commercial multiprocessors Examples SGI Challenge SUN Enterprise multiprocessor PCs etc 8 CS 740 F 02 Page 4 Implementing a Snoopy Protocol Cache controller now receives inputs from both sides Requests from processor bus requests responses from snooper In either case takes zero or more actions Updates state responds with data generates new bus transactions Protocol is a distributed algorithm cooperating state machines Set of states state transition diagram actions Granularity of coherence is typically a cache block Like that of allocation in cache and transfer to from cache 9 CS 740 F 02 Coherence with Write through Caches Pn P1 Bus snoop I O devices Mem Cache memory transaction Key extensions to uniprocessor snooping invalidating updating caches no new states or bus transactions in this case invalidation versus update based protocols Write propagation even in inval case later reads will see new value inval causes miss on later access and memory up to date via write through 10 CS 740 F 02 Page 5 Write through State Transition Diagram PrWr BusWr PrRd PrRD PrWr BusWr BusRd V BusWr PrRd BusRd Obs gen I Processor initiated transactions PrWr BusWr Bus snooper initiated transactions Two states per block in each cache as in uniprocessor state of a block can be seen as p vector Hardware state bits associated with only blocks that are in the cache other blocks can be seen as being in invalid not present state in that cache Write will invalidate all other caches no local change of state can have multiple simultaneous readers of block but write invalidates them 11 CS 740 F 02 Write through State Transition Diagram PrRd PrWr BusWr V BusW r PrRd BusRd I Processor initiated transactions PrWr BusWr Bus snooper initiated transactions Two states per block in each cache as in uniprocessor state of a block can be seen as p vector Hardware state bits associated with only blocks that are in the cache other blocks can be seen as being in invalid not present state in that cache Write will invalidate all other caches no local change of state can have multiple simultaneous readers of block but write invalidates them 12 CS 740 F 02 Page 6 Is WT Coherent Assume A bus transaction completes before next one starts Atomic bus transactions Atomic memory transactions Write propagation Write Serialization Key is the bus writes serialized by bus 13 CS 740 F 02 Problem with Write Through High bandwidth requirements Every write from every processor goes to shared bus and memory Consider a 500MHz 1CPI processor where 15 of instructions are 8 byte stores Each processor generates 75M stores or 600MB data per second 1GB s bus can support only 1 processor without saturating Write through especially unpopular for SMPs Write back caches absorb most writes as cache hits Write hits don t go on bus But now how do we ensure write propagation and serialization
View Full Document
Unlocking...