CS 152 Computer Architecture and Engineering
Lecture 20: Snoopy Caches

Krste Asanovic
Electrical Engineering and Computer Sciences
University of California, Berkeley
http://www.eecs.berkeley.edu/~krste
http://inst.cs.berkeley.edu/~cs152
4/22/2008, CS152-Spring’08

Slide titles:
• Recap: Sequential Consistency (A Memory Model)
• Recap: Sequential Consistency
• Recap: Mutual Exclusion and Locks
• Issues in Implementing Sequential Consistency
• Memory Fences (Instructions to sequentialize memory accesses)
• Using Memory Fences
• Memory Consistency in SMPs
• Write-back Caches & SC
• Write-through Caches & SC
• Maintaining Sequential Consistency
• Cache Coherence Protocols for SC
• Warmup: Parallel I/O
• Problems with Parallel I/O
• Snoopy Cache (Goodman 1983)
• Snoopy Cache Actions for DMA
• CS152 Administrivia
• Shared Memory Multiprocessor
• Cache State Transition Diagram (The MSI protocol)
• Two Processor Example (Reading and writing the same cache line)
• Observation
• MESI: An Enhanced MSI protocol (increased performance for private data)
• Optimized Snoop with Level-2 Caches
• Intervention
• False Sharing
• Synchronization and Caches: Performance Issues
• Performance Related to Bus Occupancy
• Load-reserve & Store-conditional
• Performance: Load-reserve & Store-conditional
• Out-of-Order Loads/Stores & CC
• Acknowledgements

Recap: Sequential Consistency
A Memory Model

“A system is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in the order specified by the program”
    Leslie Lamport

Sequential Consistency = arbitrary order-preserving interleaving of memory references of sequential programs

[Figure: several processors (P) connected to a single memory (M)]

Recap: Sequential Consistency

Sequential consistency imposes more memory ordering constraints than those imposed by uniprocessor program dependencies. What are these in our example?

    T1:                         T2:
    Store (X), 1   (X = 1)      Load R1, (Y)
    Store (Y), 11  (Y = 11)     Store (Y’), R1  (Y’ = Y)
                                Load R2, (X)
                                Store (X’), R2  (X’ = X)

[Figure annotations: arrows marking the uniprocessor program dependencies and the additional SC requirements]

Recap: Mutual Exclusion and Locks

Want to guarantee only one process is active in a critical section

• Blocking atomic read-modify-write instructions
  e.g., Test&Set, Fetch&Add, Swap
vs
• Non-blocking atomic read-modify-write instructions
  e.g., Compare&Swap, Load-reserve/Store-conditional
vs
• Protocols based on ordinary Loads and Stores

Issues in Implementing Sequential Consistency

Implementation of SC is complicated by two issues

• Out-of-order execution capability (may the two accesses be reordered?)
    Load(a); Load(b)      yes
    Load(a); Store(b)     yes if a ≠ b
    Store(a); Load(b)     yes if a ≠ b
    Store(a); Store(b)    yes if a ≠ b

• Caches
  Caches can prevent the effect of a store from being seen by other processors

[Figure: several processors (P) connected to a single memory (M)]

SC complications motivate architects to consider weak or relaxed memory models

Memory Fences
Instructions to sequentialize memory accesses

Processors with relaxed or weak memory models (i.e., those that permit Loads and Stores to different addresses to be reordered) need to provide memory fence instructions to force the serialization of memory accesses

Examples of processors with relaxed memory models:
    Sparc V8 (TSO, PSO): Membar
    Sparc V9 (RMO): Membar #LoadLoad, Membar #LoadStore, Membar #StoreLoad, Membar #StoreStore
    PowerPC (WO): Sync, EIEIO

Memory fences are expensive operations; however, one pays the cost of serialization only when it is required

Using Memory Fences

Producer posting Item x:
          Load Rtail, (tail)
          Store (Rtail), x
          MembarSS
          Rtail = Rtail + 1
          Store (tail), Rtail

Consumer:
          Load Rhead, (head)
    spin: Load Rtail, (tail)
          if Rhead == Rtail goto spin
          MembarLL
          Load R, (Rhead)
          Rhead = Rhead + 1
          Store (head), Rhead
          process(R)

MembarSS ensures that the tail pointer is not updated before x has been stored
MembarLL ensures that R is not loaded before x has been stored

[Figure: Producer and Consumer sharing a queue with tail and head pointers; registers Rtail, Rhead, R]

Memory Consistency in SMPs

Suppose CPU-1 updates A to 200.
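The scenario can be tried out with a minimal sketch. It assumes two private caches that both already hold A = 100 and no coherence mechanism of any kind; the function names `update_a` and `stale_copies` are hypothetical, introduced only for this illustration.

```python
def update_a(policy):
    """CPU-1 stores 200 to A; return every copy of A afterwards."""
    memory = {"A": 100}
    cache1 = {"A": 100}
    cache2 = {"A": 100}  # CPU-2 cached A earlier; nothing notifies it of the store

    cache1["A"] = 200            # the store always updates CPU-1's own cache
    if policy == "write-through":
        memory["A"] = 200        # write-through also updates memory immediately
    # write-back: memory is updated only when the line is evicted later

    return {"cache-1": cache1["A"], "memory": memory["A"], "cache-2": cache2["A"]}


def stale_copies(policy):
    """Names of the locations that still hold the old value of A."""
    copies = update_a(policy)
    return sorted(name for name, value in copies.items() if value != 200)


for policy in ("write-back", "write-through"):
    print(policy, "->", stale_copies(policy))
# write-back -> ['cache-2', 'memory']
# write-through -> ['cache-2']
```

The model is deliberately tiny: one address, one store, and no evictions, which is enough to see which copies of A disagree with CPU-1 after the update.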
    write-back: memory and cache-2 have stale values
    write-through: cache-2 has a stale value

Do these stale values matter?
What is the view of shared memory for programming?

[Figure: CPU-1 with cache-1 (A = 100) and CPU-2 with cache-2 (A = 100) on the CPU-Memory bus; memory holds A = 100]

Write-back Caches & SC

    prog T1        prog T2
    ST X, 1        LD Y, R1
    ST Y, 11       ST Y’, R1
                   LD X, R2
                   ST X’, R2

("-" marks a location that holds no value yet)

    Step                             cache-1       memory                     cache-2
    • T1 is executed                 X=1, Y=11     X=0, Y=10, X’=-, Y’=-      Y=-, Y’=-, X=-, X’=-
    • cache-1 writes back Y          X=1, Y=11     X=0, Y=11, X’=-, Y’=-      Y=-, Y’=-, X=-, X’=-
    • T2 executed                    X=1, Y=11     X=0, Y=11, X’=-, Y’=-      Y=11, Y’=11, X=0, X’=0
    • cache-1 writes back X          X=1, Y=11     X=1, Y=11, X’=-, Y’=-      Y=11, Y’=11, X=0, X’=0
    • cache-2 writes back X’ & Y’    X=1, Y=11     X=1, Y=11, X’=0, Y’=11     Y=11, Y’=11, X=0, X’=0

The final memory state is inconsistent: Y’ = 11 means T2 saw T1's store to Y, so under SC it must also have seen the earlier store to X, yet X’ = 0.

Write-through Caches & SC

    prog T1        prog T2
    ST X, 1        LD Y, R1
    ST Y, 11       ST Y’, R1
                   LD X, R2
                   ST X’, R2

    Step             cache-1       memory                     cache-2
    (initial)        X=0, Y=10     X=0, Y=10, X’=-, Y’=-      Y=-, Y’=-, X=0, X’=-
    • T1 executed    X=1, Y=11     X=1, Y=11, X’=-, Y’=-      Y=-, Y’=-, X=0, X’=-
    • T2 executed    X=1, Y=11     X=1, Y=11, X’=0, Y’=11     Y=11, Y’=11, X=0, X’=0

Write-through caches don’t preserve sequential consistency either

Maintaining Sequential Consistency

SC is sufficient for correct producer-consumer and mutual exclusion code (e.g., Dekker)

Multiple copies of a location in various caches can cause SC to break down.

Hardware support is required such that
• only one processor at a time has write permission for a location
• no processor can load a stale copy of the location after a write

⇒ cache coherence protocols

Cache Coherence Protocols for SC

write request: the address is invalidated (updated) in all other caches before (after) the write is performed

read request: if a dirty copy is found in some cache, a write-back is performed before the memory is read

We will focus on Invalidation protocols as opposed to Update protocols

Warmup: Parallel I/O
(DMA stands for Direct Memory Access)

Either Cache or DMA can be the Bus Master and effect transfers

Page transfers occur while the Processor is running

[Figure labels: Proc., Cache, Memory Bus, Physical Memory, DMA, DISK; bus signals Address (A), Data (D), R/W]

Problems with Parallel I/O

Memory Disk: Physical
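The write-back example from earlier in the lecture ends in a state marked "inconsistent". One way to check that claim is to enumerate every order-preserving interleaving of T1 and T2 (the definition of SC quoted at the start) and compare the allowed results with what the non-coherent caches produced. The sketch below does this under simple assumptions: a single shared memory for the SC runs, initial values X = 0 and Y = 10 as on the slide, and hypothetical helper names (`execute`, `sc_outcomes`, `write_back_no_coherence`) that are not part of the lecture.

```python
from itertools import combinations

# Programs from the slides, as (kind, destination, source) tuples.
T1 = [("store", "X", 1), ("store", "Y", 11)]
T2 = [("load", "R1", "Y"), ("store", "Y'", "R1"),
      ("load", "R2", "X"), ("store", "X'", "R2")]


def execute(op, mem, regs):
    """Apply one instruction to a single shared memory."""
    kind, dst, src = op
    if kind == "load":
        regs[dst] = mem[src]
    else:  # store: source is a register name (str) or an immediate (int)
        mem[dst] = regs[src] if isinstance(src, str) else src


def sc_outcomes():
    """All (Y', X') pairs reachable by order-preserving interleavings."""
    outcomes = set()
    total = len(T1) + len(T2)
    # Choosing which time slots belong to T1 fixes one interleaving.
    for t1_slots in combinations(range(total), len(T1)):
        mem, regs = {"X": 0, "Y": 10}, {}
        it1, it2 = iter(T1), iter(T2)
        for slot in range(total):
            execute(next(it1) if slot in t1_slots else next(it2), mem, regs)
        outcomes.add((mem["Y'"], mem["X'"]))
    return outcomes


def write_back_no_coherence():
    """Replay the slide's schedule with private, non-coherent write-back caches."""
    mem, cache1, cache2 = {"X": 0, "Y": 10}, {}, {}
    cache1["X"], cache1["Y"] = 1, 11      # T1 executes; stores stay in cache-1
    mem["Y"] = cache1["Y"]                # cache-1 writes back Y only
    cache2["Y"] = mem["Y"]                # T2: LD Y misses, fetches from memory
    cache2["Y'"] = cache2["Y"]            # T2: ST Y'
    cache2["X"] = mem["X"]                # T2: LD X fetches the stale X = 0
    cache2["X'"] = cache2["X"]            # T2: ST X'
    mem["X"] = cache1["X"]                # cache-1 writes back X
    mem["X'"], mem["Y'"] = cache2["X'"], cache2["Y'"]  # cache-2 writes back
    return mem["Y'"], mem["X'"]


print(sorted(sc_outcomes()))                       # [(10, 0), (10, 1), (11, 1)]
print(write_back_no_coherence())                   # (11, 0)
print(write_back_no_coherence() in sc_outcomes())  # False
```

The result (Y’, X’) = (11, 0) never appears among the SC outcomes, which is exactly the extra ordering requirement SC adds on top of each thread's own program order.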