Memory Consistency Models for Shared-Memory Multiprocessors

Motivation
• Build large-scale shared-memory multiprocessors
• High memory latency is a fundamental issue
- Stanford DASH: 100-150 cycles
- Kendall Square Research KSR1: 200-600 cycles
• Caches reduce latency, but inherent communication remains
(Figure: processors with caches and memories connected by an interconnection network)

Hiding Memory Latency
• Overlap memory accesses with other accesses and computation
• Simple in uniprocessors
• Can affect correctness in multiprocessors
(Figure: "write A" overlapped with "read B")

Outline
• Memory Consistency Models
• Framework for Programming Simplicity
• Performance Evaluation
• Conclusions

Uniprocessor Memory Model
• Memory model specifies ordering constraints among accesses
• Uniprocessor model: memory accesses are atomic and in program order
• Not necessary to maintain sequential order for correctness
- hardware: buffering, pipelining
- compiler: register allocation, code motion
• Simple for programmers
• Allows for high performance
(Figure: "write A, write B, read A, read B" issued in program order)

Shared-Memory Multiprocessors
• Order between accesses to different locations becomes important

    P1: A = 1;          P2: wait (Flag == 1);
        Flag = 1;           ... = A;

How Unsafe Reordering Can Happen
• Distribution of memory resources
- accesses issued in order may be observed out of order
(Figure: two processors and distributed memories on an interconnection network; P1 issues A = 1 then Flag = 1, but Flag = 1 reaches its memory module first, so P2 can observe Flag == 1 while A is still 0)

Caches Complicate Things More
• Multiple copies of the same location

    P1: A = 1;    P2: wait (A == 1);    P3: wait (B == 1);
                      B = 1;                ... = A;

(Figure: three processors with caches on an interconnection network; P3 can observe B == 1 yet still read a stale A == 0 from its cache)
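The flag-synchronization pattern from the slides above can be written down concretely. The following is a minimal sketch, not part of the original slides, using C11 atomics with POSIX threads; the function name run_flag_example is invented for illustration. With the default memory_order_seq_cst on every atomic access, neither the two stores nor the two loads may be reordered, so once P2 observes Flag == 1 it must also observe A == 1.

```c
/* Sketch of the flag-synchronization example under sequential
   consistency, using C11 seq_cst atomics (the default ordering). */
#include <pthread.h>
#include <stdatomic.h>

static atomic_int A;
static atomic_int Flag;

/* P1:  A = 1;  Flag = 1;  (stores happen in program order) */
static void *p1(void *arg) {
    (void)arg;
    atomic_store(&A, 1);
    atomic_store(&Flag, 1);
    return NULL;
}

/* P2:  wait (Flag == 1);  ... = A;  returns the value read from A. */
int run_flag_example(void) {
    atomic_store(&A, 0);
    atomic_store(&Flag, 0);
    pthread_t t;
    pthread_create(&t, NULL, p1, NULL);
    while (atomic_load(&Flag) == 0)
        ;                            /* spin: wait (Flag == 1) */
    int y = atomic_load(&A);         /* guaranteed to be 1 under SC */
    pthread_join(t, NULL);
    return y;
}
```

Under a weaker ordering (or on hardware that reorders the stores, as in the "unsafe reordering" slide), the read of A could return 0.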
Need for a Multiprocessor Memory Model
• Provide reasonable ordering constraints on memory accesses
- affects programmers
- affects system designers

Memory Behavior
What should the semantics be for memory operations to shared memory?
• ease of use: keep it similar to the serial semantics of a uniprocessor
• the operating-system community already used concurrent programming: multiple processes interleaved on a single processor
• Lamport (1979) formalized Sequential Consistency (SC):
"... the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program."

Sequential Consistency
• Formalized by Lamport
- accesses of each processor appear in program order
- all accesses appear in a single sequential order
• Any order implicitly assumed by the programmer is maintained
(Figure: P1 ... Pn taking turns accessing a single shared memory)

Example with SC
Simple synchronization:

    P1: A = 1     (a)      P2: x = Flag  (c)
        Flag = 1  (b)          y = A     (d)

• all locations are initialized to 0
• possible outcomes for (x, y): (0,0), (0,1), (1,1)
• (x, y) = (1,0) is not a possible outcome:
- we know a -> b and c -> d by program order
- x == 1 implies b -> c, which implies a -> d, so y == 1
- y == 0 would imply d -> a, which leads to a contradiction

Another Example with SC
From Dekker's algorithm:

    P1: A = 1  (a)      P2: B = 1  (c)
        x = B  (b)          y = A  (d)

• possible outcomes for (x, y): (0,1), (1,0), (1,1)
• outcome (x, y) = (0,0) is not possible:
- a -> b and c -> d are implied by program order
- x == 0 implies b -> c, which implies a -> d, so y == 1: a contradiction
- similarly, y == 0 implies x == 1, again a contradiction

How to Guarantee SC
Sufficient conditions for SC (Dubois et al., 1987):
• assume general cache coherence (if there are caches): writes to the same location are observed in the same order by all processors
• each processor delays issuing an access until the previous one completes
- a read completes when its return value is back
- a write completes when the new value is visible to all processors
• for simplicity, we assume writes are atomic
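The SC argument for the Dekker example above can also be checked mechanically by enumerating every interleaving that respects each processor's program order. The following checker is not from the slides; the function name dekker_outcomes is invented for illustration. Under SC only (0,1), (1,0), and (1,1) are reachable, and (0,0) never occurs.

```c
/* Exhaustively enumerate the sequentially consistent interleavings of
 *     P1: A = 1 (a); x = B (b)      P2: B = 1 (c); y = A (d)
 * All locations start at 0.  seen[x][y] is set to 1 for every outcome
 * some interleaving produces; returns 1 iff (0,0) was produced. */
int dekker_outcomes(int seen[2][2]) {
    for (int mask = 0; mask < 16; mask++) {  /* bit i set: slot i runs a P1 op */
        int bits = 0;
        for (int slot = 0; slot < 4; slot++)
            if (mask & (1 << slot)) bits++;
        if (bits != 2) continue;             /* need exactly two P1 slots */
        int A = 0, B = 0, x = 0, y = 0, p1 = 0, p2 = 0;
        for (int slot = 0; slot < 4; slot++) {
            if (mask & (1 << slot)) {        /* P1's next op, in program order */
                if (p1++ == 0) A = 1;        /* (a) */
                else           x = B;        /* (b) */
            } else {                         /* P2's next op, in program order */
                if (p2++ == 0) B = 1;        /* (c) */
                else           y = A;        /* (d) */
            }
        }
        seen[x][y] = 1;
    }
    return seen[0][0];
}
```

All six interleavings of the four operations are covered, which makes the slide's proof-by-contradiction concrete: no sequential order yields x == 0 and y == 0.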
Important to note: these are sufficient, not necessary, conditions for maintaining SC.

Simple Bus-Based Multiprocessor
• assume write-back caches
• general cache coherence is maintained by serialization at the bus
- writes to the same location are serialized and observed in the same order by all processors
- writes are atomic because all processors observe the write at the same time
• accesses from a single processor complete in program order
- the cache is busy while servicing a miss, effectively delaying later accesses
• SC is guaranteed without any extra mechanism beyond coherence
(Figure: P1 ... Pn with caches on a shared bus to memory)

Example of Complication in Bus-Based Machines
• 1st-level cache is write-through, 2nd-level is write-back (e.g., the SGI cluster in DASH)
• write buffer with no forwarding (reads to the 2nd level are delayed until the buffer is empty)
• if accesses never hit in the 1st-level cache, SC is maintained (as on the previous slide)
• read hits in the 1st-level cache cause complications (e.g., for Dekker's algorithm)
• to maintain SC, accesses to the 1st level must be delayed until no writes are pending in the write buffer (the full write latency is observed by the processor)
• multiprocessors may choose not to maintain SC in order to achieve higher performance
(Figure: P1 ... Pn, each with an L1 cache, a write buffer, and an L2 cache, sharing memory)

Scalable Shared-Memory Multiprocessors
• no bus to serialize accesses
• the only order maintained by the network is point-to-point
• general cache coherence:
- serialize at the memory location; point-to-point order required
• accesses issued in order do not necessarily complete in order
- due to the distribution of memory and varied-length paths in the network
• writes are inherently non-atomic
- the new value is visible to some processors while others can still see the old value
- there is no single point in the system where a write is completed
(Figure: P1 ... Pn, each with a cache and a local memory, connected by an interconnection network)

Scalable Architectures (Cont'd)
• need to know when a write completes
- for providing atomicity
- for delaying an access until the previous one completes
• requires acknowledgement messages
- a write is complete when all invalidations are acknowledged
- use a counter to count the acknowledgements
• ensuring atomicity for writes
- delay access to the new value until all acknowledgements are back
- can be done for invalidation-based schemes; unnatural for updates
• ensuring order of accesses from a processor
- delay each access until the previous one completes
- latencies are large (10s to 100s of cycles) and all of the latency is seen by the processor
(Figure: P1 ... Pn, each with a cache and a local memory, connected by an interconnection network)

Summary for Sequential Consistency
• Maintain order between shared accesses in each process
• Severely restricts common hardware and compiler optimizations
(Figure: the four program orders SC preserves: READ -> READ, READ -> WRITE, WRITE -> READ, WRITE -> WRITE)

Alternatives to
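The acknowledgement-counting scheme described for scalable architectures can be sketched in a few lines. This is an illustrative model only, not the original slides' hardware; the type write_tracker and the functions issue_write, receive_ack, and write_complete are invented names.

```c
/* Sketch of acknowledgement counting: a write is considered complete
 * only when every invalidation it triggered has been acknowledged. */

typedef struct {
    int pending_acks;   /* invalidation acknowledgements still outstanding */
} write_tracker;

/* Issue a write that must invalidate nsharers remote cached copies. */
void issue_write(write_tracker *w, int nsharers) {
    w->pending_acks = nsharers;
}

/* Process one incoming acknowledgement message. */
void receive_ack(write_tracker *w) {
    if (w->pending_acks > 0)
        w->pending_acks--;
}

/* The write is complete (new value visible to all processors) only when
 * the counter reaches zero; until then the issuing processor delays its
 * next access and, for atomicity, other processors are kept from
 * reading the new value. */
int write_complete(const write_tracker *w) {
    return w->pending_acks == 0;
}
```

This is what makes the "delay each access until the previous one completes" condition implementable without a bus: completion becomes an explicit, countable event.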