write A read B Can affect correctness in multiprocessors Simple in uniprocessors read B write A Overlap memory accesses with other accesses and computation Hiding Memory Latency Memory Consistency Models for Shared Memory Multiprocessors Conclusions Performance Evaluation Framework for Programming Simplicity Memory Consistency Models Outline Caches reduce latency but inherent communication remains High memory latency is a fundamental issue Stanford DASH 100 150 cycles Kendall Square Research KSR1 200 600 cycles Build large scale shared memory multiprocessors Interconnection Network Memory Cache Cache Memory Processor Processor Motivation 2 A 1 Flag 1 Interconnection Network 1 Memory A 0 Memory Processor Flag 0 1 A Processor Flag 1 Distribution of memory resources accesses issued in order may be observed out of order Memory Processor wait Flag 1 A 1 How Unsafe Reordering Can Happen Allows for high performance Simple for programmers Not necessary to maintain sequential order for correctness hardware buffering pipelining compiler register allocation code motion write A write B read A read B Uniprocessor model memory accesses atomic and in program order Memory model specifies ordering constraints among accesses Uniprocessor Memory Model B 1 A 0 Interconnection Network A 1 Memory Memory A 1 Cache Processor B 1 wait A 1 Cache Processor A 1 P2 Multiple copies of the same location P1 A wait Flag 1 P2 1 Memory Cache Processor A B 0 A 0 wait B 1 P3 Caches Complicate Things More Flag 1 A 1 P1 1 Order between accesses to different locations becomes important Shared Memory Multiprocessors Memory P2 Pn Any order implicitly assumed by programmer is maintained P1 Formalized by Lamport accesses of each processor in program order all accesses appear in sequential order Sequential Consistency affects system designers affects programmers Provide reasonable ordering constraints on memory accesses Need for a Multiprocessor Memory Model x Flag y A P2 c d we know a b and c d by program order b c implies that a d y 0 implies d a which leads to a contradiction x y 1 0 is not a possible outcome possible outcomes for x y 0 0 0 1 1 1 all locations are initialized to 0 A 1 a Flag 1 b P1 Simple Synchronization Example with SC the result of any execution is the same as if the operations of all the processors were executed in some sequential order and the operations of each individual processor appear in this sequence in the order specified by its program Lamport 1979 formalized Sequential Consistency SC multiple processes interleaved on a single processor operating system community used concurrent programming ease of use keep it similar to serial semantics for uniprocessor What should the semantics be for memory operations to the shared memory Memory Behavior Mem L2 Cache write buffer L1 Cache Pn never hit in the 1st level cache SC is maintained same as previous slide read hits in the first level cache cause complication e g Dekker s algorithm to maintain SC we need to delay access to 1st level until there are no writes pending in write buffer full write latency observed by processor multiprocessors may not maintain SC to achieve higher performance accesses from a single process complete in program order cache is busy while servicing a miss effectively delaying later access SC is guaranteed without any extra mechanism above coherence 1st level cache write thru 2nd level write back e g SGI cluster in DASH write buffer with no forwarding reads to 2nd level delayed until buffer empty L2 Cache write buffer L1 Cache P1 Example of Complication in Bus Based Machines Important to note that these are not necessary conditions for maintaining SC for simplicity we assume writes are atomic writes are atomic because all processors observe the write at the same time general cache coherence maintained by serialization at bus writes to same location serialized and observed in the same order by all assume write back caches Cache Cache Mem Pn P1 Simple Bus Based Multiprocessor a b and c d implied by program order x 0 implies b c which implies a d a d says y 1 which leads to a contradiction similarly y 0 implies x 1 which is also a contradiction a write completes when new value is visible to all processors for each processor delay issue of access until previous completes writes to the same location are observed in same order by all P s assumes general cache coherence if we have caches outcome x y 0 0 not possible c d a read completes when the return value is back B 1 y A A 1 a x B b Sufficient Conditions for SC Dubois et al 1987 How to Guarantee SC possible outcomes for x y 0 1 1 0 1 1 P2 P1 From Dekker s Algorithm Another Example with SC Mem Mem WRITE READ READ WRITE WRITE WRITE Severely restricts common hardware and compiler optimizations READ READ Maintain order between shared accesses in each process Summary for Sequential Consistency writes are inherently non atomic new value is visible to some while others can still see old value no one point in the system where a write is completed accesses issued in order do not necessarily complete in order due to distribution of memory and varied length paths in network general cache coherence serialize at memory location point to point order required no more bus to serialize accesses only order maintained by network is point to point Cache Cache Interconnection Network Pn P1 Scalable Shared Memory Multiprocessor Mem Mem WRITE READ READ WRITE WRITE READ READ WRITE Partial Store Ordering PSO READ READ Processor Consistency PC Total Store Ordering TSO READ READ Relax constraints on memory order WRITE WRITE WRITE WRITE Alternatives to Sequential Consistency latencies are large 10 s to 100 s of cycles and all latency seen by processor need to know when a write completes for providing atomicity for delaying an access until previous one completes requires acknowledgement messages write is complete when all invalidations are acknowledged use a counter to count the number of acknowledgements ensuring atomicity for writes delay access to new value until all acknowledgments are back can be done for invalidation based schemes unnatural for updates ensuring order of accesses from a processor delay each access until the previous one completes Cache Cache Interconnection Network Pn P1 Scalable Architectures Cont d Specify methodology for writing safe programs Difficult to automatically determine orders that are not necessary B B lock L A unlock L unlock L A B 1 B 1 lock L A 1 Sufficient Order A 1
View Full Document
Unlocking...