Unformatted text preview:

Memory Consistency Models for Shared Memory Multiprocessors Motivation Processor Processor Cache Cache Memory Memory Interconnection Network Build large scale shared memory multiprocessors High memory latency is a fundamental issue Stanford DASH 100 150 cycles Kendall Square Research KSR1 200 600 cycles Caches reduce latency but inherent communication remains Hiding Memory Latency Overlap memory accesses with other accesses and computation write A write A read B read B Simple in uniprocessors Can affect correctness in multiprocessors Outline Memory Consistency Models Framework for Programming Simplicity Performance Evaluation Conclusions Uniprocessor Memory Model Memory model specifies ordering constraints among accesses Uniprocessor model memory accesses atomic and in program order write A write B read A read B Not necessary to maintain sequential order for correctness hardware buffering pipelining compiler register allocation code motion Simple for programmers Allows for high performance Shared Memory Multiprocessors Order between accesses to different locations becomes important P1 P2 A 1 Flag 1 wait Flag 1 A How Unsafe Reordering Can Happen A 1 Processor wait Flag 1 Flag 1 Processor Memory Memory 1 A Processor A 0 Memory Flag 0 A 1 2 Flag 1 Interconnection Network Distribution of memory resources accesses issued in order may be observed out of order 1 Caches Complicate Things More Multiple copies of the same location P2 P1 A 1 P3 wait A 1 B 1 wait B 1 A Processor Processor Cache Cache Memory Memory A 1 Processor A 0 1 A 0 Cache Memory B 1 A 1 Interconnection Network B 0 1 Need for a Multiprocessor Memory Model Provide reasonable ordering constraints on memory accesses affects programmers affects system designers Memory Behavior What should the semantics be for memory operations to the shared memory ease of use keep it similar to serial semantics for uniprocessor operating system community used concurrent programming multiple processes interleaved on a single processor Lamport 1979 formalized Sequential Consistency SC the result of any execution is the same as if the operations of all the processors were executed in some sequential order and the operations of each individual processor appear in this sequence in the order specified by its program Sequential Consistency Formalized by Lamport accesses of each processor in program order all accesses appear in sequential order P1 P2 Pn Memory Any order implicitly assumed by programmer is maintained Example with SC Simple Synchronization P1 A 1 a Flag 1 b P2 x Flag y A c d all locations are initialized to 0 possible outcomes for x y 0 0 0 1 1 1 x y 1 0 is not a possible outcome we know a b and c d by program order b c implies that a d y 0 implies d a which leads to a contradiction Another Example with SC From Dekker s Algorithm P1 P2 A 1 a x B b B 1 y A c d possible outcomes for x y 0 1 1 0 1 1 outcome x y 0 0 not possible a b and c d implied by program order x 0 implies b c which implies a d a d says y 1 which leads to a contradiction similarly y 0 implies x 1 which is also a contradiction How to Guarantee SC Sufficient Conditions for SC Dubois et al 1987 assumes general cache coherence if we have caches writes to the same location are observed in same order by all P s for each processor delay issue of access until previous completes a read completes when the return value is back a write completes when new value is visible to all processors for simplicity we assume writes are atomic Important to note that these are not necessary conditions for maintaining SC Simple Bus Based Multiprocessor P1 Pn Cache Cache Mem assume write back caches general cache coherence maintained by serialization at bus writes to same location serialized and observed in the same order by all writes are atomic because all processors observe the write at the same time accesses from a single process complete in program order cache is busy while servicing a miss effectively delaying later access SC is guaranteed without any extra mechanism above coherence Example of Complication in Bus Based Machines P1 Pn L1 Cache L1 Cache write buffer write buffer L2 Cache L2 Cache Mem 1st level cache write thru 2nd level write back e g SGI cluster in DASH write buffer with no forwarding reads to 2nd level delayed until buffer empty never hit in the 1st level cache SC is maintained same as previous slide read hits in the first level cache cause complication e g Dekker s algorithm to maintain SC we need to delay access to 1st level until there are no writes pending in write buffer full write latency observed by processor multiprocessors may not maintain SC to achieve higher performance Scalable Shared Memory Multiprocessor P1 Pn Cache Cache Mem Mem Interconnection Network no more bus to serialize accesses only order maintained by network is point to point general cache coherence serialize at memory location point to point order required accesses issued in order do not necessarily complete in order due to distribution of memory and varied length paths in network writes are inherently non atomic new value is visible to some while others can still see old value no one point in the system where a write is completed Scalable Architectures Cont d P1 Pn Cache Cache Mem Mem Interconnection Network need to know when a write completes for providing atomicity for delaying an access until previous one completes requires acknowledgement messages write is complete when all invalidations are acknowledged use a counter to count the number of acknowledgements ensuring atomicity for writes delay access to new value until all acknowledgments are back can be done for invalidation based schemes unnatural for updates ensuring order of accesses from a processor delay each access until the previous one completes latencies are large 10 s to 100 s of cycles and all latency seen by processor Summary for Sequential Consistency Maintain order between shared accesses in each process READ READ WRITE WRITE READ WRITE READ WRITE Severely restricts common hardware and compiler optimizations Alternatives to Sequential Consistency Relax constraints on memory order READ READ WRITE WRITE READ WRITE READ WRITE Processor Consistency PC Total Store Ordering TSO READ READ WRITE WRITE READ WRITE READ WRITE Partial Store Ordering PSO Relaxed Models Processor consistency PC Goodman 89 Total store ordering TSO Sindhu 90 Causal memory Hutto 90 PRAM Lipton 90 Partial store ordering PSO Sindhu 90 Weak ordering WO Dubois 86


View Full Document

CMU CS 15740 - Memory Consistency Models for Shared-Memory Multiprocessors

Documents in this Course
leecture

leecture

17 pages

Lecture

Lecture

9 pages

Lecture

Lecture

36 pages

Lecture

Lecture

9 pages

Lecture

Lecture

13 pages

lecture

lecture

25 pages

lect17

lect17

7 pages

Lecture

Lecture

65 pages

Lecture

Lecture

28 pages

lect07

lect07

24 pages

lect07

lect07

12 pages

lect03

lect03

3 pages

lecture

lecture

11 pages

lecture

lecture

20 pages

lecture

lecture

11 pages

Lecture

Lecture

9 pages

Lecture

Lecture

10 pages

Lecture

Lecture

22 pages

Lecture

Lecture

28 pages

Lecture

Lecture

18 pages

lecture

lecture

63 pages

lecture

lecture

13 pages

Lecture

Lecture

36 pages

Lecture

Lecture

18 pages

Lecture

Lecture

17 pages

Lecture

Lecture

12 pages

lecture

lecture

34 pages

lecture

lecture

47 pages

lecture

lecture

7 pages

Lecture

Lecture

18 pages

Lecture

Lecture

7 pages

Lecture

Lecture

21 pages

Lecture

Lecture

10 pages

Lecture

Lecture

39 pages

Lecture

Lecture

11 pages

lect04

lect04

40 pages

Load more
Loading Unlocking...
Login

Join to view Memory Consistency Models for Shared-Memory Multiprocessors and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Memory Consistency Models for Shared-Memory Multiprocessors and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?