DOC PREVIEW
CMU CS 15740 - Lecture

This preview shows page 1-2-3-4 out of 13 pages.

Save
View full document
Premium Document
Do you want full access? Go Premium and unlock all 13 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

write A read B Can affect correctness in multiprocessors Simple in uniprocessors read B write A Overlap memory accesses with other accesses and computation Hiding Memory Latency Memory Consistency Models for Shared Memory Multiprocessors Conclusions Performance Evaluation Framework for Programming Simplicity Memory Consistency Models Outline Caches reduce latency but inherent communication remains High memory latency is a fundamental issue Stanford DASH 100 150 cycles Kendall Square Research KSR1 200 600 cycles Build large scale shared memory multiprocessors Interconnection Network Memory Cache Cache Memory Processor Processor Motivation 2 A 1 Flag 1 Interconnection Network 1 Memory A 0 Memory Processor Flag 0 1 A Processor Flag 1 Distribution of memory resources accesses issued in order may be observed out of order Memory Processor wait Flag 1 A 1 How Unsafe Reordering Can Happen Allows for high performance Simple for programmers Not necessary to maintain sequential order for correctness hardware buffering pipelining compiler register allocation code motion write A write B read A read B Uniprocessor model memory accesses atomic and in program order Memory model specifies ordering constraints among accesses Uniprocessor Memory Model B 1 A 0 Interconnection Network A 1 Memory Memory A 1 Cache Processor B 1 wait A 1 Cache Processor A 1 P2 Multiple copies of the same location P1 A wait Flag 1 P2 1 Memory Cache Processor A B 0 A 0 wait B 1 P3 Caches Complicate Things More Flag 1 A 1 P1 1 Order between accesses to different locations becomes important Shared Memory Multiprocessors Memory P2 Pn Any order implicitly assumed by programmer is maintained P1 Formalized by Lamport accesses of each processor in program order all accesses appear in sequential order Sequential Consistency affects system designers affects programmers Provide reasonable ordering constraints on memory accesses Need for a Multiprocessor Memory Model x Flag y A P2 c d we know a b and c d by program order b c implies that a d y 0 implies d a which leads to a contradiction x y 1 0 is not a possible outcome possible outcomes for x y 0 0 0 1 1 1 all locations are initialized to 0 A 1 a Flag 1 b P1 Simple Synchronization Example with SC the result of any execution is the same as if the operations of all the processors were executed in some sequential order and the operations of each individual processor appear in this sequence in the order specified by its program Lamport 1979 formalized Sequential Consistency SC multiple processes interleaved on a single processor operating system community used concurrent programming ease of use keep it similar to serial semantics for uniprocessor What should the semantics be for memory operations to the shared memory Memory Behavior Mem L2 Cache write buffer L1 Cache Pn never hit in the 1st level cache SC is maintained same as previous slide read hits in the first level cache cause complication e g Dekker s algorithm to maintain SC we need to delay access to 1st level until there are no writes pending in write buffer full write latency observed by processor multiprocessors may not maintain SC to achieve higher performance accesses from a single process complete in program order cache is busy while servicing a miss effectively delaying later access SC is guaranteed without any extra mechanism above coherence 1st level cache write thru 2nd level write back e g SGI cluster in DASH write buffer with no forwarding reads to 2nd level delayed until buffer empty L2 Cache write buffer L1 Cache P1 Example of Complication in Bus Based Machines Important to note that these are not necessary conditions for maintaining SC for simplicity we assume writes are atomic writes are atomic because all processors observe the write at the same time general cache coherence maintained by serialization at bus writes to same location serialized and observed in the same order by all assume write back caches Cache Cache Mem Pn P1 Simple Bus Based Multiprocessor a b and c d implied by program order x 0 implies b c which implies a d a d says y 1 which leads to a contradiction similarly y 0 implies x 1 which is also a contradiction a write completes when new value is visible to all processors for each processor delay issue of access until previous completes writes to the same location are observed in same order by all P s assumes general cache coherence if we have caches outcome x y 0 0 not possible c d a read completes when the return value is back B 1 y A A 1 a x B b Sufficient Conditions for SC Dubois et al 1987 How to Guarantee SC possible outcomes for x y 0 1 1 0 1 1 P2 P1 From Dekker s Algorithm Another Example with SC Mem Mem WRITE READ READ WRITE WRITE WRITE Severely restricts common hardware and compiler optimizations READ READ Maintain order between shared accesses in each process Summary for Sequential Consistency writes are inherently non atomic new value is visible to some while others can still see old value no one point in the system where a write is completed accesses issued in order do not necessarily complete in order due to distribution of memory and varied length paths in network general cache coherence serialize at memory location point to point order required no more bus to serialize accesses only order maintained by network is point to point Cache Cache Interconnection Network Pn P1 Scalable Shared Memory Multiprocessor Mem Mem WRITE READ READ WRITE WRITE READ READ WRITE Partial Store Ordering PSO READ READ Processor Consistency PC Total Store Ordering TSO READ READ Relax constraints on memory order WRITE WRITE WRITE WRITE Alternatives to Sequential Consistency latencies are large 10 s to 100 s of cycles and all latency seen by processor need to know when a write completes for providing atomicity for delaying an access until previous one completes requires acknowledgement messages write is complete when all invalidations are acknowledged use a counter to count the number of acknowledgements ensuring atomicity for writes delay access to new value until all acknowledgments are back can be done for invalidation based schemes unnatural for updates ensuring order of accesses from a processor delay each access until the previous one completes Cache Cache Interconnection Network Pn P1 Scalable Architectures Cont d Specify methodology for writing safe programs Difficult to automatically determine orders that are not necessary B B lock L A unlock L unlock L A B 1 B 1 lock L A 1 Sufficient Order A 1


View Full Document

CMU CS 15740 - Lecture

Documents in this Course
leecture

leecture

17 pages

Lecture

Lecture

9 pages

Lecture

Lecture

36 pages

Lecture

Lecture

9 pages

lecture

lecture

25 pages

lect17

lect17

7 pages

Lecture

Lecture

65 pages

Lecture

Lecture

28 pages

lect07

lect07

24 pages

lect07

lect07

12 pages

lect03

lect03

3 pages

lecture

lecture

11 pages

lecture

lecture

20 pages

lecture

lecture

11 pages

Lecture

Lecture

9 pages

Lecture

Lecture

10 pages

Lecture

Lecture

22 pages

Lecture

Lecture

28 pages

Lecture

Lecture

18 pages

lecture

lecture

63 pages

lecture

lecture

13 pages

Lecture

Lecture

36 pages

Lecture

Lecture

18 pages

Lecture

Lecture

17 pages

Lecture

Lecture

12 pages

lecture

lecture

34 pages

lecture

lecture

47 pages

lecture

lecture

7 pages

Lecture

Lecture

18 pages

Lecture

Lecture

7 pages

Lecture

Lecture

21 pages

Lecture

Lecture

10 pages

Lecture

Lecture

39 pages

Lecture

Lecture

11 pages

lect04

lect04

40 pages

Load more
Loading Unlocking...
Login

Join to view Lecture and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Lecture and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?