EECS 252 Graduate Computer Architecture
Lec 15 – T1 (“Niagara”) and Papers Discussion
David Patterson
Electrical Engineering and Computer Sciences
University of California, Berkeley
http://www.eecs.berkeley.edu/~pattrsn
http://vlsi.cs.berkeley.edu/cs252-s06
3/13/2006

Review
• Caches contain all information on the state of cached memory blocks
• Snooping cache over a shared medium works for smaller MPs by invalidating other cached copies on a write
• Sharing cached data ⇒ Coherence (which values are returned by a read) and Consistency (when a written value will be returned by a read)
• Snooping and directory protocols are similar; a bus makes snooping easier because of broadcast (snooping ⇒ uniform memory access)
• A directory keeps an extra data structure to track the state of all cache blocks
• Distributing the directory ⇒ scalable shared-address multiprocessor ⇒ cache coherent, non-uniform memory access

Outline
• Consistency
• Cross-Cutting Issues
• Fallacies and Pitfalls
• Administrivia
• Sun T1 (“Niagara”) Multiprocessor
• 2-paper discussion

Another MP Issue: Memory Consistency Models
• What is consistency? When must a processor see a new value? For example (a runnable version follows below):

      P1:  A = 0;                P2:  B = 0;
           .....                      .....
           A = 1;                     B = 1;
      L1:  if (B == 0) ...       L2:  if (A == 0) ...

• Is it impossible for both if statements L1 & L2 to be true?
  – What if the write invalidate is delayed & the processor continues?
• Memory consistency models: what are the rules for such cases?
• Sequential consistency: the result of any execution is the same as if the accesses of each processor were kept in order and the accesses among different processors were interleaved ⇒ the assignments complete before the ifs above
  – SC: delay all memory accesses until all invalidates are done

Memory Consistency Model
• Schemes give faster execution than sequential consistency
• Not an issue for most programs, because they are synchronized
  – A program is synchronized if all accesses to shared data are ordered by synchronization operations:

        write (x)
        ...
        release (s) {unlock}
        ...
        acquire (s) {lock}
        ...
        read (x)

• Only programs willing to be nondeterministic are not synchronized: “data race”, where the outcome is a function of relative processor speed
• Several relaxed models for memory consistency exist, since most programs are synchronized; they are characterized by their attitude towards RAR, WAR, RAW, WAW to different addresses
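The two-processor example on the memory-consistency slide can be run as a real program. Below is a minimal C11 harness, not from the lecture: the thread functions p1/p2 and the iteration count are illustrative. It issues the two stores and loads as relaxed atomics and counts how often both if conditions come out true, an outcome sequential consistency forbids but real hardware may allow. Compile with -pthread.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    atomic_int A, B;               /* shared locations, as on the slide */
    int r1, r2;                    /* what each "processor" observed    */

    void *p1(void *arg) {
        (void)arg;
        atomic_store_explicit(&A, 1, memory_order_relaxed);   /* A = 1;          */
        r1 = atomic_load_explicit(&B, memory_order_relaxed);  /* L1: if (B == 0) */
        return NULL;
    }

    void *p2(void *arg) {
        (void)arg;
        atomic_store_explicit(&B, 1, memory_order_relaxed);   /* B = 1;          */
        r2 = atomic_load_explicit(&A, memory_order_relaxed);  /* L2: if (A == 0) */
        return NULL;
    }

    int main(void) {
        int both = 0;
        for (int i = 0; i < 100000; i++) {
            atomic_store(&A, 0);                 /* P1: A = 0;  P2: B = 0; */
            atomic_store(&B, 0);
            pthread_t t1, t2;
            pthread_create(&t1, NULL, p1, NULL);
            pthread_create(&t2, NULL, p2, NULL);
            pthread_join(t1, NULL);
            pthread_join(t2, NULL);
            if (r1 == 0 && r2 == 0)              /* both ifs true: forbidden under SC */
                both++;
        }
        printf("both L1 and L2 saw 0 in %d of 100000 runs\n", both);
        return 0;
    }

On typical x86 hardware the SC-forbidden outcome shows up in some fraction of runs: each store sits in the store buffer while the following load completes, which is the hardware analogue of the delayed write-invalidate the slide raises.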
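The write (x) ... release (s) ... acquire (s) ... read (x) pattern on the last slide maps directly onto a lock. A minimal sketch, assuming POSIX threads; the writer/reader split and the value 42 are illustrative:

    #include <pthread.h>
    #include <stdio.h>

    int x;                                        /* shared data */
    pthread_mutex_t s = PTHREAD_MUTEX_INITIALIZER;

    void *writer(void *arg) {
        (void)arg;
        pthread_mutex_lock(&s);                   /* acquire (s) {lock}   */
        x = 42;                                   /* write (x)            */
        pthread_mutex_unlock(&s);                 /* release (s) {unlock} */
        return NULL;
    }

    void *reader(void *arg) {
        (void)arg;
        pthread_mutex_lock(&s);                   /* acquire (s) {lock}   */
        printf("read (x) = %d\n", x);             /* read (x)             */
        pthread_mutex_unlock(&s);                 /* release (s) {unlock} */
        return NULL;
    }

    int main(void) {
        pthread_t tw, tr;
        pthread_create(&tw, NULL, writer, NULL);
        pthread_create(&tr, NULL, reader, NULL);
        pthread_join(tw, NULL);
        pthread_join(tr, NULL);
        return 0;
    }

Which thread takes the lock first is still up to the scheduler, so the reader may print 0 or 42, but every access to x is ordered by a synchronization operation: there is no data race, and the program behaves the same under sequential consistency as under any of the relaxed models discussed next.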
Relaxed Consistency Models: The Basics
• Key idea: allow reads and writes to complete out of order, but use synchronization operations to enforce ordering, so that a synchronized program behaves as if the processor were sequentially consistent (see the fence sketch after the fallacy slides)
  – By relaxing orderings, we may obtain performance advantages
  – The model also specifies the range of legal compiler optimizations on shared data
  – Unless synchronization points are clearly defined and programs are synchronized, the compiler could not interchange a read and a write of 2 shared data items, because that might affect the semantics of the program
• 3 major sets of relaxed orderings:
  1. W→R ordering (all writes completed before the next read). Because this retains ordering among writes, many programs that operate under sequential consistency operate under this model without additional synchronization. Called processor consistency
  2. W→W ordering (all writes completed before the next write)
  3. R→W and R→R orderings, a variety of models depending on the ordering restrictions and on how synchronization operations enforce ordering
• There are many complexities in relaxed consistency models: defining precisely what it means for a write to complete, and deciding when a processor can see values that it has written

Mark Hill observation
• Instead, use speculation to hide latency from a strict consistency model
  – If the processor receives an invalidation for a memory reference before the reference is committed, the processor uses speculation recovery to back out the computation and restart with the invalidated memory reference
1. An aggressive implementation of sequential consistency or processor consistency gains most of the advantage of the more relaxed models
2. The implementation adds little to the cost of a speculative processor
3. It allows the programmer to reason using the simpler programming models

Cross-Cutting Issues: Performance Measurement of Parallel Processors
• Performance: how well does the machine scale as we increase the processor count?
• Consider speedup at fixed problem size as well as scale-up of the problem
  – Assume a benchmark of size n on p processors makes sense: how do we scale the benchmark to run on m × p processors?
  – Memory-constrained scaling: keep the amount of memory used per processor constant
  – Time-constrained scaling: keep total execution time, assuming perfect speedup, constant
• Example: 1 hour on 10 processors, time ~ O(n³); what happens on 100 processors? (a worked version follows below)
  – Time-constrained scaling: 1 hour ⇒ 10^(1/3) · n ⇒ 2.15n scale-up
  – Memory-constrained scaling: 10n problem size ⇒ 10³/10 ⇒ 100X, i.e. 100 hours! 10X the processors, 100X longer???
  – You need to know the application well to scale it: # of iterations, error tolerance

Fallacy: Amdahl’s Law doesn’t apply to parallel computers
• Since some part of every program is sequential, we can’t get a 100X speedup?
• A 1987 claim to have broken the law cited a 1000X speedup
  – The researchers scaled the benchmark to have a data set 1000 times larger and compared the uniprocessor and parallel execution times of the scaled benchmark. For this particular algorithm the sequential portion of the program was constant, independent of the input size, and the rest was fully parallel; hence, linear speedup with 1000 processors
• Usually the sequential portion scales with the data too

Fallacy: Linear speedups are needed to make multiprocessors cost-effective
• Mark Hill & David Wood, 1995 study
• Compare the costs of an SGI uniprocessor and MP (a worked version follows below):
  – Uniprocessor = $38,400 + $100 × MB
  – MP = $81,600 + $20,000 × P + $100 × MB
  – With 1 GB of memory: uni = $138k vs. MP = $181k + $20k × P
• What speedup makes the MP the better cost-performer?
  – 8 proc = $341k; $341k / $138k ⇒ only 2.5X needed
  – 16 proc ⇒ need only 3.6X, about 25% of linear speedup
• Even if the MP needs somewhat more memory, the required speedup stays well below linear

Fallacy: Scalability is almost free
• “Build scalability into a multiprocessor and then simply offer the multiprocessor at any point on the scale from a small number of processors to a large number”
• Cray T3E scales to 2048 CPUs vs. the 4-CPU Alpha server
  – At 128 CPUs the T3E delivers a peak bisection BW of 38.4 GB/s, or 300 MB/s per CPU (it uses the Alpha microprocessor)
  – Compaq AlphaServer ES40 scales up to 4 CPUs ...
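As promised on the relaxed-consistency slide, here is a minimal sketch of how a fence at a synchronization point restores ordering. It uses C11 atomics as a portable stand-in for machine fence instructions; the function name p1_with_fence is ours. Under a model that relaxes W→R ordering, P1’s load of B could be performed before its store to A is visible to P2; a full fence forbids that reordering, and with the mirror-image fence on P2’s side the relaxed outcome of the earlier litmus test disappears.

    #include <stdatomic.h>

    atomic_int A, B;

    /* P1's half of the litmus test, with a fence at the point where a
     * synchronized program would place a synchronization operation. */
    int p1_with_fence(void) {
        atomic_store_explicit(&A, 1, memory_order_relaxed);    /* A = 1;             */
        atomic_thread_fence(memory_order_seq_cst);             /* order store vs load */
        return atomic_load_explicit(&B, memory_order_relaxed); /* L1: if (B == 0)    */
    }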
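The scaling example from the performance-measurement slide, checked in a few lines of C (compile with -lm; the numbers are the slide’s):

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        /* Time-constrained: 10x the processors buy 10x the work in the
         * same hour; since time ~ O(n^3), n may grow by 10^(1/3). */
        printf("time-constrained scale-up: %.2fn\n", cbrt(10.0));   /* 2.15n */

        /* Memory-constrained: 10x the processors means 10x the memory,
         * so the problem grows to 10n; the work grows by (10n)^3 / n^3
         * = 1000x while compute grows only 10x, so the run takes
         * 1000 / 10 = 100x as long: 100 hours. */
        printf("memory-constrained slowdown: %.0fx (100 hours)\n",
               pow(10.0, 3.0) / 10.0);
        return 0;
    }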
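The Hill & Wood break-even calculation is just as quick to reproduce. The sketch below plugs in the slide’s price formulas, taking “1 GB” as 1000 MB so that the uniprocessor comes out at the slide’s $138k, and prints the speedup at which the MP wins on cost-performance:

    #include <stdio.h>

    int main(void) {
        double mb  = 1000.0;                    /* 1 GB, as on the slide */
        double uni = 38400.0 + 100.0 * mb;      /* uniprocessor: $138k   */
        int procs[] = { 8, 16 };
        for (int i = 0; i < 2; i++) {
            int p = procs[i];
            double mp = 81600.0 + 20000.0 * p + 100.0 * mb;
            /* The MP wins on cost-performance once its speedup over the
             * uniprocessor exceeds the cost ratio mp / uni. */
            printf("%2d proc: MP = $%.1fk, break-even speedup %.1fX (%.0f%% of linear)\n",
                   p, mp / 1000.0, mp / uni, 100.0 * (mp / uni) / p);
        }
        return 0;
    }

Speedups far below linear already make the larger machine pay for itself, which is the slide’s point.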

