Parallelization
Saman Amarasinghe
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology
Spring 2010

Outline
• Why Parallelism
• Parallel Execution
• Parallelizing Compilers
• Dependence Analysis
• Increasing Parallelization Opportunities

Moore's Law
[Figure: number of transistors and performance (vs. VAX-11/780) for processors from the 8086 and 286 through the Pentium, P2, P3, P4, Itanium, and Itanium 2, plotted from 1978 to 2016. From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006; from David Patterson.]

Uniprocessor Performance (SPECint)
[Figure: the same plot, emphasizing uniprocessor performance relative to the VAX-11/780: roughly 25%/year growth early on, 52%/year through the 1990s, and an uncertain ??%/year since. Same sources as above.]

Multicores Are Here!
[Figure: number of cores per chip versus year of introduction, 1970 to 20??. Single-core parts (4004, 8008, 8080, 8086, 286, 386, 486, Pentium, P2, P3, P4, Athlon, Itanium, Itanium 2) give way to 2-core chips (Power4, PA-8800, Opteron, Tanglewood, PExtreme, Power6, Yonah), 4 cores (Broadcom 1480, Opteron 4P, Xeon MP, Xbox360), 8 (Niagara, Cell), 16 (Raw), 32 (Raza XLR, Cavium Octeon), 128 (Intel Tflops), 256 (Cisco CSR-1), and 512 (Picochip PC102, Ambric AM2045).]

Issues with Parallelism
• Amdahl's Law
  – Any computation can be analyzed in terms of a portion that must be executed sequentially, Ts, and a portion that can be executed in parallel, Tp. Then for n processors:
  – T(n) = Ts + Tp/n
  – T(∞) = Ts, so the maximum speedup is (Ts + Tp)/Ts (a worked example follows this slide)
• Load Balancing
  – The work is distributed among processors so that all processors are kept busy while the parallel task executes.
• Granularity
  – The size of the parallel regions between synchronizations, or the ratio of computation (useful work) to communication (overhead).
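A quick worked example of Amdahl's Law (the numbers are illustrative, not from the slides): with Ts = 1 and Tp = 9 time units, T(4) = 1 + 9/4 = 3.25, a speedup of about 3.1x, and even with infinitely many processors the speedup cannot exceed (1 + 9)/1 = 10x. A minimal C sketch of the same calculation:

    #include <stdio.h>

    /* Amdahl's Law: T(n) = Ts + Tp/n; speedup(n) = (Ts + Tp) / T(n). */
    static double amdahl_time(double ts, double tp, int n) {
        return ts + tp / n;
    }

    int main(void) {
        double ts = 1.0, tp = 9.0;   /* illustrative numbers, not from the slides */
        for (int n = 1; n <= 1024; n *= 4) {
            double t = amdahl_time(ts, tp, n);
            printf("n = %4d: T(n) = %6.3f  speedup = %5.2f\n", n, t, (ts + tp) / t);
        }
        /* As n grows without bound, T(n) approaches Ts,
         * so the maximum speedup is (Ts + Tp)/Ts = 10 here. */
        return 0;
    }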
Types of Parallelism
• Instruction Level Parallelism (ILP) → Scheduling and Hardware
• Task Level Parallelism (TLP) → Mainly by hand
• Loop Level Parallelism (LLP) or Data Parallelism → Hand or Compiler Generated
• Pipeline Parallelism → Hardware or Streaming
• Divide and Conquer Parallelism → Recursive functions

Why Loops?
• 90% of the execution time is spent in 10% of the code
  – Mostly in loops
• If parallel, can get good performance
  – Load balancing
• Relatively easy to analyze

Programmer Defined Parallel Loop
• FORALL
  – No "loop carried dependences"
  – Fully parallel
• FORACROSS
  – Some "loop carried dependences"

Parallel Execution
• Example
    FORPAR I = 0 to N
      A[I] = A[I] + 1
• Block Distribution: Program gets mapped into
    Iters = ceiling(N/NUMPROC);
    FOR P = 0 to NUMPROC-1
      FOR I = P*Iters to MIN((P+1)*Iters, N)
        A[I] = A[I] + 1
• SPMD (Single Program, Multiple Data) Code
    If(myPid == 0) {
      ...
      Iters = ceiling(N/NUMPROC);
    }
    Barrier();
    FOR I = myPid*Iters to MIN((myPid+1)*Iters, N)
      A[I] = A[I] + 1
    Barrier();

Parallel Execution
• Same example and block distribution as above
• Code: fork a function
    Iters = ceiling(N/NUMPROC);
    ParallelExecute(func1);
    ...
    void func1(integer myPid)
    {
      FOR I = myPid*Iters to MIN((myPid+1)*Iters, N)
        A[I] = A[I] + 1
    }

Parallel Execution
• SPMD
  – Need to get all the processors to execute the control flow
    • Extra synchronization overhead, or redundant computation on all processors, or both
  – Stack: Private or Shared?
• Fork (see the C/pthreads sketch below)
  – Local variables are not visible within the forked function
    • Either make the variables used/defined in the loop body global, or pass and return them as arguments
    • Function call overhead
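The fork-a-function pattern above maps naturally onto OS threads. Below is a minimal C/pthreads sketch; the names func1 and Iters follow the slides' pseudocode, while the array size, thread count, and use of pthread_join in place of an explicit barrier are illustrative assumptions. It block-distributes the A[I] = A[I] + 1 loop across worker threads:

    #include <pthread.h>
    #include <stdio.h>

    #define N        1000
    #define NUMPROC  4

    static int A[N];
    static int Iters;                 /* block size: ceiling(N/NUMPROC) */

    /* Each worker increments its own contiguous block of A
     * (half-open range [lo, hi), so blocks never overlap). */
    static void *func1(void *arg) {
        int myPid = (int)(long)arg;
        int lo = myPid * Iters;
        int hi = (myPid + 1) * Iters;
        if (hi > N) hi = N;
        for (int i = lo; i < hi; i++)
            A[i] = A[i] + 1;
        return NULL;
    }

    int main(void) {
        pthread_t thr[NUMPROC];
        Iters = (N + NUMPROC - 1) / NUMPROC;          /* ceiling(N/NUMPROC) */
        for (long p = 0; p < NUMPROC; p++)            /* "ParallelExecute(func1)" */
            pthread_create(&thr[p], NULL, func1, (void *)p);
        for (int p = 0; p < NUMPROC; p++)             /* join stands in for the barrier */
            pthread_join(thr[p], NULL);
        printf("A[0] = %d  A[N-1] = %d\n", A[0], A[N - 1]);
        return 0;
    }

Compile with cc -pthread; a production version would also check the return values of pthread_create and pthread_join.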
Parallel Thread Basics
• Create separate threads
  – Create an OS thread
    • (hopefully) it will be run on a separate core
  – pthread_create(&thr, NULL, &entry_point, NULL)
  – Overhead in thread creation
    • Create a separate stack
    • Get the OS to allocate a thread
• Thread pool
  – Create all the threads (= number of cores) at the beginning
  – Keep N-1 of them idling on a barrier while execution is sequential
  – Get them to run parallel code by each executing a function
  – Back to the barrier when the parallel region is done

Parallelizing Compilers
• Finding FORALL loops out of FOR loops
• Examples
    FOR I = 0 to 5
      A[I] = A[I] + 1

    FOR I = 0 to 5
      A[I] = A[I+6] + 1

    FOR I = 0 to 5
      A[2*I] = A[2*I+1] + 1

Iteration Space
• N deep loops → n-dimensional discrete cartesian space
  – Normalized loops: assume step size = 1
    FOR I = 0 to 6
      FOR J = I to 7
  [Figure: the iteration space of this loop nest, the points (I, J) with 0 ≤ I ≤ 6 and I ≤ J ≤ 7, drawn on a grid with I on one axis and J on the other]
• Iterations are represented as coordinates in iteration space
  – ī = [i1, i2, i3, …, in]
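To make the iteration-space picture concrete, here is a small C sketch (illustrative, not from the slides) that enumerates the coordinates of the example loop nest; the printed points form the triangular region shown on the slide:

    #include <stdio.h>

    /* Enumerate the iteration space of
     *     FOR I = 0 to 6
     *       FOR J = I to 7
     * Each iteration is one coordinate [I, J] in a 2-dimensional space;
     * the region is triangular because J's lower bound depends on I. */
    int main(void) {
        for (int i = 0; i <= 6; i++) {
            for (int j = i; j <= 7; j++)
                printf("[%d, %d] ", i, j);
            printf("\n");
        }
        return 0;
    }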

