Spring 2010
Parallelization
Saman Amarasinghe
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology

Outline
• Why Parallelism
• Parallel Execution
• Parallelizing Compilers
• Dependence Analysis
• Increasing Parallelization Opportunities

Moore's Law
[Chart: number of transistors per chip vs. year, 1978–2016, log scale, from the 8086 and 286 (~100,000 transistors) through the 386, 486, Pentium, P2, P3, and P4 up to the Itanium and Itanium 2 (~1,000,000,000 transistors).]
From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006
From David Patterson
Saman Amarasinghe, 6.035, ©MIT Fall 2006

Uniprocessor Performance (SPECint)
[Chart: performance (vs. VAX-11/780) over the same processors and years; growth of 25%/year in the early years, 52%/year in the middle years, and ??%/year recently.]
From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006
From David Patterson

Multicores Are Here!
[Chart: number of cores (1 to 512, log scale) vs. year, 1970–20??. Single-core processors (4004, 8008, 8080, 8086, 286, 386, 486, Pentium, P2, P3, P4, Athlon, Itanium, Itanium 2) run along the bottom; multicore designs climb upward: Power4, PA-8800, Opteron, Tanglewood, Power6, Yonah, Xeon MP, Xbox360, PExtreme, Opteron 4P, Broadcom 1480, Niagara, Cell, Raw, Raza XLR, Cavium Octeon, Intel Tflops, Cisco CSR-1, Picochip PC102, Ambric AM2045.]

Issues with Parallelism
• Amdahl's Law
  – Any computation can be analyzed in terms of a portion that must be executed sequentially, Ts, and a portion that can be executed in parallel, Tp.
  – Then for n processors:
      T(n) = Ts + Tp/n
      T(∞) = Ts, thus maximum speedup = (Ts + Tp)/Ts
• Load Balancing
  – The work is distributed among processors so that all processors are kept busy while the parallel task executes.
• Granularity
  – The size of the parallel regions between synchronizations, or the ratio of computation (useful work) to communication (overhead).

Outline
• Why Parallelism
• Parallel Execution
• Parallelizing Compilers
• Dependence Analysis
• Increasing Parallelization Opportunities

Types of Parallelism
• Instruction Level Parallelism (ILP): scheduling and hardware
• Task Level Parallelism (TLP): mainly by hand
• Loop Level Parallelism (LLP) or Data Parallelism: hand or compiler generated
• Pipeline Parallelism: hardware or streaming
• Divide and Conquer Parallelism: recursive functions

Why Loops?
• 90% of the execution time is in 10% of the code
  – Mostly in loops
• If parallel, can get good performance
  – Load balancing
• Relatively easy to analyze

Programmer Defined Parallel Loop
• FORALL
  – No "loop carried dependences"
  – Fully parallel
• FORACROSS
  – Some "loop carried dependences"

Parallel Execution
• Example
    FORPAR I = 0 to N
      A[I] = A[I] + 1
• Block Distribution: Program gets mapped into
    Iters = ceiling(N/NUMPROC)
    FOR P = 0 to NUMPROC-1
      FOR I = P*Iters to MIN((P+1)*Iters, N)
        A[I] = A[I] + 1
• SPMD (Single Program, Multiple Data) Code
    If(myPid == 0) {
      …
      Iters = ceiling(N/NUMPROC)
    }
    Barrier()
    FOR I = myPid*Iters to MIN((myPid+1)*Iters, N)
      A[I] = A[I] + 1
    Barrier()
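The SPMD block-distribution pattern above can be sketched with real threads. The following is a minimal sketch using Python's threading module as a stand-in for pthread_create; N, NUMPROC, and the array contents are hypothetical, and with CPython's global interpreter lock this illustrates the structure rather than an actual speedup.

```python
import threading

N = 10                   # hypothetical problem size; FORPAR I = 0 to N touches N+1 elements
NUMPROC = 4              # hypothetical number of "processors" (threads)
A = list(range(N + 1))

iters = -(-(N + 1) // NUMPROC)        # ceiling((N+1)/NUMPROC)
barrier = threading.Barrier(NUMPROC)  # all threads meet here after the parallel region

def spmd_body(my_pid):
    # Every thread runs the same code; my_pid selects its block of iterations.
    lo = my_pid * iters
    hi = min((my_pid + 1) * iters, N + 1)
    for i in range(lo, hi):
        A[i] = A[i] + 1
    barrier.wait()

threads = [threading.Thread(target=spmd_body, args=(p,)) for p in range(NUMPROC)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(A)  # each element incremented exactly once
```

The block bounds mirror the slide's MIN((myPid+1)*Iters, N) clamp: the last thread may get a smaller block when NUMPROC does not divide the iteration count evenly.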
Parallel Execution
• Example
    FORPAR I = 0 to N
      A[I] = A[I] + 1
• Code that forks a function
    Iters = ceiling(N/NUMPROC)
    ParallelExecute(func1)
    …
    void func1(integer myPid)
    {
      FOR I = myPid*Iters to MIN((myPid+1)*Iters, N)
        A[I] = A[I] + 1
    }

Parallel Execution
• SPMD
  – Need to get all the processors to execute the control flow
    • Extra synchronization overhead, or redundant computation on all processors, or both
  – Stack: private or shared?
• Fork
  – Local variables not visible within the function
    • Either make the variables used/defined in the loop body global, or pass and return them as arguments
    • Function call overhead

Parallel Thread Basics
• Create separate threads
  – Create an OS thread
    • (hopefully) it will be run on a separate core
  – pthread_create(&thr, NULL, &entry_point, NULL)
  – Overhead in thread creation
    • Create a separate stack
    • Get the OS to allocate a thread
• Thread pool
  – Create all the threads (= num cores) at the beginning
  – Keep N-1 idling on a barrier while execution is sequential
  – Get them to run parallel code by each executing a function
  – Back to the barrier when the parallel region is done

Outline
• Why Parallelism
• Parallel Execution
• Parallelizing Compilers
• Dependence Analysis
• Increasing Parallelization Opportunities

Parallelizing Compilers
• Finding FORALL loops out of FOR loops
• Examples
    FOR I = 0 to 5
      A[I] = A[I] + 1

    FOR I = 0 to 5
      A[I] = A[I+6] + 1

    FOR I = 0 to 5
      A[2*I] = A[2*I+1] + 1
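A quick empirical check on the three example loops: if running the iterations in reverse order produces the same result as the forward order, no loop-carried dependence is observable for that data. This is only a heuristic sketch over a hypothetical 12-element array; a parallelizing compiler proves independence with dependence analysis rather than by testing.

```python
def run(body, n, order):
    A = list(range(12))  # hypothetical data, large enough for all three loops
    for i in order(n):
        body(A, i)
    return A

loops = {
    "A[I] = A[I] + 1":       lambda A, i: A.__setitem__(i, A[i] + 1),
    "A[I] = A[I+6] + 1":     lambda A, i: A.__setitem__(i, A[i + 6] + 1),
    "A[2*I] = A[2*I+1] + 1": lambda A, i: A.__setitem__(2 * i, A[2 * i + 1] + 1),
}

for name, body in loops.items():
    same = run(body, 6, range) == run(body, 6, lambda n: reversed(range(n)))
    print(f"{name:24s} order-independent: {same}")
```

All three loops turn out to be order-independent: the first writes a distinct element per iteration, the second writes A[0..5] while reading only A[6..11], and the third writes even indices while reading only odd ones, so each is a FORALL candidate.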
Iteration Space
• N deep loops → an n-dimensional discrete cartesian space
  – Normalized loops: assume step size = 1
    FOR I = 0 to 6
      FOR J = I to 7
  [Figure: the (I, J) grid for 0 ≤ I ≤ 6, 0 ≤ J ≤ 7, with the triangular region J ≥ I shaded.]
• Iterations are represented as coordinates in iteration space
  – i = [i1, i2, i3, …, in]
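The triangular iteration space of the example loop nest can be enumerated directly; this small sketch prints the region in the same orientation as the slide's grid.

```python
# Iteration space of:
#   FOR I = 0 to 6
#     FOR J = I to 7
# Each iteration is a coordinate (i, j); the lower bound J = I carves a
# triangular region out of the 7 x 8 rectangle.
points = {(i, j) for i in range(7) for j in range(i, 8)}

for i in range(7):
    print("".join("*" if (i, j) in points else "." for j in range(8)))
print(len(points), "iterations")
```

Counting row by row gives 8 + 7 + 6 + 5 + 4 + 3 + 2 = 35 iterations, each a point [i1, i2] in the two-dimensional iteration space.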