Carnegie Mellon 1. Examples 2. Affine Partitioning: Do-all 3. Affine Partitioning: Pipelining Readings: Chapter 11–11.3, 11.6–11.7.4, 11.9-11.9.6 Lecture 13 Loop Transformations for Parallelism and Locality M. Lam 1 CS243: Loop TransformationsCarnegie Mellon Shared Memory Machines Performance on Shared Address Space Multiprocessors: Parallelism & Locality M. Lam 2 CS243: Loop TransformationsCarnegie Mellon Parallelism and Locality • Parallelism DOES NOT imply speed up! • Parallel performance: Improve locality with loop transformations – Minimize communication – Operations using the same data are executed on the same processor • Sequential performance: Improve locality with loop transformations – Minimize cache misses – Operations using the same data are executed close in time. M. Lam 3 CS243: Loop TransformationsCarnegie Mellon Loop Permutation (Loop Interchange) for J’ = 1 to 3 for I’ = 1 to 4 Z[I’,J’] = Z[I’-1,J’] for I = 1 to 4 for J = 1 to 3 Z[I,J] = Z[I-1,J] I J J’ I’ € j'i'⎡ ⎣ ⎢ ⎤ ⎦ ⎥ =0 11 0⎡ ⎣ ⎢ ⎤ ⎦ ⎥ ij⎡ ⎣ ⎢ ⎤ ⎦ ⎥ M. Lam 4 CS243: Loop TransformationsCarnegie Mellon Loop Fusion for J = 1 to 4 T[J]= A[J]+B[J] (s1) C[J]= T[J] x T[J] (s2) for I = 1 to 4 T[I]= A[I]+B[I] (s1) for I’ = 1 to 4 C[I’]= T[I’] x T[I’] (s2) € j[ ]= 1[ ][i]M. Lam 5 CS243: Loop Transformations I’ I € j[ ]= 1[ ][i']s1:!s2:!JCarnegie Mellon Affine Partitioning: An Contrived but Illustrative Example FOR j = 1 TO n FOR i = 1 TO n A[i,j] = A[i,j]+B[i-1,j]; (S1) B[i,j] = A[i,j-1]*B[i,j]; (S2) i j"S1 S2 M. Lam 6 CS243: Loop TransformationsCarnegie Mellon Best Parallelization Scheme SPMD code: Let p be the processor’s ID number if (1-n <= p <= n) then if [1 <= p) then B[p,1] = A[p,0] * B[p,1]; (S2) for i1 = max[1,1+p) to min[n,n-1+p) do A[i1,i1-p] = A[i1,i1-p] + B[i1-1,i1-p]; (S1) B[i1,i1-p+1] = A[i1,i1-p] * B[i1,i1-p+1]; (S2) if (p <= 0) then A[n+p,n] = A[n+p,N] + B[n+p-1,n]; (S1) Algorithm finds affine partition mappings for each instruction: S1: Execute iteration (i, j) on processor i-j. S2: Execute iteration (i, j) on processor i-j+1. M. Lam 7 CS243: Loop TransformationsCarnegie Mellon 2. Iteration Space FOR i = 0 to 5 FOR j = i to 7 … • n-deep loop nests: n-dimensional polytope • Iterations: coordinates in the iteration space • Assume: iteration index is incremented in the loop • Sequential execution order: lexicographic order – [0,0], [0,1], …, [0,6], [0,7], [1,1], …, [1,6], [1,7], … M. Lam 8 CS243: Loop TransformationsCarnegie Mellon Maximum Parallelism & No Communication For every pair of data dependent accesses F1i1+f1 and F2i2+f2 Find C1, c1, C2, c2: ∀ i1, i2 F1 i1+ f1 = F2 i2+f2 → C1i1+c1 = C2i2+c2 with the objective of maximizing the rank of C1, C2 Loops Array Processor ID F1i1+f1 F2i2+f2 C1i1+c1 C2i2+c2 M. Lam 9 CS243: Loop TransformationsCarnegie Mellon Rank of Partitioning = Degree of Parallelism Affine Mapping Rank 0 1 2 Mapped to same processor M. Lam 10 CS243: Loop TransformationsCarnegie Mellon Example 1: Loop Transform for I = 1 to 4 for J = 1 to 3 Z[I,J] = Z[I-1,J] I J Find affine partitioning: c1, c2, c0! such that!!!!Suppose!itera0on!i,j!&!i’,!j’!refer!to!same!loca0on!!!!!!!!!!!!!!!!!!!!!!!!!!!i!!="i’!9! 1!!!!!!!!!!!!!!!!!!!!!!!!!!!!j!!="j’!No!communica0on!means:!!!!!!!c1!i!+!c2!j!+!c0!=!c1!i’!+!c2!j’!+!c0!!!!!!!c1(i’91)!+c2!j’!+!c0!=!c1!i’!+!c2!j’!+!c0!!!!!!!!!!!!!!!!!!!!!!!!!c1!=!0!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!p!=!c2!j!+!c0!Pick!simplest!c2,!c0:"c2!=!1,!c0=!0!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!p!=!j!!!€ p = c1c2[ ]ij⎡ ⎣ ⎢ ⎤ ⎦ ⎥ + c0M. Lam 11 CS243: Loop TransformationsCarnegie Mellon Code Generation • Naive – Each processor visits all the iterations – Executes only if it owns that iteration • Optimization – Removes unnecessary looping and condition evaluation M. Lam CS243: Loop Transformations 12Carnegie Mellon Code Generation for P = 1 to 3 for I = 1 to 4 for J = 1 to 3 if (j == P) Z[I,J] = Z[I-1,J] for P = 1 to 3 for I = 1 to 4 Z[I,P] = Z[I-1,P] for I = 1 to 4 for J = 1 to 3 Z[I,J] = Z[I-1,J] I J p!=!j!SPMD (single program multiple data) code: for I = 1 to 4 Z[I,P] = Z[I-1,P] M. Lam 13 CS243: Loop TransformationsCarnegie Mellon Loop Permutation (Loop Interchange) for P = 1 to 3 for I = 1 to 4 Z[I,P] = Z[I-1,P] for I = 1 to 4 for J = 1 to 3 Z[I,J] = Z[I-1,J] I J P I € p'i'⎡ ⎣ ⎢ ⎤ ⎦ ⎥ =0 11 0⎡ ⎣ ⎢ ⎤ ⎦ ⎥ ij⎡ ⎣ ⎢ ⎤ ⎦ ⎥ M. Lam 14 CS243: Loop TransformationsCarnegie Mellon Example 2: Loop Fusion for I = 1 to 4 T[I]= A[I]+B[I] (s1) for I’ = 1 to 4 C[I’]= T[I’] x T[I’] (s2) M. Lam 15 CS243: Loop Transformations I’ I € p[ ]= c1,1[ ][i] + c1,0€ p[ ]= c2,1[ ][i'] + c2,0s1:!s2:!Find affine partitioning: c1,1, c1,0, c2.1, c1,0, ! such that!!!!Suppose!itera0on!i!&!i’!refer!to !the!same!loca0on!!!!!!!!!!!!!!!!!!!!!!!!!!!i!!="i’!!!No!communica0on!means:!!!!!!!!!!!!c1,1!i!+!c1,0!=!c2,1!i’!+!c2,0!!!!!!!!!!!!!!!!!!!!!!!!!!!c1,1!=!c2,1"!!!!!!!!!!!!!!!!!!!!!!!!c1,0=!c2,0!Pick!simplest!values:"c1,1!=!c2,1!=!1,!c1,0=!c2,0=!0!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!p!=!i;!p=i’!! !Carnegie Mellon Loop Fusion for P = 1 to 4 for I = 1 to 4 if (I == P) T[I]= A[I]+B[I] (s1) for I’ = 1 to 4 if (I’ == P) C[I’]= T[I’] x T[I’] (s2) for P = 1 to 4 T[P]= A[P]+B[P] (s1) C[P]= T[P] x T[P] (s2) for I = 1 to 4 T[I]= A[I]+B[I] (s1) for I’ = 1 to 4 C[I’]= T[I’] x T[I’] (s2) € p[ ]= 1[ ][i]M. Lam 16 CS243: Loop Transformations I’ I € p[ ]= 1[ ][i']s1:!s2:!JCarnegie Mellon Example 3: 2 Nested, Parallel Loops for I = 1 to 4 for J = 1 to 3 Z[I,J] = Z[I,J]+1 I J Find affine partitioning: c1, c2, c0! such
View Full Document