Stanford CS 243 - CS 243- Lecture 13 Loop Transformations for Parallelism and Locality

Unformatted text preview:

Carnegie Mellon 1. Examples 2. Affine Partitioning: Do-all 3. Affine Partitioning: Pipelining Readings: Chapter 11–11.3, 11.6–11.7.4, 11.9-11.9.6 Lecture 13 Loop Transformations for Parallelism and Locality M. Lam 1 CS243: Loop TransformationsCarnegie Mellon Shared Memory Machines Performance on Shared Address Space Multiprocessors: Parallelism & Locality M. Lam 2 CS243: Loop TransformationsCarnegie Mellon Parallelism and Locality • Parallelism DOES NOT imply speed up! • Parallel performance: Improve locality with loop transformations – Minimize communication – Operations using the same data are executed on the same processor • Sequential performance: Improve locality with loop transformations – Minimize cache misses – Operations using the same data are executed close in time. M. Lam 3 CS243: Loop TransformationsCarnegie Mellon Loop Permutation (Loop Interchange) for J’ = 1 to 3 for I’ = 1 to 4 Z[I’,J’] = Z[I’-1,J’] for I = 1 to 4 for J = 1 to 3 Z[I,J] = Z[I-1,J] I J J’ I’ € j'i'⎡ ⎣ ⎢ ⎤ ⎦ ⎥ =0 11 0⎡ ⎣ ⎢ ⎤ ⎦ ⎥ ij⎡ ⎣ ⎢ ⎤ ⎦ ⎥ M. Lam 4 CS243: Loop TransformationsCarnegie Mellon Loop Fusion for J = 1 to 4 T[J]= A[J]+B[J] (s1) C[J]= T[J] x T[J] (s2) for I = 1 to 4 T[I]= A[I]+B[I] (s1) for I’ = 1 to 4 C[I’]= T[I’] x T[I’] (s2) € j[ ]= 1[ ][i]M. Lam 5 CS243: Loop Transformations I’ I € j[ ]= 1[ ][i']s1:!s2:!JCarnegie Mellon Affine Partitioning: An Contrived but Illustrative Example FOR j = 1 TO n FOR i = 1 TO n A[i,j] = A[i,j]+B[i-1,j]; (S1) B[i,j] = A[i,j-1]*B[i,j]; (S2) i j"S1 S2 M. Lam 6 CS243: Loop TransformationsCarnegie Mellon Best Parallelization Scheme SPMD code: Let p be the processor’s ID number if (1-n <= p <= n) then if [1 <= p) then B[p,1] = A[p,0] * B[p,1]; (S2) for i1 = max[1,1+p) to min[n,n-1+p) do A[i1,i1-p] = A[i1,i1-p] + B[i1-1,i1-p]; (S1) B[i1,i1-p+1] = A[i1,i1-p] * B[i1,i1-p+1]; (S2) if (p <= 0) then A[n+p,n] = A[n+p,N] + B[n+p-1,n]; (S1) Algorithm finds affine partition mappings for each instruction: S1: Execute iteration (i, j) on processor i-j. S2: Execute iteration (i, j) on processor i-j+1. M. Lam 7 CS243: Loop TransformationsCarnegie Mellon 2. Iteration Space FOR i = 0 to 5 FOR j = i to 7 … • n-deep loop nests: n-dimensional polytope • Iterations: coordinates in the iteration space • Assume: iteration index is incremented in the loop • Sequential execution order: lexicographic order – [0,0], [0,1], …, [0,6], [0,7], [1,1], …, [1,6], [1,7], … M. Lam 8 CS243: Loop TransformationsCarnegie Mellon Maximum Parallelism & No Communication For every pair of data dependent accesses F1i1+f1 and F2i2+f2 Find C1, c1, C2, c2: ∀ i1, i2 F1 i1+ f1 = F2 i2+f2 → C1i1+c1 = C2i2+c2 with the objective of maximizing the rank of C1, C2 Loops Array Processor ID F1i1+f1 F2i2+f2 C1i1+c1 C2i2+c2 M. Lam 9 CS243: Loop TransformationsCarnegie Mellon Rank of Partitioning = Degree of Parallelism Affine Mapping Rank 0 1 2 Mapped to same processor M. Lam 10 CS243: Loop TransformationsCarnegie Mellon Example 1: Loop Transform for I = 1 to 4 for J = 1 to 3 Z[I,J] = Z[I-1,J] I J Find affine partitioning: c1, c2, c0! such that!!!!Suppose!itera0on!i,j!&!i’,!j’!refer!to!same!loca0on!!!!!!!!!!!!!!!!!!!!!!!!!!!i!!="i’!9! 1!!!!!!!!!!!!!!!!!!!!!!!!!!!!j!!="j’!No!communica0on!means:!!!!!!!c1!i!+!c2!j!+!c0!=!c1!i’!+!c2!j’!+!c0!!!!!!!c1(i’91)!+c2!j’!+!c0!=!c1!i’!+!c2!j’!+!c0!!!!!!!!!!!!!!!!!!!!!!!!!c1!=!0!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!p!=!c2!j!+!c0!Pick!simplest!c2,!c0:"c2!=!1,!c0=!0!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!p!=!j!!!€ p = c1c2[ ]ij⎡ ⎣ ⎢ ⎤ ⎦ ⎥ + c0M. Lam 11 CS243: Loop TransformationsCarnegie Mellon Code Generation • Naive – Each processor visits all the iterations – Executes only if it owns that iteration • Optimization – Removes unnecessary looping and condition evaluation M. Lam CS243: Loop Transformations 12Carnegie Mellon Code Generation for P = 1 to 3 for I = 1 to 4 for J = 1 to 3 if (j == P) Z[I,J] = Z[I-1,J] for P = 1 to 3 for I = 1 to 4 Z[I,P] = Z[I-1,P] for I = 1 to 4 for J = 1 to 3 Z[I,J] = Z[I-1,J] I J p!=!j!SPMD (single program multiple data) code: for I = 1 to 4 Z[I,P] = Z[I-1,P] M. Lam 13 CS243: Loop TransformationsCarnegie Mellon Loop Permutation (Loop Interchange) for P = 1 to 3 for I = 1 to 4 Z[I,P] = Z[I-1,P] for I = 1 to 4 for J = 1 to 3 Z[I,J] = Z[I-1,J] I J P I € p'i'⎡ ⎣ ⎢ ⎤ ⎦ ⎥ =0 11 0⎡ ⎣ ⎢ ⎤ ⎦ ⎥ ij⎡ ⎣ ⎢ ⎤ ⎦ ⎥ M. Lam 14 CS243: Loop TransformationsCarnegie Mellon Example 2: Loop Fusion for I = 1 to 4 T[I]= A[I]+B[I] (s1) for I’ = 1 to 4 C[I’]= T[I’] x T[I’] (s2) M. Lam 15 CS243: Loop Transformations I’ I € p[ ]= c1,1[ ][i] + c1,0€ p[ ]= c2,1[ ][i'] + c2,0s1:!s2:!Find affine partitioning: c1,1, c1,0, c2.1, c1,0, ! such that!!!!Suppose!itera0on!i!&!i’!refer!to !the!same!loca0on!!!!!!!!!!!!!!!!!!!!!!!!!!!i!!="i’!!!No!communica0on!means:!!!!!!!!!!!!c1,1!i!+!c1,0!=!c2,1!i’!+!c2,0!!!!!!!!!!!!!!!!!!!!!!!!!!!c1,1!=!c2,1"!!!!!!!!!!!!!!!!!!!!!!!!c1,0=!c2,0!Pick!simplest!values:"c1,1!=!c2,1!=!1,!c1,0=!c2,0=!0!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!p!=!i;!p=i’!! !Carnegie Mellon Loop Fusion for P = 1 to 4 for I = 1 to 4 if (I == P) T[I]= A[I]+B[I] (s1) for I’ = 1 to 4 if (I’ == P) C[I’]= T[I’] x T[I’] (s2) for P = 1 to 4 T[P]= A[P]+B[P] (s1) C[P]= T[P] x T[P] (s2) for I = 1 to 4 T[I]= A[I]+B[I] (s1) for I’ = 1 to 4 C[I’]= T[I’] x T[I’] (s2) € p[ ]= 1[ ][i]M. Lam 16 CS243: Loop Transformations I’ I € p[ ]= 1[ ][i']s1:!s2:!JCarnegie Mellon Example 3: 2 Nested, Parallel Loops for I = 1 to 4 for J = 1 to 3 Z[I,J] = Z[I,J]+1 I J Find affine partitioning: c1, c2, c0! such


View Full Document

Stanford CS 243 - CS 243- Lecture 13 Loop Transformations for Parallelism and Locality

Download CS 243- Lecture 13 Loop Transformations for Parallelism and Locality
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view CS 243- Lecture 13 Loop Transformations for Parallelism and Locality and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view CS 243- Lecture 13 Loop Transformations for Parallelism and Locality 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?