1Advanced CompilersM. Lam1.1.Loop Permutation as an ExampleLoop Permutation as an Example2.2.Seven Primitive TransformsSeven Primitive Transforms3.3.Advanced Topic: Pipelining & Advanced Topic: Pipelining & Blocking Blocking Readings: Chapter 11.1 Readings: Chapter 11.1 --11.711.7CS 243CS 243Lecture 13Lecture 13Affine TransformsAffine TransformsAdvanced CompilersL13. Affine PartitioningAffine PartitioningAffine PartitioningLoopsArrayProcessor IDF1i1+f1F2i2+f2C1i1+c1C2i2+c2For every pair of data dependent accesses F1i1+f1and F2i2+f2Find C1, c1, C2, c2: ∀∀∀∀ i1, i2 F1 i1+ f1= F2 i2+f2→ C1i1+c1= C2i2+c2with the objective of maximizing the rank of C1, C22Advanced CompilersL13. Affine Partitioning1. Example: 1. Example: Loop Interchange (Loop Permutation)Loop Interchange (Loop Permutation)=−−01''1001jjii[ ]0''21=−−jjiiCC−+=+01''1001001001jijifor I = 1 to 4for J = 1 to 3Z[I,J] = Z[I-1,J] IJData dependent operations:∀∀∀∀ i, j, i’, j’ such that, Find C1, C2, c for statement in loop[ ] [ ]cjiCCcjiCC +=+''2121[ ]00121=CC=−−01''jjii[][]1021=CCc = 0 [ ]cjip += 10p = jOne solution: Therefore:Advanced CompilersL13. Affine PartitioningCode Generation: Execute each partition in sequential orderCode Generation: Execute each partition in sequential orderMust be correct!Must be correct!for I = 1 to 4for J = 1 to 3Z[I,J] = Z[I-1,J]IJfor P = 1 to 3for I’ = 1 to 4for J’ = 1 to 3if (J’ = P)Z[I’,J’] = Z[I’-1,J’]for P = 1 to 3for I’ = 1 to 4Z[I’,P] = Z[I’-1,P]=jiip0110'3Advanced CompilersL13. Affine PartitioningGeometric InterpretationGeometric Interpretationfor P = 1 to 3for I’ = 1 to 4Z[I’,P] = Z[I’-1,P]for I = 1 to 4for J = 1 to 3Z[I,J] = Z[I-1,J]IJPI’=jiip0110'Advanced CompilersL13. Affine PartitioningAffine Partitioning Algorithm: Affine Partitioning Algorithm: Maximize Degree of Parallelism with No CommunicationMaximize Degree of Parallelism with No CommunicationFind data dependences For each pair of data dependent operationsSet up equations F1i1+f1= F2i2+f2to capture relations of dependent iterationsReduce the number of unknowns – lots of identities.Set up equations for affine partitions C1i1+c1= C2i2+c2Each operation gets its own C, c Work on C by dropping the c’, Rewrite constraints as Ax = 0, where x is the unknown CsNullity of A is the rank CFind the basis vectorsFind the constant c4Advanced CompilersL13. Affine Partitioning2. Primitive Loop Transformations2. Primitive Loop TransformationsKey idea: If you draw the dependence graph, you can eyeball the code and get the solution derived using linear algebraSeven source-level transformationsUnimodular transform: Reversal, permutation, skewingFusion, fission, re-indexing, scalingAffine partitioning can express arbitrary combinations of these seven primitives. Advanced CompilersL13. Affine PartitioningFusionFusionfor (i=1; i<=N; i++)Y[i] = Z[i]; /*s1*/for (j=1; j<=N; j++)X[j] = Y[j]; /*s2*/s1s2s1 : p = is2 : p = jfor (p=1; p<=N; p++) {Y[p] = Z[p]; X[p] = Y[p]; }Source Code Partition Transformed Code5Advanced CompilersL13. Affine PartitioningFissionFissionfor (i=1; i<=N; i++)Y[i] = Z[i]; /*s1*/for (j=1; j<=N; j++)X[j] = Y[j]; /*s2*/s1s2s1 : i = ps2 : j = pfor (p=1; p<=N; p++) {Y[p] = Z[p]; X[p] = Y[p]; }Source Code Partition Transformed CodeAdvanced CompilersL13. Affine PartitioningReRe--indexingindexingfor (i=1; i<=N; i++) {Y[i] = Z[i]; /*s1*/X[i] = Y[i-1]; /*s2*/}s1s2s1 : p = is2 : p = i – 1if (N>=1) X[1]=Y[0];for (p=1; p<=N; p++) {Y[p] = Z[p]; X[p+1] = Y[p]; }if (N>=1) Y[N]=Z[N];Source Code Partition Transformed Codes1s26Advanced CompilersL13. Affine PartitioningScalingScalingfor (i=1; i<=N; i++)Y[2*i] = Z[2*i]; /*s1*/for (j=1; j<=2*N; j++)X[j] = Y[j]; /*s2*/s1s2s1 : p = 2 × is2 : p = jfor (p=1; p<=2*N; p++){if (p mod 2 == 0)Y[p] = Z[p]; X[p] = Y[p]; }if (N>=1) Y[N]=Z[N];Source Code Partition Transformed Codes1s2Advanced CompilersL13. Affine PartitioningReversalReversalfor (i=0; i<=N; i++)Y[N-i] = Z[i]; /*s1*/for (j=0; j<=N; j++)X[j] = Y[j]; /*s2*/s1s2s1 : p = is2 : p = jfor (p=0; p<=N; p++) {Y[p] = Z[N-p]; X[p] = Y[p]; }Source Code Partition Transformed Code7Advanced CompilersL13. Affine PartitioningPermutationPermutationfor (i=1; i<=N; i++)for (j=0; j<=M; j++)Z[i,j] = Z[i-1,j];Source Code Partition Transformed Code=jiqp0110for (p=0; p<=M; p++)for (q=1; q<=N; q++)Z[q,p] = Z[q-1,p];Advanced CompilersL13. Affine PartitioningSkewingSkewingfor (i=1; i<=N+M-1; i++)for (j=max(1,i+N); j<=min(i,M); j++)Z[i,j] = Z[i-1,j-1];Source Code Partition Transformed Code+−=101011jiqpfor (p=1; p<=N; p++)for (q=1; q<=M; q++)Z[p,q-p] = Z[p-1,q-p-1];8Advanced CompilersL13. Affine Partitioning3. Advanced topic: Pipelining3. Advanced topic: PipeliningSOR (Successive OverSOR (Successive Over--Relaxation): An ExampleRelaxation): An Examplefor i = 0 TO m for j = 0 to nX[j+1] = 1/3 * (X[j] + X[j+1] + X[j+2])jiAdvanced CompilersL13. Affine PartitioningFinding the Maximum Degree of PipeliningFinding the Maximum Degree of PipeliningFor every pair of data dependent accesses F1i1+f1and F2i2+f2Let B1i1+b1≥ 0, B2i2+b2≥ 0 be the corresponding loop bound constraints,Find C1, c1, C2, c2: ∀∀∀∀ i1, i2 B1i1+ b1≥ 0, B2i2+ b2≥ 0(i1 ≤≤≤≤ i2 ) ∧∧∧∧ (F1 i1+ f1= F2 i2+f2) → C1i1+c1≤≤≤≤ C2i2+c2with the objective of maximizing the rank of C1, C2LoopsArrayTime StageF1i1+f1F2i2+f2C1i1+c1C2i2+c2i1 ≤≤≤≤ i29Advanced CompilersL13. Affine PartitioningKey InsightKey InsightChoice in time mapping => (pipelined) parallelismRank(C) – 1 degree of parallelism with 1 degree of synchronizationCan create blocks with Rank(C) dimensionsFind time partitions is not as straightforward as space partitionsNeed to deal with linear inequalitiesSolved using Farkas Lemma – no simple intuitive proofAdvanced CompilersL13. Affine
View Full Document