DOC PREVIEW
Stanford CS 243 - Lecture Notes

This preview shows page 1-2-3 out of 9 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 9 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 9 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 9 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 9 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

1Advanced CompilersM. Lam1.1.Loop Permutation as an ExampleLoop Permutation as an Example2.2.Seven Primitive TransformsSeven Primitive Transforms3.3.Advanced Topic: Pipelining & Advanced Topic: Pipelining & Blocking Blocking Readings: Chapter 11.1 Readings: Chapter 11.1 --11.711.7CS 243CS 243Lecture 13Lecture 13Affine TransformsAffine TransformsAdvanced CompilersL13. Affine PartitioningAffine PartitioningAffine PartitioningLoopsArrayProcessor IDF1i1+f1F2i2+f2C1i1+c1C2i2+c2For every pair of data dependent accesses F1i1+f1and F2i2+f2Find C1, c1, C2, c2: ∀∀∀∀ i1, i2 F1 i1+ f1= F2 i2+f2→ C1i1+c1= C2i2+c2with the objective of maximizing the rank of C1, C22Advanced CompilersL13. Affine Partitioning1. Example: 1. Example: Loop Interchange (Loop Permutation)Loop Interchange (Loop Permutation)=−−01''1001jjii[ ]0''21=−−jjiiCC−+=+01''1001001001jijifor I = 1 to 4for J = 1 to 3Z[I,J] = Z[I-1,J] IJData dependent operations:∀∀∀∀ i, j, i’, j’ such that, Find C1, C2, c for statement in loop[ ] [ ]cjiCCcjiCC +=+''2121[ ]00121=CC=−−01''jjii[][]1021=CCc = 0 [ ]cjip += 10p = jOne solution: Therefore:Advanced CompilersL13. Affine PartitioningCode Generation: Execute each partition in sequential orderCode Generation: Execute each partition in sequential orderMust be correct!Must be correct!for I = 1 to 4for J = 1 to 3Z[I,J] = Z[I-1,J]IJfor P = 1 to 3for I’ = 1 to 4for J’ = 1 to 3if (J’ = P)Z[I’,J’] = Z[I’-1,J’]for P = 1 to 3for I’ = 1 to 4Z[I’,P] = Z[I’-1,P]=jiip0110'3Advanced CompilersL13. Affine PartitioningGeometric InterpretationGeometric Interpretationfor P = 1 to 3for I’ = 1 to 4Z[I’,P] = Z[I’-1,P]for I = 1 to 4for J = 1 to 3Z[I,J] = Z[I-1,J]IJPI’=jiip0110'Advanced CompilersL13. Affine PartitioningAffine Partitioning Algorithm: Affine Partitioning Algorithm: Maximize Degree of Parallelism with No CommunicationMaximize Degree of Parallelism with No CommunicationFind data dependences For each pair of data dependent operationsSet up equations F1i1+f1= F2i2+f2to capture relations of dependent iterationsReduce the number of unknowns – lots of identities.Set up equations for affine partitions C1i1+c1= C2i2+c2Each operation gets its own C, c Work on C by dropping the c’, Rewrite constraints as Ax = 0, where x is the unknown CsNullity of A is the rank CFind the basis vectorsFind the constant c4Advanced CompilersL13. Affine Partitioning2. Primitive Loop Transformations2. Primitive Loop TransformationsKey idea: If you draw the dependence graph, you can eyeball the code and get the solution derived using linear algebraSeven source-level transformationsUnimodular transform: Reversal, permutation, skewingFusion, fission, re-indexing, scalingAffine partitioning can express arbitrary combinations of these seven primitives. Advanced CompilersL13. Affine PartitioningFusionFusionfor (i=1; i<=N; i++)Y[i] = Z[i]; /*s1*/for (j=1; j<=N; j++)X[j] = Y[j]; /*s2*/s1s2s1 : p = is2 : p = jfor (p=1; p<=N; p++) {Y[p] = Z[p]; X[p] = Y[p]; }Source Code Partition Transformed Code5Advanced CompilersL13. Affine PartitioningFissionFissionfor (i=1; i<=N; i++)Y[i] = Z[i]; /*s1*/for (j=1; j<=N; j++)X[j] = Y[j]; /*s2*/s1s2s1 : i = ps2 : j = pfor (p=1; p<=N; p++) {Y[p] = Z[p]; X[p] = Y[p]; }Source Code Partition Transformed CodeAdvanced CompilersL13. Affine PartitioningReRe--indexingindexingfor (i=1; i<=N; i++) {Y[i] = Z[i]; /*s1*/X[i] = Y[i-1]; /*s2*/}s1s2s1 : p = is2 : p = i – 1if (N>=1) X[1]=Y[0];for (p=1; p<=N; p++) {Y[p] = Z[p]; X[p+1] = Y[p]; }if (N>=1) Y[N]=Z[N];Source Code Partition Transformed Codes1s26Advanced CompilersL13. Affine PartitioningScalingScalingfor (i=1; i<=N; i++)Y[2*i] = Z[2*i]; /*s1*/for (j=1; j<=2*N; j++)X[j] = Y[j]; /*s2*/s1s2s1 : p = 2 × is2 : p = jfor (p=1; p<=2*N; p++){if (p mod 2 == 0)Y[p] = Z[p]; X[p] = Y[p]; }if (N>=1) Y[N]=Z[N];Source Code Partition Transformed Codes1s2Advanced CompilersL13. Affine PartitioningReversalReversalfor (i=0; i<=N; i++)Y[N-i] = Z[i]; /*s1*/for (j=0; j<=N; j++)X[j] = Y[j]; /*s2*/s1s2s1 : p = is2 : p = jfor (p=0; p<=N; p++) {Y[p] = Z[N-p]; X[p] = Y[p]; }Source Code Partition Transformed Code7Advanced CompilersL13. Affine PartitioningPermutationPermutationfor (i=1; i<=N; i++)for (j=0; j<=M; j++)Z[i,j] = Z[i-1,j];Source Code Partition Transformed Code=jiqp0110for (p=0; p<=M; p++)for (q=1; q<=N; q++)Z[q,p] = Z[q-1,p];Advanced CompilersL13. Affine PartitioningSkewingSkewingfor (i=1; i<=N+M-1; i++)for (j=max(1,i+N); j<=min(i,M); j++)Z[i,j] = Z[i-1,j-1];Source Code Partition Transformed Code+−=101011jiqpfor (p=1; p<=N; p++)for (q=1; q<=M; q++)Z[p,q-p] = Z[p-1,q-p-1];8Advanced CompilersL13. Affine Partitioning3. Advanced topic: Pipelining3. Advanced topic: PipeliningSOR (Successive OverSOR (Successive Over--Relaxation): An ExampleRelaxation): An Examplefor i = 0 TO m for j = 0 to nX[j+1] = 1/3 * (X[j] + X[j+1] + X[j+2])jiAdvanced CompilersL13. Affine PartitioningFinding the Maximum Degree of PipeliningFinding the Maximum Degree of PipeliningFor every pair of data dependent accesses F1i1+f1and F2i2+f2Let B1i1+b1≥ 0, B2i2+b2≥ 0 be the corresponding loop bound constraints,Find C1, c1, C2, c2: ∀∀∀∀ i1, i2 B1i1+ b1≥ 0, B2i2+ b2≥ 0(i1 ≤≤≤≤ i2 ) ∧∧∧∧ (F1 i1+ f1= F2 i2+f2) → C1i1+c1≤≤≤≤ C2i2+c2with the objective of maximizing the rank of C1, C2LoopsArrayTime StageF1i1+f1F2i2+f2C1i1+c1C2i2+c2i1 ≤≤≤≤ i29Advanced CompilersL13. Affine PartitioningKey InsightKey InsightChoice in time mapping => (pipelined) parallelismRank(C) – 1 degree of parallelism with 1 degree of synchronizationCan create blocks with Rank(C) dimensionsFind time partitions is not as straightforward as space partitionsNeed to deal with linear inequalitiesSolved using Farkas Lemma – no simple intuitive proofAdvanced CompilersL13. Affine


View Full Document

Stanford CS 243 - Lecture Notes

Download Lecture Notes
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Lecture Notes and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Lecture Notes 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?