15-745Optimizing For Data Locality - 1Seth Copen [email protected]@cs.cmu.EduCMUBased on “A Data Locality Optimizing Algorithm, Wolf & Lam PLDI ‘9115-745 © 2005-9 Seth Copen Goldstein 1Wolf & Lam, PLDI 91Outline• The Problem•Loop Transformations•Loop Transformations– dependence vectorsTr nsf rm ti ns–Transformations– Unimodular transformationsLli Ali•Locality Analysis•SRP15-745© 2005-9 Seth Copen Goldstein2The Issue• Improve cache reuse in nested loops•Canonical simple case: Matrix Multiply•Canonical simple case: Matrix Multiplyfor I1:= 1 to nfor I2:= 1 to nf1tfor I3:=1 to nC[I1,I3] += A[I1,I2] * B[I2,I3]I3I3I2+1I3I2I3I2+1=I1I215-745© 2005-9 Seth Copen Goldstein3In next iteration of I2previous data that could be reused has been replaced in cache.Tiling solves problemfor I1:= 1 to nfI1tfor I2:=1 to nfor I3:= 1 to nC[I1,I3]+=A[I1,I2] * B[I2,I3]for II:=1tonbysfor II2:=1 to n by sfor II3:= 1 to n by sfor I1:= 1 to nfor I2:=II2to min(II2+s-1,n)o2:2to (2s,)for I3:= II3to min(II3+ s - 1,n)C[I1,I3] += A[I1,I2] * B[I2,I3];I3I2I2+1 I3=II215-745© 2005-9 Seth Copen Goldstein4I1The Problem• How to increase locality by transforming loop nest• Matrix Mult is simple as it is both– legal to tile– advantageous to tile• Can we determine the benefit?(reuse vector space and locality vector space)• Is it legal (and if so, how) to transform loop?(i dl f i )(unimodular transformations)15-745© 2005-9 Seth Copen Goldstein5Handy Representation: “Iteration Space”Iteration Space01ifor i = 0 to N-1for j = 0 to N-1A[i][j] = B[j][i];•each position represents an iterationj•each position represents an iteration15-745 6© 2005-9 Seth Copen GoldsteinVisitation Order in Iteration SpaceSpace01ifor i = 0 to N-1for j = 0 to N-1A[i][j] = B[j][i];•Note: iteration space is not data spacejNote: iteration space is not data space15-745 7© 2005-9 Seth Copen GoldsteinWhen Do Cache Misses Occur?for i = 0 to N-1for j= 0 to N-1jA[i][j] = B[j][i];ABi ij j15-745 8© 2005-9 Seth Copen GoldsteinWhen Do Cache Misses Occur?for i = 0 to N-1for j= 0 to N-1jA[i][j] = B[j][i];ABHitMissi ij j15-745 9© 2005-9 Seth Copen GoldsteinWhen Do Cache Misses Occur?ifor i = 0 to N-1for j = 0 to N-1A[i+j][0] = i*j;jj15-745 10© 2005-9 Seth Copen GoldsteinWhen Do Cache Misses Occur?iHitMissfor i = 0 to N-1for j = 0 to N-1A[i+j][0] = i*j;jj15-745 11© 2005-9 Seth Copen GoldsteinOptimizing the Cache Behavior of Array Accessesof Array Accesses• We need to answer the following questions:– when do cache misses occur?•use “locality analysis”– can we change the order of the iterations ( l d l ) d (or possibly data layout) to produce better behavior?l t th t f i lt ti•evaluate the cost of various alternatives– does the new ordering/layout still produce correct results?correct results?•use “dependence analysis”15-745 12© 2005-9 Seth Copen GoldsteinExamples of Loop TransformationsTransformations• Loop InterchangeCan improve locality• Cache Blocking•SkewingCan improve localitySkewing• Loop ReversalCan enable above•…(we will briefly discuss the first two)15-745 13© 2005-9 Seth Copen GoldsteinLoop Interchangefori=0toN-1forj=0toN-1for i= 0 to N-1for j = 0 to N-1A[j][i] = i*j;for j= 0 to N-1for i = 0 to N-1A[j][i] = i*j;i jHitMiss•(assuming N is large relative to cache size)j i(assuming N is large relative to cache size)15-745 14© 2005-9 Seth Copen GoldsteinImpact on Visitation Order in Iteration Spacein Iteration Spacefor i = 0 to N-1for JJ = 0 to N-1 by Bfori=0toN-1for j = 0 to N-1f(A[i],A[j]);for i 0 to N1for j = JJ to max(N-1,JJ+B-1) f(A[i],A[j]);iijj15-745 15© 2005-9 Seth Copen GoldsteinCache Blocking (aka “Tiling”)for i = 0 to N-1fj0t N1for JJ = 0 to N-1 by Bfor i = 0 to N-1for j= 0 to N-1f(A[i],A[j]);for j = JJ to max(N-1,JJ+B-1)f(A[i],A[j]);iiA[i] A[j]iiA[i] A[j]now we can exploit localityjjjjnow we can exploit locality15-745 16© 2005-9 Seth Copen GoldsteinCache Blocking (aka “Tiling”)for i = 0 to N-1fj0t N1for JJ = 0 to N-1 by Bfor i = 0 to N-1for j= 0 to N-1f(A[i],A[j]);for j = JJ to max(N-1,JJ+B-1)f(A[i],A[j]);iiA[i] A[j]iiA[i] A[j]now we can exploit temporal localityjjjjnow we can exploit temporal locality15-745 17© 2005-9 Seth Copen GoldsteinCache Blocking in Two DimensionsDimensionsfor i = 0 to N-1for j = 0 to N-1for JJ = 0 to N-1 by Bfor KK = 0 to N-1 by Bfor i = 0 to N-1fjJJt(N1JJ B1)for k = 0 to N-1c[i,k] += a[i,j]*b[j,k];for j = JJto max(N-1,JJ+B-1)for k = KK to max(N-1,KK+B-1)c[i,k] += a[i,j]*b[j,k];•brings square sub-blocks of matrix “b” into gqthe cache•completely uses them up before moving on15-745 18© 2005-9 Seth Copen GoldsteinPredicting Cache Behavior through “Locality Analysis”through Locality Analysis• Definitions:R–Reuse:accessing a location that has been accessed in the pastp–Locality:accessing a location that is now found in the cache• Key Insights– Locality only occurs when there is reuse!BUT d t il lt i l lit–BUT, reuse does not necessarily result in locality.–Why not?15-745 19© 2005-9 Seth Copen GoldsteinSteps in Locality Analysis1. Find data reuse– if caches were infinitely large, we would be finished2. Determine “localized iteration space”– set of inner loops where the data accessed pby an iteration is expected to fit within the cache3. Find data locality:– reuse ⊇ localized iteration space ⊇ localitypy15-745 20© 2005-9 Seth Copen GoldsteinTypes of Data Reuse/Localityfori=0to2for i = 0 to 2for j = 0 to 100A[i][j] = B[j][0] + B[j+1][0];HitMissiA[i][j]iB[j][0]iB[j+1][0]jjjSpatial Temporal Group(temporal)15-745 21© 2005-9 Seth Copen GoldsteinKinds of reuse and the factorfor i = 0 to N-1forj=0toN-1What kinds of reuse are there?A[i]?for j 0 to N1f(A[i],A[j]);A[i]?A[j]?15-745© 2005-9 Seth Copen Goldstein22Kinds of reuse and the factorfor I1:= 0 to 5for I2:= 0 to 62A[I2+ 1] = 1/3 * (A[I2]+ A[I2+1]+A[I2+ 2])15-745© 2005-9 Seth Copen Goldstein23Kinds of reuse and the factorfor I1:= 0 to 5for I2:= 0 to 62A[I2+ 1] = 1/3 * (A[I2]+ A[I2+1]+A[I2+ 2])self-temporal in 1, self-spatial in 2Also, group spatial in 2What is different about this and previous?for i = 0 to N-1for j = 0 to N-1f(A[i],A[j]);15-745© 2005-9 Seth Copen Goldstein24Uniformly Generated references• f and g are indexing functions: ZnÆ Zd–n is depth of loop nestn is depth of loop nest– d is dimensions of array, A•Two references A[f(i)] and A[g(i)] are •Two references A[f(i)] and A[g(i)] are
View Full Document