COMP 206: Computer Architecture and ImplementationOutline7. Reduce Misses by Compiler Optzns.Merging Arrays ExampleLoop Interchange ExampleLoop Fusion ExampleBlocking ExampleBlocking Example (contd.)Slide 9Slide 10Summary: Compiler Optimizations to Reduce Cache MissesReducing Miss Penalty2. Fetching Subblocks to Reduce Miss Penalty3. Early Restart and Critical Word First4. Non-blocking CachesValue of Hit Under Miss for SPEC5. Miss Penalty Reduction: L2 CacheReducing Misses: Which Apply to L2 Cache?L2 cache block size & A.M.A.T.Reducing Miss Penalty SummaryReview: Improving Cache Performance1COMP 206:COMP 206:Computer Architecture and Computer Architecture and ImplementationImplementationMontek SinghMontek SinghWed., Nov. 10, 2004Wed., Nov. 10, 2004Topic: Topic: Caches (contd.)Caches (contd.)2OutlineOutlineImproving Cache PerformanceImproving Cache PerformanceReducing misses -- contd. from previous lectureReducing misses -- contd. from previous lectureReducing miss penaltyReducing miss penaltyReading: HP3 Sections 5.3-5.7Reading: HP3 Sections 5.3-5.737. Reduce Misses by Compiler 7. Reduce Misses by Compiler Optzns.Optzns.InstructionsInstructionsReorder procedures in memory so as to reduce missesReorder procedures in memory so as to reduce missesProfiling to look at conflictsProfiling to look at conflictsMcFarling [1989] reduced caches misses by 75% on 8KB McFarling [1989] reduced caches misses by 75% on 8KB direct mapped cache with 4 byte blocksdirect mapped cache with 4 byte blocksDataDataMerging ArraysMerging ArraysImprove spatial locality by single array of compound elements vs. 2 Improve spatial locality by single array of compound elements vs. 2 arraysarraysLoop InterchangeLoop InterchangeChange nesting of loops to access data in order stored in memoryChange nesting of loops to access data in order stored in memoryLoop FusionLoop FusionCombine two independent loops that have same looping and some Combine two independent loops that have same looping and some variables overlapvariables overlapBlockingBlockingImprove temporal locality by accessing “blocks” of data repeatedly Improve temporal locality by accessing “blocks” of data repeatedly vs. going down whole columns or rowsvs. going down whole columns or rows4Merging Arrays ExampleMerging Arrays ExampleReduces conflicts between Reduces conflicts between valval and and keykeyAddressing expressions are differentAddressing expressions are different/* Before */int val[SIZE];int key[SIZE];/* Before */int val[SIZE];int key[SIZE];/* After */struct merge { int val; int key;};struct merge merged_array[SIZE];/* After */struct merge { int val; int key;};struct merge merged_array[SIZE];5Loop Interchange ExampleLoop Interchange ExampleSequential accesses instead of striding Sequential accesses instead of striding through memory every 100 wordsthrough memory every 100 words/* Before */for (k = 0; k < 100; k++) for (j = 0; j < 100; j++) for (i = 0; i < 5000; i++) x[i][j] = 2 * x[i][j];/* Before */for (k = 0; k < 100; k++) for (j = 0; j < 100; j++) for (i = 0; i < 5000; i++) x[i][j] = 2 * x[i][j];/* After */for (k = 0; k < 100; k++) for (i = 0; i < 5000; i++) for (j = 0; j < 100; j++) x[i][j] = 2 * x[i][j];/* After */for (k = 0; k < 100; k++) for (i = 0; i < 5000; i++) for (j = 0; j < 100; j++) x[i][j] = 2 * x[i][j];6Loop Fusion ExampleLoop Fusion ExampleBefore: 2 misses per access to Before: 2 misses per access to aa and and ccAfter: 1 miss per access to After: 1 miss per access to aa and and cc/* Before */for (i = 0; i < N; i++) for (j = 0; j < N; j++) a[i][j] = 1/b[i][j] * c[i][j];for (i = 0; i < N; i++) for (j = 0; j < N; j++) d[i][j] = a[i][j] + c[i][j];/* Before */for (i = 0; i < N; i++) for (j = 0; j < N; j++) a[i][j] = 1/b[i][j] * c[i][j];for (i = 0; i < N; i++) for (j = 0; j < N; j++) d[i][j] = a[i][j] + c[i][j];/* After */for (i = 0; i < N; i++) for (j = 0; j < N; j++) { a[i][j] = 1/b[i][j] * c[i][j];d[i][j] = a[i][j] + c[i][j];}/* After */for (i = 0; i < N; i++) for (j = 0; j < N; j++) { a[i][j] = 1/b[i][j] * c[i][j];d[i][j] = a[i][j] + c[i][j];}7Blocking ExampleBlocking ExampleTwo Inner Loops:Two Inner Loops:Read all NxN elements of z[]Read all NxN elements of z[]Read N elements of 1 row of y[] repeatedlyRead N elements of 1 row of y[] repeatedlyWrite N elements of 1 row of x[]Write N elements of 1 row of x[]Capacity Misses a function of N and Cache SizeCapacity Misses a function of N and Cache Size3 NxN 3 NxN no capacity misses; otherwise ... no capacity misses; otherwise ...Idea: compute on BxB submatrix that fitsIdea: compute on BxB submatrix that fits/* Before */for (i = 0; i < N; i++) for (j = 0; j < N; j++) { r = 0; for (k = 0; k < N; k++) r = r + y[i][k]*z[k][j]; x[i][j] = r; }/* Before */for (i = 0; i < N; i++) for (j = 0; j < N; j++) { r = 0; for (k = 0; k < N; k++) r = r + y[i][k]*z[k][j]; x[i][j] = r; }8Blocking Example (contd.)Blocking Example (contd.)Age of accessesAge of accessesWhite means not touched yetWhite means not touched yetLight gray means touched a while agoLight gray means touched a while agoDark gray means newer accessesDark gray means newer accesses9Blocking Example (contd.)Blocking Example (contd.)Work with BxB submatricesWork with BxB submatricessmaller working set can fit within the cachesmaller working set can fit within the cachefewer capacity missesfewer capacity misses/* After */for (jj = 0; jj < N; jj = jj+B) for (kk = 0; kk < N; kk = kk+B) for (i = 0; i < N; i++) for (j = jj; j < min(jj+B-1,N); j++) { r = 0; for (k = kk; k < min(kk+B-1,N); k++) r = r + y[i][k]*z[k][j]; x[i][j] = x[i][j] + r; }/* After */for (jj = 0; jj < N; jj = jj+B) for (kk = 0; kk < N; kk = kk+B) for (i = 0; i < N; i++) for (j = jj; j < min(jj+B-1,N); j++) { r = 0; for (k = kk; k < min(kk+B-1,N); k++) r = r + y[i][k]*z[k][j]; x[i][j] = x[i][j] + r; }10Blocking Example (contd.)Blocking Example (contd.)Capacity reqd. goes from (2NCapacity reqd. goes from (2N33 + N + N22) to (2N) to (2N33/B /B +N+N22))BB = = “blocking factor”“blocking factor”11Performance Improvement 1 1.5 2 2.5 3compresscholesky(nasa7)spicemxm (nasa7)btrix (nasa7)tomcatvgmty (nasa7)vpenta
View Full Document