Unformatted text preview:

COMP 206: Computer Architecture and ImplementationOutline7. Reduce Misses by Compiler Optzns.Merging Arrays ExampleLoop Interchange ExampleLoop Fusion ExampleBlocking ExampleBlocking Example (contd.)Slide 9Slide 10Summary: Compiler Optimizations to Reduce Cache MissesReducing Miss Penalty2. Fetching Subblocks to Reduce Miss Penalty3. Early Restart and Critical Word First4. Non-blocking CachesValue of Hit Under Miss for SPEC5. Miss Penalty Reduction: L2 CacheReducing Misses: Which Apply to L2 Cache?L2 cache block size & A.M.A.T.Reducing Miss Penalty SummaryReview: Improving Cache Performance1COMP 206:COMP 206:Computer Architecture and Computer Architecture and ImplementationImplementationMontek SinghMontek SinghWed., Nov. 10, 2004Wed., Nov. 10, 2004Topic: Topic: Caches (contd.)Caches (contd.)2OutlineOutlineImproving Cache PerformanceImproving Cache PerformanceReducing misses -- contd. from previous lectureReducing misses -- contd. from previous lectureReducing miss penaltyReducing miss penaltyReading: HP3 Sections 5.3-5.7Reading: HP3 Sections 5.3-5.737. Reduce Misses by Compiler 7. Reduce Misses by Compiler Optzns.Optzns.InstructionsInstructionsReorder procedures in memory so as to reduce missesReorder procedures in memory so as to reduce missesProfiling to look at conflictsProfiling to look at conflictsMcFarling [1989] reduced caches misses by 75% on 8KB McFarling [1989] reduced caches misses by 75% on 8KB direct mapped cache with 4 byte blocksdirect mapped cache with 4 byte blocksDataDataMerging ArraysMerging ArraysImprove spatial locality by single array of compound elements vs. 2 Improve spatial locality by single array of compound elements vs. 2 arraysarraysLoop InterchangeLoop InterchangeChange nesting of loops to access data in order stored in memoryChange nesting of loops to access data in order stored in memoryLoop FusionLoop FusionCombine two independent loops that have same looping and some Combine two independent loops that have same looping and some variables overlapvariables overlapBlockingBlockingImprove temporal locality by accessing “blocks” of data repeatedly Improve temporal locality by accessing “blocks” of data repeatedly vs. going down whole columns or rowsvs. going down whole columns or rows4Merging Arrays ExampleMerging Arrays ExampleReduces conflicts between Reduces conflicts between valval and and keykeyAddressing expressions are differentAddressing expressions are different/* Before */int val[SIZE];int key[SIZE];/* Before */int val[SIZE];int key[SIZE];/* After */struct merge { int val; int key;};struct merge merged_array[SIZE];/* After */struct merge { int val; int key;};struct merge merged_array[SIZE];5Loop Interchange ExampleLoop Interchange ExampleSequential accesses instead of striding Sequential accesses instead of striding through memory every 100 wordsthrough memory every 100 words/* Before */for (k = 0; k < 100; k++) for (j = 0; j < 100; j++) for (i = 0; i < 5000; i++) x[i][j] = 2 * x[i][j];/* Before */for (k = 0; k < 100; k++) for (j = 0; j < 100; j++) for (i = 0; i < 5000; i++) x[i][j] = 2 * x[i][j];/* After */for (k = 0; k < 100; k++) for (i = 0; i < 5000; i++) for (j = 0; j < 100; j++) x[i][j] = 2 * x[i][j];/* After */for (k = 0; k < 100; k++) for (i = 0; i < 5000; i++) for (j = 0; j < 100; j++) x[i][j] = 2 * x[i][j];6Loop Fusion ExampleLoop Fusion ExampleBefore: 2 misses per access to Before: 2 misses per access to aa and and ccAfter: 1 miss per access to After: 1 miss per access to aa and and cc/* Before */for (i = 0; i < N; i++) for (j = 0; j < N; j++) a[i][j] = 1/b[i][j] * c[i][j];for (i = 0; i < N; i++) for (j = 0; j < N; j++) d[i][j] = a[i][j] + c[i][j];/* Before */for (i = 0; i < N; i++) for (j = 0; j < N; j++) a[i][j] = 1/b[i][j] * c[i][j];for (i = 0; i < N; i++) for (j = 0; j < N; j++) d[i][j] = a[i][j] + c[i][j];/* After */for (i = 0; i < N; i++) for (j = 0; j < N; j++) { a[i][j] = 1/b[i][j] * c[i][j];d[i][j] = a[i][j] + c[i][j];}/* After */for (i = 0; i < N; i++) for (j = 0; j < N; j++) { a[i][j] = 1/b[i][j] * c[i][j];d[i][j] = a[i][j] + c[i][j];}7Blocking ExampleBlocking ExampleTwo Inner Loops:Two Inner Loops:Read all NxN elements of z[]Read all NxN elements of z[]Read N elements of 1 row of y[] repeatedlyRead N elements of 1 row of y[] repeatedlyWrite N elements of 1 row of x[]Write N elements of 1 row of x[]Capacity Misses a function of N and Cache SizeCapacity Misses a function of N and Cache Size3 NxN 3 NxN  no capacity misses; otherwise ... no capacity misses; otherwise ...Idea: compute on BxB submatrix that fitsIdea: compute on BxB submatrix that fits/* Before */for (i = 0; i < N; i++) for (j = 0; j < N; j++) { r = 0; for (k = 0; k < N; k++) r = r + y[i][k]*z[k][j]; x[i][j] = r; }/* Before */for (i = 0; i < N; i++) for (j = 0; j < N; j++) { r = 0; for (k = 0; k < N; k++) r = r + y[i][k]*z[k][j]; x[i][j] = r; }8Blocking Example (contd.)Blocking Example (contd.)Age of accessesAge of accessesWhite means not touched yetWhite means not touched yetLight gray means touched a while agoLight gray means touched a while agoDark gray means newer accessesDark gray means newer accesses9Blocking Example (contd.)Blocking Example (contd.)Work with BxB submatricesWork with BxB submatricessmaller working set can fit within the cachesmaller working set can fit within the cachefewer capacity missesfewer capacity misses/* After */for (jj = 0; jj < N; jj = jj+B) for (kk = 0; kk < N; kk = kk+B) for (i = 0; i < N; i++) for (j = jj; j < min(jj+B-1,N); j++) { r = 0; for (k = kk; k < min(kk+B-1,N); k++) r = r + y[i][k]*z[k][j]; x[i][j] = x[i][j] + r; }/* After */for (jj = 0; jj < N; jj = jj+B) for (kk = 0; kk < N; kk = kk+B) for (i = 0; i < N; i++) for (j = jj; j < min(jj+B-1,N); j++) { r = 0; for (k = kk; k < min(kk+B-1,N); k++) r = r + y[i][k]*z[k][j]; x[i][j] = x[i][j] + r; }10Blocking Example (contd.)Blocking Example (contd.)Capacity reqd. goes from (2NCapacity reqd. goes from (2N33 + N + N22) to (2N) to (2N33/B /B +N+N22))BB = = “blocking factor”“blocking factor”11Performance Improvement 1 1.5 2 2.5 3compresscholesky(nasa7)spicemxm (nasa7)btrix (nasa7)tomcatvgmty (nasa7)vpenta


View Full Document

UNC-Chapel Hill COMP 206 - LECTURE NOTES

Download LECTURE NOTES
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view LECTURE NOTES and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view LECTURE NOTES 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?