Unformatted text preview:

COMP 206: Computer Architecture and ImplementationOutline5. Reduce Misses by Hardware Prefetching6. Reducing Misses by Software Prefetching7. Reduce Misses by Compiler Optzns.Merging Arrays ExampleLoop Interchange ExampleLoop Fusion ExampleBlocking ExampleBlocking Example (contd.)Slide 11Slide 12Reducing Conflict Misses by BlockingSummary: Compiler Optimizations to Reduce Cache MissesReducing Miss Penalty2. Fetching Subblocks to Reduce Miss Penalty3. Early Restart and Critical Word First4. Non-blocking CachesValue of Hit Under Miss for SPEC5. Miss Penalty Reduction: L2 CacheReducing Misses: Which Apply to L2 Cache?L2 cache block size & A.M.A.T.Reducing Miss Penalty SummaryReview: Improving Cache Performance1COMP 206:COMP 206:Computer Architecture and Computer Architecture and ImplementationImplementationMontek SinghMontek SinghWed., Nov. 10, 2003Wed., Nov. 10, 2003Topic: Topic: Caches (contd.)Caches (contd.)2OutlineOutlineImproving Cache PerformanceImproving Cache PerformanceReducing misses -- contd. from previous lectureReducing misses -- contd. from previous lectureReducing miss penaltyReducing miss penaltyReading: HP3 Sections 5.3-5.7Reading: HP3 Sections 5.3-5.735. Reduce Misses by Hardware 5. Reduce Misses by Hardware PrefetchingPrefetchingInstruction prefetchingInstruction prefetchingAlpha 21064 fetches 2 blocks on a missAlpha 21064 fetches 2 blocks on a missExtra block placed in stream bufferExtra block placed in stream bufferOn miss check stream bufferOn miss check stream bufferWorks with data blocks tooWorks with data blocks tooJouppi [1990] 1 data stream buffer got 25% misses Jouppi [1990] 1 data stream buffer got 25% misses from 4KB cache; 4 stream buffers got 43%from 4KB cache; 4 stream buffers got 43%Palacharla & Kessler [1994] for scientific programs for Palacharla & Kessler [1994] for scientific programs for 8 streams got 50% to 70% of misses from 2 64KB, 4-8 streams got 50% to 70% of misses from 2 64KB, 4-way set associative cachesway set associative cachesPrefetching relies on extra memory bandwidth Prefetching relies on extra memory bandwidth that can be used without penaltythat can be used without penaltye.g., up to 8 prefetch stream buffers in the UltraSPARC e.g., up to 8 prefetch stream buffers in the UltraSPARC IIIIII46. Reducing Misses by Software 6. Reducing Misses by Software PrefetchingPrefetchingData prefetchData prefetchCompiler inserts special “prefetch” instructions into Compiler inserts special “prefetch” instructions into programprogramLoad data into register (HP PA-RISC loads)Load data into register (HP PA-RISC loads)Cache Prefetch: load into cache (MIPS IV,PowerPC,SPARC v9)Cache Prefetch: load into cache (MIPS IV,PowerPC,SPARC v9)A form of speculative executionA form of speculative executiondon’t really know if data is needed or if not in cache alreadydon’t really know if data is needed or if not in cache alreadyMost effective prefetches are “semantically invisible” Most effective prefetches are “semantically invisible” to prgmto prgmdoes not change registers or memorydoes not change registers or memorycannot cause a fault/exceptioncannot cause a fault/exceptionif they would fault, they are simply turned into NOP’sif they would fault, they are simply turned into NOP’sIssuing prefetch instructions takes timeIssuing prefetch instructions takes timeIs cost of prefetch issues < savings in reduced misses?Is cost of prefetch issues < savings in reduced misses?57. Reduce Misses by Compiler 7. Reduce Misses by Compiler Optzns.Optzns.InstructionsInstructionsReorder procedures in memory so as to reduce missesReorder procedures in memory so as to reduce missesProfiling to look at conflictsProfiling to look at conflictsMcFarling [1989] reduced caches misses by 75% on 8KB McFarling [1989] reduced caches misses by 75% on 8KB direct mapped cache with 4 byte blocksdirect mapped cache with 4 byte blocksDataDataMerging ArraysMerging ArraysImprove spatial locality by single array of compound elements vs. 2 Improve spatial locality by single array of compound elements vs. 2 arraysarraysLoop InterchangeLoop InterchangeChange nesting of loops to access data in order stored in memoryChange nesting of loops to access data in order stored in memoryLoop FusionLoop FusionCombine two independent loops that have same looping and some Combine two independent loops that have same looping and some variables overlapvariables overlapBlockingBlockingImprove temporal locality by accessing “blocks” of data repeatedly Improve temporal locality by accessing “blocks” of data repeatedly vs. going down whole columns or rowsvs. going down whole columns or rows6Merging Arrays ExampleMerging Arrays ExampleReduces conflicts between Reduces conflicts between valval and and keykeyAddressing expressions are differentAddressing expressions are different/* Before */int val[SIZE];int key[SIZE];/* Before */int val[SIZE];int key[SIZE];/* After */struct merge { int val; int key;};struct merge merged_array[SIZE];/* After */struct merge { int val; int key;};struct merge merged_array[SIZE];7Loop Interchange ExampleLoop Interchange ExampleSequential accesses instead of striding Sequential accesses instead of striding through memory every 100 wordsthrough memory every 100 words/* Before */for (k = 0; k < 100; k++) for (j = 0; j < 100; j++) for (i = 0; i < 5000; i++) x[i][j] = 2 * x[i][j];/* Before */for (k = 0; k < 100; k++) for (j = 0; j < 100; j++) for (i = 0; i < 5000; i++) x[i][j] = 2 * x[i][j];/* After */for (k = 0; k < 100; k++) for (i = 0; i < 5000; i++) for (j = 0; j < 100; j++) x[i][j] = 2 * x[i][j];/* After */for (k = 0; k < 100; k++) for (i = 0; i < 5000; i++) for (j = 0; j < 100; j++) x[i][j] = 2 * x[i][j];8Loop Fusion ExampleLoop Fusion ExampleBefore: 2 misses per access to Before: 2 misses per access to aa and and ccAfter: 1 miss per access to After: 1 miss per access to aa and and cc/* Before */for (i = 0; i < N; i++) for (j = 0; j < N; j++) a[i][j] = 1/b[i][j] * c[i][j];for (i = 0; i < N; i++) for (j = 0; j < N; j++) d[i][j] = a[i][j] + c[i][j];/* Before */for (i = 0; i < N; i++) for (j = 0; j < N; j++) a[i][j] = 1/b[i][j] * c[i][j];for (i = 0; i < N; i++) for (j = 0; j < N; j++) d[i][j] = a[i][j] + c[i][j];/* After */for (i = 0; i < N; i++) for (j = 0; j < N;


View Full Document

UNC-Chapel Hill COMP 206 - LECTURE NOTES

Download LECTURE NOTES
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view LECTURE NOTES and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view LECTURE NOTES 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?