COMP 206: Computer Architecture and ImplementationOutline5. Reduce Misses by Hardware Prefetching6. Reducing Misses by Software Prefetching7. Reduce Misses by Compiler Optzns.Merging Arrays ExampleLoop Interchange ExampleLoop Fusion ExampleBlocking ExampleBlocking Example (contd.)Slide 11Slide 12Reducing Conflict Misses by BlockingSummary: Compiler Optimizations to Reduce Cache MissesReducing Miss Penalty2. Fetching Subblocks to Reduce Miss Penalty3. Early Restart and Critical Word First4. Non-blocking CachesValue of Hit Under Miss for SPEC5. Miss Penalty Reduction: L2 CacheReducing Misses: Which Apply to L2 Cache?L2 cache block size & A.M.A.T.Reducing Miss Penalty SummaryReview: Improving Cache Performance1COMP 206:COMP 206:Computer Architecture and Computer Architecture and ImplementationImplementationMontek SinghMontek SinghWed., Nov. 10, 2003Wed., Nov. 10, 2003Topic: Topic: Caches (contd.)Caches (contd.)2OutlineOutlineImproving Cache PerformanceImproving Cache PerformanceReducing misses -- contd. from previous lectureReducing misses -- contd. from previous lectureReducing miss penaltyReducing miss penaltyReading: HP3 Sections 5.3-5.7Reading: HP3 Sections 5.3-5.735. Reduce Misses by Hardware 5. Reduce Misses by Hardware PrefetchingPrefetchingInstruction prefetchingInstruction prefetchingAlpha 21064 fetches 2 blocks on a missAlpha 21064 fetches 2 blocks on a missExtra block placed in stream bufferExtra block placed in stream bufferOn miss check stream bufferOn miss check stream bufferWorks with data blocks tooWorks with data blocks tooJouppi [1990] 1 data stream buffer got 25% misses Jouppi [1990] 1 data stream buffer got 25% misses from 4KB cache; 4 stream buffers got 43%from 4KB cache; 4 stream buffers got 43%Palacharla & Kessler [1994] for scientific programs for Palacharla & Kessler [1994] for scientific programs for 8 streams got 50% to 70% of misses from 2 64KB, 4-8 streams got 50% to 70% of misses from 2 64KB, 4-way set associative cachesway set associative cachesPrefetching relies on extra memory bandwidth Prefetching relies on extra memory bandwidth that can be used without penaltythat can be used without penaltye.g., up to 8 prefetch stream buffers in the UltraSPARC e.g., up to 8 prefetch stream buffers in the UltraSPARC IIIIII46. Reducing Misses by Software 6. Reducing Misses by Software PrefetchingPrefetchingData prefetchData prefetchCompiler inserts special “prefetch” instructions into Compiler inserts special “prefetch” instructions into programprogramLoad data into register (HP PA-RISC loads)Load data into register (HP PA-RISC loads)Cache Prefetch: load into cache (MIPS IV,PowerPC,SPARC v9)Cache Prefetch: load into cache (MIPS IV,PowerPC,SPARC v9)A form of speculative executionA form of speculative executiondon’t really know if data is needed or if not in cache alreadydon’t really know if data is needed or if not in cache alreadyMost effective prefetches are “semantically invisible” Most effective prefetches are “semantically invisible” to prgmto prgmdoes not change registers or memorydoes not change registers or memorycannot cause a fault/exceptioncannot cause a fault/exceptionif they would fault, they are simply turned into NOP’sif they would fault, they are simply turned into NOP’sIssuing prefetch instructions takes timeIssuing prefetch instructions takes timeIs cost of prefetch issues < savings in reduced misses?Is cost of prefetch issues < savings in reduced misses?57. Reduce Misses by Compiler 7. Reduce Misses by Compiler Optzns.Optzns.InstructionsInstructionsReorder procedures in memory so as to reduce missesReorder procedures in memory so as to reduce missesProfiling to look at conflictsProfiling to look at conflictsMcFarling [1989] reduced caches misses by 75% on 8KB McFarling [1989] reduced caches misses by 75% on 8KB direct mapped cache with 4 byte blocksdirect mapped cache with 4 byte blocksDataDataMerging ArraysMerging ArraysImprove spatial locality by single array of compound elements vs. 2 Improve spatial locality by single array of compound elements vs. 2 arraysarraysLoop InterchangeLoop InterchangeChange nesting of loops to access data in order stored in memoryChange nesting of loops to access data in order stored in memoryLoop FusionLoop FusionCombine two independent loops that have same looping and some Combine two independent loops that have same looping and some variables overlapvariables overlapBlockingBlockingImprove temporal locality by accessing “blocks” of data repeatedly Improve temporal locality by accessing “blocks” of data repeatedly vs. going down whole columns or rowsvs. going down whole columns or rows6Merging Arrays ExampleMerging Arrays ExampleReduces conflicts between Reduces conflicts between valval and and keykeyAddressing expressions are differentAddressing expressions are different/* Before */int val[SIZE];int key[SIZE];/* Before */int val[SIZE];int key[SIZE];/* After */struct merge { int val; int key;};struct merge merged_array[SIZE];/* After */struct merge { int val; int key;};struct merge merged_array[SIZE];7Loop Interchange ExampleLoop Interchange ExampleSequential accesses instead of striding Sequential accesses instead of striding through memory every 100 wordsthrough memory every 100 words/* Before */for (k = 0; k < 100; k++) for (j = 0; j < 100; j++) for (i = 0; i < 5000; i++) x[i][j] = 2 * x[i][j];/* Before */for (k = 0; k < 100; k++) for (j = 0; j < 100; j++) for (i = 0; i < 5000; i++) x[i][j] = 2 * x[i][j];/* After */for (k = 0; k < 100; k++) for (i = 0; i < 5000; i++) for (j = 0; j < 100; j++) x[i][j] = 2 * x[i][j];/* After */for (k = 0; k < 100; k++) for (i = 0; i < 5000; i++) for (j = 0; j < 100; j++) x[i][j] = 2 * x[i][j];8Loop Fusion ExampleLoop Fusion ExampleBefore: 2 misses per access to Before: 2 misses per access to aa and and ccAfter: 1 miss per access to After: 1 miss per access to aa and and cc/* Before */for (i = 0; i < N; i++) for (j = 0; j < N; j++) a[i][j] = 1/b[i][j] * c[i][j];for (i = 0; i < N; i++) for (j = 0; j < N; j++) d[i][j] = a[i][j] + c[i][j];/* Before */for (i = 0; i < N; i++) for (j = 0; j < N; j++) a[i][j] = 1/b[i][j] * c[i][j];for (i = 0; i < N; i++) for (j = 0; j < N; j++) d[i][j] = a[i][j] + c[i][j];/* After */for (i = 0; i < N; i++) for (j = 0; j < N;
View Full Document