Slide 1TodayCache MemoriesGeneral Cache Organization (S, E, B)Cache ReadExample: Direct Mapped Cache (E = 1)Example: Direct Mapped Cache (E = 1)Example: Direct Mapped Cache (E = 1)Direct-Mapped Cache SimulationA Higher Level ExampleE-way Set Associative Cache (Here: E = 2)E-way Set Associative Cache (Here: E = 2)E-way Set Associative Cache (Here: E = 2)2-Way Set Associative Cache SimulationA Higher Level ExampleSpectrum of AssociativityWhat about writes?Intel Core i7 Cache HierarchyCache Performance MetricsLets think about those numbersAverage memory access time (AMAT)Writing Cache Friendly CodeTodayThe Memory MountainMemory Mountain Test FunctionThe Memory MountainThe Memory MountainThe Memory MountainTodayMiss Rate Analysis for Matrix MultiplyMatrix Multiplication ExampleLayout of C Arrays in Memory (review)Matrix Multiplication (ijk)Matrix Multiplication (jik)Matrix Multiplication (kij)Matrix Multiplication (ikj)Matrix Multiplication (jki)Matrix Multiplication (kji)Summary of Matrix MultiplicationCore i7 Matrix Multiply PerformanceTodayExample: Matrix MultiplicationCache Miss AnalysisCache Miss AnalysisBlocked Matrix MultiplicationCache Miss AnalysisCache Miss AnalysisSummaryConcluding Observations1Cache Memories2TodayCache memory organization and operationPerformance impact of cachesThe memory mountainRearranging loops to improve spatial localityUsing blocking to improve temporal locality3Cache MemoriesCache memories are small, fast SRAM-based memories managed automatically in hardware. Hold frequently accessed blocks of main memoryCPU looks first for data in caches (e.g., L1, L2, and L3), then in main memory.Typical system structure:MainmemoryI/ObridgeBus interfaceALURegister fileCPU chipSystem bus Memory busCache memories4General Cache Organization (S, E, B)E = 2e lines per setS = 2s setssetline0 1 2B-1tagvB = 2b bytes per cache block (the data)Cache size:C = S x E x B data bytesvalid bit5Cache ReadE = 2e lines per setS = 2s sets0 1 2B-1tagvvalid bitB = 2b bytes per cache block (the data)t bits s bits b bitsAddress of word:tagsetindexblockoffsetdata begins at this offset•Locate set•Check if any line in sethas matching tag•Yes + line valid: hit•Locate data startingat offset6Example: Direct Mapped Cache (E = 1)S = 2s setsDirect mapped: One line per setAssume: cache block size 8 bytest bits 0…01 100Address of int:0 1 2 7tagv 3 6540 1 2 7tagv 3 6540 1 2 7tagv 3 6540 1 2 7tagv 3 654find set7Example: Direct Mapped Cache (E = 1)Direct mapped: One line per setAssume: cache block size 8 bytest bits 0…01 100Address of int:0 1 2 7tagv 3 654match: assume yes = hitvalid? +block offsettag8Example: Direct Mapped Cache (E = 1)Direct mapped: One line per setAssume: cache block size 8 bytest bits 0…01 100Address of int:0 1 2 7tagv 3 654match: assume yes = hitvalid? +int (4 Bytes) is hereblock offsetNo match: old line is evicted and replaced9Direct-Mapped Cache SimulationM=16 byte addresses, B=2 bytes/block, S=4 sets, E=1 Blocks/setAddress trace (reads, one byte per read):0 [00002], 1 [00012], 7 [01112], 8 [10002], 0 [00002]xt=1 s=2 b=1xx x0 ? ?v Tag Blockmiss1 0 M[0-1]hitmiss1 0 M[6-7]miss1 1 M[8-9]miss1 0 M[0-1]Set 0Set 1Set 2Set 310A Higher Level Exampleint sum_array_rows(double a[16][16]){ int i, j; double sum = 0; for (i = 0; i < 16; i++) for (j = 0; j < 16; j++) sum += a[i][j]; return sum;}32 B = 4 doublesassume: cold (empty) cache,a[0][0] goes hereint sum_array_cols(double a[16][16]){ int i, j; double sum = 0; for (j = 0; i < 16; i++) for (i = 0; j < 16; j++) sum += a[i][j]; return sum;}blackboardIgnore the variables sum, i, j11E-way Set Associative Cache (Here: E = 2)E = 2: Two lines per setAssume: cache block size 8 bytest bits 0…01 100Address of short int:0 1 2 7tagv 3 6540 1 2 7tagv 3 6540 1 2 7tagv 3 6540 1 2 7tagv 3 6540 1 2 7tagv 3 6540 1 2 7tagv 3 6540 1 2 7tagv 3 6540 1 2 7tagv 3 654find set12E-way Set Associative Cache (Here: E = 2)E = 2: Two lines per setAssume: cache block size 8 bytest bits 0…01 100Address of short int:0 1 2 7tagv 3 6540 1 2 7tagv 3 654compare bothvalid? + match: yes = hitblock offsettag13E-way Set Associative Cache (Here: E = 2)E = 2: Two lines per setAssume: cache block size 8 bytest bits 0…01 100Address of short int:0 1 2 7tagv 3 6540 1 2 7tagv 3 654compare bothvalid? + match: yes = hitblock offsetshort int (2 Bytes) is hereNo match: •One line in set is selected for eviction and replacement•Replacement policies: random, least recently used (LRU), …142-Way Set Associative Cache SimulationM=16 byte addresses, B=2 bytes/block, S=2 sets, E=2 blocks/setAddress trace (reads, one byte per read):0 [00002], 1 [00012], 7 [01112], 8 [10002], 0 [00002]xxt=2 s=1 b=1x x0 ? ?v Tag Block000miss1 00 M[0-1]hitmiss1 01 M[6-7]miss1 10 M[8-9]hitSet 0Set 115A Higher Level Exampleint sum_array_rows(double a[16][16]){ int i, j; double sum = 0; for (i = 0; i < 16; i++) for (j = 0; j < 16; j++) sum += a[i][j]; return sum;}32 B = 4 doublesassume: cold (empty) cache,a[0][0] goes hereint sum_array_rows(double a[16][16]){ int i, j; double sum = 0; for (j = 0; i < 16; i++) for (i = 0; j < 16; j++) sum += a[i][j]; return sum;}blackboardIgnore the variables sum, i, j16Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 16Spectrum of AssociativityFor a cache with 8 entries17What about writes?Multiple copies of data exist:L1, L2, Main Memory, DiskWhat to do on a write-hit?Write-through (write immediately to memory)Write-back (defer write to memory until replacement of line)Need a dirty bit (line different from memory or not)What to do on a write-miss?Write-allocate (load into cache, update line in cache)Good if more writes to the location followNo-write-allocate (writes immediately to memory)TypicalWrite-through + No-write-allocateWrite-back + Write-allocate18Intel Core i7 Cache HierarchyRegsL1 d-cacheL1 i-cacheL2 unified cacheCore 0RegsL1 d-cacheL1 i-cacheL2 unified cacheCore 3…L3 unified cache(shared by all cores)Main memoryProcessor packageL1 i-cache and d-cache:32 KB, 8-way, Access: 4 cyclesL2 unified cache: 256 KB, 8-way, Access: 11 cyclesL3 unified cache:8 MB, 16-way,Access: 30-40 cyclesBlock size: 64 bytes for all caches.19Cache Performance MetricsMiss RateFraction of memory references not found in cache (misses / accesses)= 1 – hit
View Full Document