UT CS 429H - Cache Memories

Contents (slide titles): Today; Cache Memories; General Cache Organization (S, E, B); Cache Read; Example: Direct Mapped Cache (E = 1); Direct-Mapped Cache Simulation; A Higher Level Example; E-way Set Associative Cache (Here: E = 2); 2-Way Set Associative Cache Simulation; A Higher Level Example; Spectrum of Associativity; What about writes?; Intel Core i7 Cache Hierarchy; Cache Performance Metrics; Let's think about those numbers; Average memory access time (AMAT); Writing Cache Friendly Code; The Memory Mountain; Memory Mountain Test Function; Miss Rate Analysis for Matrix Multiply; Matrix Multiplication Example; Layout of C Arrays in Memory (review); Matrix Multiplication (ijk, jik, kij, ikj, jki, kji); Summary of Matrix Multiplication; Core i7 Matrix Multiply Performance; Example: Matrix Multiplication; Cache Miss Analysis; Blocked Matrix Multiplication; Summary; Concluding Observations.

Slide 1: Cache Memories

Slide 2: Today
- Cache memory organization and operation
- Performance impact of caches
  - The memory mountain
  - Rearranging loops to improve spatial locality
  - Using blocking to improve temporal locality

Slide 3: Cache Memories
Cache memories are small, fast SRAM-based memories managed automatically in hardware.
- They hold frequently accessed blocks of main memory.
- The CPU looks first for data in the caches (e.g., L1, L2, and L3), then in main memory.
Typical system structure (diagram): the CPU chip contains the register file, the ALU, the cache memories, and the bus interface; the bus interface connects over the system bus to the I/O bridge, which connects over the memory bus to main memory.

Slide 4: General Cache Organization (S, E, B)
- S = 2^s sets
- E = 2^e lines per set
- B = 2^b bytes per cache block (the data)
- Each line holds a valid bit, a tag, and a block of B bytes (bytes 0, 1, 2, ..., B-1).
- Cache size: C = S x E x B data bytes

Slide 5: Cache Read
Address of word: t tag bits | s set-index bits | b block-offset bits.
- Locate the set.
- Check if any line in the set has a matching tag.
- Yes + line valid: hit.
- Locate the data starting at the block offset (the data begins at this offset).

Slide 6: Example: Direct Mapped Cache (E = 1)
Direct mapped: one line per set. Assume a cache block size of 8 bytes.
Address of int: t tag bits | set index 0...01 | offset 100.
Step 1: use the set-index bits to find the set.
(Diagram: S = 2^s sets, each with one line holding a valid bit, a tag, and bytes 0-7.)

Slide 7: Example: Direct Mapped Cache (E = 1)
Step 2: check the valid bit and compare the line's tag with the address tag (match: assume yes = hit).

Slide 8: Example: Direct Mapped Cache (E = 1)
Step 3: the block offset (100) locates the data; the int (4 bytes) is at that offset within the block.
No match: the old line is evicted and replaced.

Slide 9: Direct-Mapped Cache Simulation
M = 16 byte addresses (4-bit addresses), B = 2 bytes/block, S = 4 sets, E = 1 block/set, so t = 1, s = 2, b = 1.
Address trace (reads, one byte per read): 0 [0000], 1 [0001], 7 [0111], 8 [1000], 0 [0000].

Result of each access:
  0 [0000]  miss  (Set 0 <- v = 1, Tag 0, Block M[0-1])
  1 [0001]  hit
  7 [0111]  miss  (Set 3 <- v = 1, Tag 0, Block M[6-7])
  8 [1000]  miss  (Set 0 <- v = 1, Tag 1, Block M[8-9], evicting M[0-1])
  0 [0000]  miss  (Set 0 <- v = 1, Tag 0, Block M[0-1], evicting M[8-9])

Final cache state:
  Set 0: v = 1, Tag 0, Block M[0-1]
  Set 1: empty
  Set 2: empty
  Set 3: v = 1, Tag 0, Block M[6-7]
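The lookup steps above can also be written down as code. The sketch below is not part of the slides; it replays the slide-9 address trace through a tiny direct-mapped cache model with the same parameters (S = 4, E = 1, B = 2, so t = 1, s = 2, b = 1). The struct and variable names (line, cache, trace) are illustrative, and the program should print the same miss, hit, miss, miss, miss sequence as the table above.

/* Minimal sketch (not from the slides): replay the slide-9 address trace
 * through a tiny direct-mapped cache model with the slide's parameters. */
#include <stdio.h>

#define NSETS  4       /* S = 4 sets              */
#define B_BITS 1       /* b = 1 block-offset bit  */
#define S_BITS 2       /* s = 2 set-index bits    */

struct line { int valid; unsigned tag; };

int main(void) {
    static struct line cache[NSETS];        /* cold cache: statics start zeroed */
    unsigned trace[] = {0, 1, 7, 8, 0};     /* address trace from slide 9       */
    int n = sizeof trace / sizeof trace[0];

    for (int i = 0; i < n; i++) {
        unsigned addr = trace[i];
        unsigned set  = (addr >> B_BITS) & (NSETS - 1);  /* middle s bits */
        unsigned tag  = addr >> (B_BITS + S_BITS);       /* upper t bit   */

        if (cache[set].valid && cache[set].tag == tag) {
            printf("addr %u: hit  (set %u)\n", addr, set);
        } else {
            printf("addr %u: miss (set %u) -> load M[%u-%u], tag %u\n",
                   addr, set, addr & ~1u, (addr & ~1u) + 1, tag);
            cache[set].valid = 1;            /* fill the line on a miss */
            cache[set].tag   = tag;
        }
    }
    return 0;
}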
Slide 10: A Higher Level Example
Assume a cold (empty) cache with 32-byte blocks, so one block holds 32 B = 4 doubles, and a[0][0] falls at the start of a block. Ignore the variables sum, i, and j. (The miss pattern is worked out on the blackboard.)

double sum_array_rows(double a[16][16])
{
    int i, j;
    double sum = 0;

    for (i = 0; i < 16; i++)
        for (j = 0; j < 16; j++)
            sum += a[i][j];
    return sum;
}

double sum_array_cols(double a[16][16])
{
    int i, j;
    double sum = 0;

    for (j = 0; j < 16; j++)
        for (i = 0; i < 16; i++)
            sum += a[i][j];
    return sum;
}

Slide 11: E-way Set Associative Cache (Here: E = 2)
E = 2: two lines per set. Assume a cache block size of 8 bytes.
Address of short int: t tag bits | set index 0...01 | offset 100.
Step 1: use the set-index bits to find the set.
(Diagram: S sets, each with two lines; every line holds a valid bit, a tag, and bytes 0-7.)

Slide 12: E-way Set Associative Cache (Here: E = 2)
Step 2: compare both lines in the set (valid? + tag match: yes = hit).

Slide 13: E-way Set Associative Cache (Here: E = 2)
Step 3: the block offset locates the data; the short int (2 bytes) is at that offset within the block.
No match:
- One line in the set is selected for eviction and replacement.
- Replacement policies: random, least recently used (LRU), ...

Slide 14: 2-Way Set Associative Cache Simulation
M = 16 byte addresses (4-bit addresses), B = 2 bytes/block, S = 2 sets, E = 2 blocks/set, so t = 2, s = 1, b = 1.
Address trace (reads, one byte per read): 0 [0000], 1 [0001], 7 [0111], 8 [1000], 0 [0000].

Result of each access:
  0 [0000]  miss  (Set 0, line 0 <- v = 1, Tag 00, Block M[0-1])
  1 [0001]  hit
  7 [0111]  miss  (Set 1, line 0 <- v = 1, Tag 01, Block M[6-7])
  8 [1000]  miss  (Set 0, line 1 <- v = 1, Tag 10, Block M[8-9])
  0 [0000]  hit

Final cache state:
  Set 0: line 0: v = 1, Tag 00, Block M[0-1]; line 1: v = 1, Tag 10, Block M[8-9]
  Set 1: line 0: v = 1, Tag 01, Block M[6-7]; line 1: empty

Slide 15: A Higher Level Example
The same two functions, sum_array_rows and sum_array_cols (slide 10), revisited for the 2-way set associative cache. Again assume a cold (empty) cache with 32-byte blocks (32 B = 4 doubles) and a[0][0] at the start of a block; ignore the variables sum, i, and j. (The miss pattern is worked out on the blackboard.)

Slide 16: Spectrum of Associativity
For a cache with 8 entries. (Diagram: the same eight entries organized with increasing associativity, from direct mapped up to fully associative.)

Slide 17: What about writes?
Multiple copies of the data exist: in L1, L2, main memory, and on disk.
What to do on a write-hit?
- Write-through: write immediately to memory.
- Write-back: defer the write to memory until the line is replaced. Needs a dirty bit (is the line different from memory or not?).
What to do on a write-miss?
- Write-allocate: load the block into the cache and update the line in the cache. Good if more writes to that location follow.
- No-write-allocate: write immediately to memory (the block is not loaded into the cache).
Typical combinations:
- Write-through + No-write-allocate
- Write-back + Write-allocate

Slide 18: Intel Core i7 Cache Hierarchy
(Diagram: within the processor package, each core (Core 0 ... Core 3) has its own registers, L1 d-cache, L1 i-cache, and L2 unified cache; the L3 unified cache is shared by all cores and sits in front of main memory.)
- L1 i-cache and d-cache: 32 KB, 8-way, access: 4 cycles
- L2 unified cache: 256 KB, 8-way, access: 11 cycles
- L3 unified cache (shared by all cores): 8 MB, 16-way, access: 30-40 cycles
- Block size: 64 bytes for all caches.

Slide 19: Cache Performance Metrics
Miss rate: the fraction of memory references not found in the cache (misses / accesses) = 1 - hit rate.
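The miss-rate definition can be made concrete for the sum_array_rows / sum_array_cols pair from slides 10 and 15. The sketch below is not part of the slides; it replays both traversal orders of a 16 x 16 array of doubles through a small direct-mapped cache model. The cache geometry (16 sets of 32-byte blocks, 512 B total) is an assumption chosen for illustration; the slides fix only the 32-byte block size and the cold cache. Under that assumption the row-wise traversal misses once per 4-double block (64 misses out of 256 accesses, miss rate 0.25), while the column-wise traversal misses on every access (miss rate 1.0), which is the contrast the blackboard analysis is meant to bring out.

/* Hypothetical miss-rate sketch (not from the slides).  Cache model:
 * direct mapped, 16 sets x 32-byte blocks = 512 B total (the 512 B size
 * is assumed here; the slide specifies only 32-byte blocks).  The 16x16
 * array of doubles (2048 B) is assumed to start at a block boundary. */
#include <stdio.h>

#define NSETS 16
#define BLOCK 32                          /* bytes per cache block */

static unsigned long tags[NSETS];         /* tag held by each set   */
static int valid[NSETS];                  /* valid bit for each set */

/* Simulate one read of the byte at 'addr'; return 1 on a miss, 0 on a hit. */
static int cache_access(unsigned long addr) {
    unsigned long block = addr / BLOCK;
    unsigned long set   = block % NSETS;
    unsigned long tag   = block / NSETS;
    if (valid[set] && tags[set] == tag)
        return 0;                         /* hit */
    valid[set] = 1;                       /* miss: fill the line */
    tags[set]  = tag;
    return 1;
}

int main(void) {
    int i, j, misses;

    /* Row-wise traversal, as in sum_array_rows: j in the inner loop. */
    for (i = 0; i < NSETS; i++) valid[i] = 0;        /* cold cache */
    misses = 0;
    for (i = 0; i < 16; i++)
        for (j = 0; j < 16; j++)
            misses += cache_access((i * 16 + j) * sizeof(double));
    printf("row-wise:    %3d misses / 256 accesses\n", misses);

    /* Column-wise traversal, as in sum_array_cols: i in the inner loop. */
    for (i = 0; i < NSETS; i++) valid[i] = 0;        /* cold cache again */
    misses = 0;
    for (j = 0; j < 16; j++)
        for (i = 0; i < 16; i++)
            misses += cache_access((i * 16 + j) * sizeof(double));
    printf("column-wise: %3d misses / 256 accesses\n", misses);

    return 0;
}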

