Slide 1Last TimeLast TimeTodayCache MemoriesInserting an L1 Cache Between the CPU and Main MemoryGeneral Cache Organization (S, E, B)Cache ReadExample: Direct Mapped Cache (E = 1)Example: Direct Mapped Cache (E = 1)Example: Direct Mapped Cache (E = 1)Direct-Mapped Cache SimulationE-way Set Associative Cache (Here: E = 2)E-way Set Associative Cache (Here: E = 2)E-way Set Associative Cache (Here: E = 2)2-Way Associative Cache SimulationWhy Use Middle Bits as Index?What about writes?Cache Performance MetricsSoftware Caches are More FlexibleStrided Access QuestionThe Strided Access Problem (Blackboard?)Writing Cache Friendly CodeTodayThe Memory MountainMemory Mountain Test FunctionMemory Mountain Main RoutineThe Memory MountainX86-64 Memory MountainOpteron Memory MountainRidges of Temporal LocalityA Slope of Spatial LocalityTodayMatrix Multiplication ExampleMiss Rate Analysis for Matrix MultiplyLayout of C Arrays in Memory (review)Matrix Multiplication (ijk)Matrix Multiplication (jik)Matrix Multiplication (kij)Matrix Multiplication (ikj)Matrix Multiplication (jki)Matrix Multiplication (kji)Summary of Matrix MultiplicationPentium Matrix Multiply PerformanceImproving Temporal Locality by BlockingBlocked Matrix Multiply (bijk)Blocked Matrix Multiply AnalysisPentium Blocked Matrix Multiply PerformanceObservationsTodayExample: Matrix MultiplicationCache Miss AnalysisCache Miss AnalysisBlocked Matrix MultiplicationCache Miss AnalysisCache Miss AnalysisSummaryCarnegie MellonIntroduction to Computer Systems15-213, fall 200911th Lecture, Oct. 5th Instructors: Majd Sakr and Khaled HarrasCarnegie MellonLast TimeMemory hierarchy (Here: Core 2 Duo)DiskMain MemoryL2 unified cacheL1 I-cacheL1 D-cacheCPU Reg2 B/cycle8 B/cycle16 B/cycle 1 B/30 cyclesThroughput:Latency: 100 cycles14 cycles3 cycles millions~4 MB32 KB~4 GB ~500 GBCarnegie MellonLast TimeLocalityTemporal locality: Recently referenced items are likely to be referenced again in the near futureSpatial locality: Items with nearby addresses tend to be referenced close together in timeblockblockCarnegie MellonTodayCache organization Memory Wall Program optimization (Matrix Multiplication):Cache optimizationsCache Miss AnalysisCarnegie MellonCache MemoriesCache memories are small, fast SRAM-based memories managed automatically in hardware. Hold frequently accessed blocks of main memoryCPU looks first for data in L1, then in L2, then in main memory.Typical system structure:mainmemoryI/Obridgebus interfaceL2 dataALUregister fileCPU chipSRAM Portsystem busmemory busL1 cacheL2tagsCarnegie MellonInserting an L1 Cache Between the CPU and Main Memorya b c dblock 10p q r sblock 21......w x y zblock 30...The big slow main memory has room for many 4-word blocks.The small fast L1 cache has room for two 4-word blocks.The tiny, very fast CPU register file has room for four 4-byte words.The transfer unit between the cache and main memory is a 4-word block (16 bytes).The transfer unit between the CPU register file and the cache is a 4-byte block.line 0line 1Carnegie MellonGeneral Cache Organization (S, E, B)E = 2e lines per setS = 2s setssetline0 1 2B-1tagvvalid bitB = 2b bytes per cache block (the data)Cache size:S x E x B data bytesCarnegie MellonCache ReadE = 2e lines per setS = 2s sets0 1 2B-1tagvvalid bitB = 2b bytes per cache block (the data)t bits s bits b bitsAddress of word:tagsetindexblockoffsetdata begins at this offset•Locate set•Check if any line in sethas matching tag•Yes + line valid: hit•Locate data startingat offsetCarnegie MellonExample: Direct Mapped Cache (E = 1)S = 2s setsDirect mapped: One line per setAssume: cache block size 8 bytest bits 0…01 100Address of int:0 1 27tagv 36540 1 27tagv 36540 1 27tagv 36540 1 27tagv 3654find setCarnegie MellonExample: Direct Mapped Cache (E = 1)Direct mapped: One line per setAssume: cache block size 8 bytest bits 0…01 100Address of int:0 1 27tagv 3654match: assume yes = hitvalid? +block offsettagCarnegie MellonExample: Direct Mapped Cache (E = 1)Direct mapped: One line per setAssume: cache block size 8 bytest bits 0…01 100Address of int:0 1 27tagv 3654match: assume yes = hitvalid? +int (4 Bytes) is hereblock offsetNo match: old line is evicted and replacedCarnegie MellonDirect-Mapped Cache SimulationM=16 byte addresses, B=2 bytes/block, S=4 sets, E=1 entry/setAddress trace (reads):0 [00002], 1 [00012], 7 [01112], 8 [10002], 0 [00002]xt=1 s=2 b=1xx x0 ? ?v tag datamiss1 0 M[0-1]hitmiss1 0 M[6-7]miss1 1 M[8-9]miss1 0 M[0-1]Carnegie MellonE-way Set Associative Cache (Here: E = 2)E = 2: Two lines per setAssume: cache block size 8 bytest bits 0…01 100Address of short int:0 1 27tagv 36540 1 27tagv 36540 1 27tagv 36540 1 27tagv 36540 1 27tagv 36540 1 27tagv 36540 1 27tagv 36540 1 27tagv 3654find setCarnegie MellonE-way Set Associative Cache (Here: E = 2)E = 2: Two lines per setAssume: cache block size 8 bytest bits 0…01 100Address of short int:0 1 27tagv 36540 1 27tagv 3654compare bothvalid? + match: yes = hitblock offsettagCarnegie MellonE-way Set Associative Cache (Here: E = 2)E = 2: Two lines per setAssume: cache block size 8 bytest bits 0…01 100Address of short int:0 1 27tagv 36540 1 27tagv 3654match bothvalid? + match: yes = hitblock offsetshort int (2 Bytes) is hereNo match: •One line in set is selected for eviction and replacement•Replacement policies: random, least recently used (LRU), …Carnegie Mellon2-Way Associative Cache SimulationM=16 byte addresses, B=2 bytes/block, S=2 sets, E=2 entry/setAddress trace (reads):0 [00002], 1 [00012], 7 [01112], 8 [10002], 0 [00002]xxt=2s=1 b=1x x0 ? ?v tag data000miss1 00 M[0-1]hitmiss1 01 M[6-7]miss1 10 M[8-9]hitCarnegie MellonWhy Use Middle Bits as Index?High-Order Bit IndexingAdjacent memory lines would map to same cache entryPoor use of spatial localityMiddle-Order Bit IndexingConsecutive memory lines map to different cache linesCan hold S*B*E-byte region of address space in cache at one time4-line CacheHigh-OrderBit IndexingMiddle-OrderBit Indexing0001101100000001001000110100010101100111100010011010101111001101111011110000000100100011010001010110011110001001101010111100110111101111Carnegie MellonWhat about writes?Multiple copies of data exist:L1, L2, Main Memory, DiskWhat to do on a write-hit?Write-through (write immediately to memory)Write-back (defer write to memory until replacement of line)Need a dirty bit (line different from
View Full Document