CMSC 411 Computer Systems Architecture
Lecture 15: Memory Hierarchy 2 (Cache Optimizations 2)
CMSC 411 - 13 (some from Patterson, Sussman, others)
CS252 S05

So Far
- Fully associative cache: a memory block can be stored in any cache block.
- Write-through cache:
  - A write (store) changes both cache and main memory right away.
  - Reads only require getting the block on a cache miss.
- Write-back cache:
  - A write changes only the cache.
  - A read causes a write of the dirty block to memory on a replace.
- Reads are easy to make fast; writes are harder:
  - Read data from the cache in parallel with checking the address against the tag of the cache block.
  - A write must verify the address against the tag before the update.

Write Buffers for Write-Through Caches
Example: Alpha 21064
[Diagram: Processor, Cache, Write Buffer, Lower-Level Memory. The write buffer holds data awaiting write-through to lower-level memory.]
- Q. Why a write buffer?
  A. So the CPU doesn't stall.
- Q. Why a buffer, why not just one register?
  A. Bursts of writes are common.
- Q. Are Read After Write (RAW) hazards an issue for the write buffer?
  A. Yes! Drain the buffer before the next read, or send the read first after checking the write buffers.

How much do stalls slow a machine?
- Suppose that on pipelined MIPS, each instruction takes, on average, 2 clock cycles, not counting cache faults/misses.
- Suppose, on average, there are 1.33 memory references per instruction, memory access time is 50 cycles, and the miss rate is 2%.
- Then each instruction takes, on average,
  $2 + (0 \times 0.98) + (1.33 \times 0.02 \times 50) = 3.33$ clock cycles.

Memory stalls (cont.)
To reduce the impact of cache misses, can reduce any of three parameters:
- Main memory access time (miss penalty)
- Cache access (hit) time
- Miss rate

Example: Apple iMac G5
iMac G5, 1.6 GHz

Level     Size   Latency (cycles)   Latency (time)   Managed by
Reg       1K     1                  0.6 ns           compiler
L1 Inst   64K    3                  1.9 ns           hardware
L1 Data   32K    3                  1.9 ns           hardware
L2        512K   11                 6.9 ns           hardware
DRAM      256M   88                 55 ns            OS, hardware, application
Disk      80G    10^7               12 ms            OS, hardware, application

Goal: illusion of large, fast, cheap memory. Let programs address a memory space that scales to the disk size, at a speed that is usually as fast as register access.

iMac's PowerPC 970: all caches on chip
[Die photo labels: registers (1K), L1 (64K instruction), L1 (32K data), 512K L2]

Cache miss terminology
Sometimes cache misses are inevitable:
- Compulsory miss: the first time a block is used, need to bring it into cache.
- Capacity miss: if need to use more blocks at once than can fit into cache, some will bounce in and out.
- Conflict miss: in direct-mapped or set-associative caches, there are certain combinations of addresses that cannot be in cache at the same time.

Miss rate
[Fig. C.9: SPEC2000, LRU replacement]

5 Basic Cache Optimizations
- Reducing miss rate
  1. Larger block size (compulsory misses)
  2. Larger cache size (capacity misses)
  3. Higher associativity (conflict misses)
- Reducing miss penalty
  4. Multilevel caches
- Reducing hit time
  5. Giving reads priority over writes
     E.g., a read completes before earlier writes in the write buffer.

More Terminology
- write allocate: ensure the block is in the cache before performing a write operation.
- write no-allocate: don't allocate the block in the cache if it is not already there.

Another write buffer optimization
Write buffer mechanics, with merging:
- An entry may contain multiple words (maybe even a whole cache block).
- If there's an empty entry, the data and address are written to the buffer, and the CPU is done with the write.
- If the buffer contains other modified blocks, check to see if the new address matches one already in the buffer; if so, combine the new data with that entry.
- If the buffer is full and there is no address match, the cache and CPU wait for an empty entry to appear (meaning some entry has been written to main memory).
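The merging mechanics above can be sketched in a few lines of Python. This is a toy model, not the Alpha 21064 design: the class name `MergingWriteBuffer`, the default sizes (4 entries, 4 words per block), and the `drain_one` and `read_forward` helpers are all assumptions made for illustration.

```python
from collections import OrderedDict


class MergingWriteBuffer:
    """Toy write buffer: each entry holds the modified words of one block."""

    def __init__(self, num_entries=4, words_per_block=4):
        # Sizes are illustrative assumptions, not taken from the slides.
        self.num_entries = num_entries
        self.words_per_block = words_per_block
        # block address -> {word offset: data}, kept in arrival order
        self.entries = OrderedDict()

    def write(self, word_addr, data):
        """Buffer a one-word write; return False if the CPU must wait."""
        block, offset = divmod(word_addr, self.words_per_block)
        if block in self.entries:
            # Address matches an existing entry: merge the new word into it.
            self.entries[block][offset] = data
            return True
        if len(self.entries) < self.num_entries:
            # Empty entry available: store address and data, CPU continues.
            self.entries[block] = {offset: data}
            return True
        # Buffer full and no address match: wait for an entry to drain.
        return False

    def drain_one(self):
        """Remove the oldest entry, as if written to lower-level memory."""
        if not self.entries:
            return None
        return self.entries.popitem(last=False)

    def read_forward(self, word_addr):
        """RAW check: return buffered data for word_addr, or None."""
        block, offset = divmod(word_addr, self.words_per_block)
        return self.entries.get(block, {}).get(offset)


buf = MergingWriteBuffer(num_entries=2)
buf.write(100, "a")   # new entry for block 25
buf.write(101, "b")   # same block, so the word is merged into that entry
buf.write(200, "c")   # second entry, buffer is now full
stalled = not buf.write(300, "d")   # no match and no free entry
```

The `read_forward` helper corresponds to the "send the read first after checking the write buffers" option from the RAW question earlier; the "drain the buffer before the next read" option would instead call `drain_one` until the buffer is empty.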
Merging improves memory efficiency, since multiword writes are usually faster than writing one word at a time.

Don't wait for whole block on cache miss
Two ways to do this; suppose we need the 10th word in a block:
- Early restart: access the required word as soon as it is fetched, instead of waiting for the whole block.
- Critical word first: start the fetch with word 10, and fill in the first few later.

Use a nonblocking cache
- With this optimization, the cache doesn't stop for a miss, but continues to process later requests if possible, even though an earlier one is not yet fulfilled.
- Introduces significant complexity into cache architecture: have to allow multiple outstanding cache requests (maybe even multiple misses), but this is what's done in modern processors.

So Far (cont.)
- Reducing memory stalls: reduce miss penalty, miss rate, cache hit time.
- Reducing miss penalty:
  - Give priority to read over write misses
  - Don't wait for the whole block
  - Use a non-blocking cache

Multi-level cache
- For example, if the cache takes 1 clock cycle and memory takes 50, it might be a good idea to add a larger (but necessarily slower) secondary cache in between, perhaps capable of 10-clock-cycle access.
- Complicates performance analysis (see H&P), but the 2nd-level cache captures many of the 1st-level cache misses, lowering the effective miss penalty, and a 3rd-level cache has the same benefits for the 2nd-level cache.
- Most modern machines have separate 1st-level instruction and data caches, a shared 2nd-level cache, and an off-processor-chip shared 3rd-level cache.

Victim caches
- To remember a cache block that has recently been replaced (evicted), use a small, fully associative cache between a cache and where it gets data from.
- Check the victim cache on a cache miss, before going to the next lower-level memory.
- If found, swap the victim block and the cache block.
- Reduces conflict misses.

Victim caches (cont.)
[Figure from H&P 3ed]

How to reduce the miss rate?
- Use larger blocks
- Use more associativity, to reduce conflict misses
- Victim cache
- Pseudo-associative caches (won't talk about this)
- Prefetch (hardware controlled)
- Prefetch (compiler controlled)
- Compiler optimizations

Increasing block size
Increasing block size (cont.)
- Want the block size large so don't have to stop so often to load
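
To make the victim cache idea from the slides above more concrete, here is a minimal Python sketch of a direct-mapped cache backed by a small fully associative victim cache. The class name `DirectMappedWithVictim`, the sizes, the oldest-first victim replacement, and the choice to track only block addresses (no data) are assumptions for illustration, not details from the lecture.

```python
from collections import OrderedDict


class DirectMappedWithVictim:
    """Toy direct-mapped cache with a small fully associative victim cache."""

    def __init__(self, num_sets=4, victim_entries=2):
        # Sizes are illustrative assumptions.
        self.num_sets = num_sets
        self.victim_entries = victim_entries
        self.lines = [None] * num_sets   # block address held by each set
        self.victims = OrderedDict()     # recently evicted blocks, oldest first

    def access(self, block_addr):
        """Return 'hit', 'victim hit', or 'miss' for one block access."""
        index = block_addr % self.num_sets
        resident = self.lines[index]
        if resident == block_addr:
            return "hit"
        if block_addr in self.victims:
            # Found in the victim cache: swap victim block and cache block.
            del self.victims[block_addr]
            self.lines[index] = block_addr
            if resident is not None:
                self._add_victim(resident)
            return "victim hit"
        # Missed in both: fetch from the next lower level; the evicted
        # block (if any) is remembered in the victim cache.
        self.lines[index] = block_addr
        if resident is not None:
            self._add_victim(resident)
        return "miss"

    def _add_victim(self, block_addr):
        self.victims[block_addr] = True
        if len(self.victims) > self.victim_entries:
            self.victims.popitem(last=False)   # drop the oldest victim


cache = DirectMappedWithVictim()
# Blocks 0 and 4 map to the same set, so they conflict in a direct-mapped cache.
results = [cache.access(b) for b in (0, 4, 0, 4, 4)]
```

In this sketch, the alternating accesses to blocks 0 and 4 would all be conflict misses without the victim cache; with it, only the first access to each block goes to the next lower level.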