Why is memory important Memory wall due to emphasis on speed for processor density for DRAM Memory Hierarchy Design 1000 CPU CPU DRAM Speed Gap 100 10 DRAM 2000 1999 1998 1997 1996 1995 1994 1993 1992 1991 1990 1989 1988 1987 1986 1985 1984 1983 1982 1981 1980 1 Memory Hierarchy The Goal Provide memory system with cost of cheapest level of memory and with speed as fast as fastest level of memory CPU registers Instruction cache Upper level Lower level Data cache Unified cache L1 cache L2 cache DRAM memory Secondary storage The correctness criteria The execution results of a program should be as if it were executed on a system without any cache memory Page 1 1 General Caching Principles Some Definitions Locality Upper is closer to processor Block minimum unit of information that can be present in cache Hit time time to access cache including hit determination Hit rate fraction found in the cache Temporal Locality referenced again soon Spatial Locality nearby items referenced soon Locality smaller HW is faster memory hierarchy Levels each smaller faster more expensive byte than level below Inclusive data found in top also found in the bottom Miss rate 1 Hit Rate Miss penalty time to fetch a block from lower level including time to replace in CPU access time time to access lower level transfer time time to transfer block Four Questions for Memory Hierarchy Designers Differences in Memory Levels Q1 Where can a block be placed in the upper level Block placement Q2 How is a block found if it is in the upper level Block identification Q3 Which block should be replaced on a miss Block replacement Q4 What happens on a write Write strategy Page 2 2 Q1 Where can a block be placed in the upper level Associativity Examples Fully associative Block 12 can go anywhere Direct Mapped Each block has only one place that it can appear in the cache Fully associative Each block can be placed anywhere in the cache Set associative Each block can be placed in a restricted set of places in the cache Direct mapped Block no Block address mod of blocks in cache Block 12 can go only into block 4 12 mod 8 Set associative Set no Block address mod of sets in cache Block 12 can go only into set 0 12 mod 4 but any block in that set If there are n blocks in a set the cache placement is called n way set associative Mapping Function Q2 How Is a Block Found If It Is in the Upper Level The address can be divided into two main parts Block offset selects the data from the block offset size log2 block size Block address tag index index selects set in cache index size log2 of sets log2 of blocks set associativity tag compared to tag in cache to determine hit tag size address size index size offset size Page 3 3 Mapping Structure Mapping Function Main Memory Consider 64KB cache 4 Byte data transfer block size between main memory and cache Cache Organized in 16K block frames slots of 4 Bytes each Column Number 0 1 Row Number Main memory has 16MBytes Or we can treat the memory as having 4M blocks of 4 Bytes each 24 bit memory address Cache Row Number Column Number Direct Mapping Mapping Structure Mapping Rows Columns Main Memory Rows Columns Cache Direct 256 16K 1 16K Associative 4M 1 16K 1 Set Associative 512 8K 2 8K 8 Bits Tag 14 Bits index No 2 Page 4 4 Associative Mapping Set Associative Mapping Tag 9 2 22 bit tag Index 13 Block Offset 2 Q3 Which Block Should be Replaced on a Miss Address Match to Cache Direct Mapped No Choice only one candidate Tag Index Block Offset Set Associative or Fully Associative Random easier to implement but less efficient Least Recently used harder to implement but more efficient 2 way set associative cache Valid bit v Tag search Miss rates for caches with different size associativity and replacement algorithm Data block 8 3 Page 5 5 Q4 What Happens on a Write Q4 What Happens on a Write Write through The information is written to both the block in the cache and to the block in the lower level memory combined with write buffers so it won t wait for lower level memory access Write back The information is written only to the block in the cache The modified cache block is written to main memory only when it is replaced Since data does not have to be brought into the cache on a write miss there are two options Write allocate The block is brought into the cache on a write miss Used mostly with write back caches Hope subsequent accesses to the block hit in cache No write allocate The block is modified in memory but not brought into the cache Used with write through caches Writes have to go to memory anyway so why bring the block into the cache is block clean or dirty add a dirty bit to each block Pros and Cons of each obvious ones Write through read misses cannot result in writes to memory easier to implement main memory is always up to date less coherence problem Write back Less memory traffic Perform writes at the speed of the cache Cache Measures Miss Rate Pitfall Hit rate fraction found in the cache Miss Rate to Memory Hierarchy Performance is like MIPS to CPU Performance So high that we usually talk about Miss rate 1 Hit Rate Hit time time to access the cache Miss penalty time to fetch a block from lower level including time to replace in CPU Increasing block size generally decreases miss rate but the accompanying increase in miss penalty may outweigh the decrease in miss rate Average memory access time is a more reliable measure of cache performance access time time to access lower level transfer time time to transfer block Average memory access time Hit time Miss rate x Miss penalty ns or clocks Page 6 6 Implications For CPU Example Alpha 21064 Data Cache The data cache of the Alpha 21064 has the following features Fast hit check for every memory access Hit is the common case Unpredictable memory access time 10s of clock cycles wait 1000s of clock cycles Interrupt switch do something else New style multithreaded execution How to handle miss 10s HW 1000s SW Writes in Alpha 21064 8 KB of data 32 byte blocks Direct mapped placement Write through no write allocate 4 block write buffer 34 bit physical address composed of 5 bit block offset 8 bit index 256 blocks 8192 32x1 21 bit tag Example Alpha 21064 Data Cache No write merging vs write merging in write buffer A cache read has 4 steps 1 The address from the cache is divided into the tag index and block offset 2 The index selects block 3 The address tag is compared with the tag in the cache the valid bit is checked and data to be loaded is selected
View Full Document