15-213: The course that gives CMU its Zip
Cache Memories, Oct. 2, 2003 (15-213 F'03, class12.ppt)

Topics
- Generic cache memory organization
- Direct-mapped caches
- Set associative caches
- Impact of caches on performance

Cache Memories
Cache memories are small, fast SRAM-based memories managed automatically in hardware.
- They hold frequently accessed blocks of main memory.
- The CPU looks first for data in L1, then in L2, then in main memory.

Typical System Structure / Inserting an L1 Cache Between the CPU and Main Memory
[Figures omitted: typical system structure; an L1 cache inserted between the CPU and main memory]

General Organization of a Cache
[Figure omitted: general organization of a cache]

Addressing Caches
[Figure omitted: how the address bits are used to locate a word in the cache]

Direct-Mapped Cache
- The simplest kind of cache, easy to build (only one tag compare required per access).
- Characterized by exactly one line per set.

Accessing Direct-Mapped Caches
- Set selection: use the set index bits to determine the set of interest.
- Line matching: find a valid line in the selected set with a matching tag.
- Word selection: then extract the word.

Direct-Mapped Cache Simulation
[Figure omitted: simulation of the address trace 0000, 0001, 0111, 1000, 0000 (binary)]

Set Associative Caches
- Characterized by more than one line per set.

Accessing Set Associative Caches
- Set selection: identical to a direct-mapped cache.
- Line matching: must
compare the tag in each valid line in the selected set.
- Word selection: the same as in a direct-mapped cache.

2-Way Associative Cache Simulation
[Figure omitted: simulation of the address trace 0000, 0001, 0111, 1000, 0000 (binary)]

Why Use Middle Bits as Index?
- High-order bit indexing: adjacent memory lines would map to the same cache entry; poor use of spatial locality.
- Middle-order bit indexing: consecutive memory lines map to different cache lines; can hold an S x B x E-byte region of the address space in the cache at one time.
[Figure omitted: 4-bit addresses 0000-1111 mapped into a 4-set cache under high-order vs. middle-order indexing]

Maintaining a Set Associative Cache
How to decide which cache line to use within a set?
- Least Recently Used (LRU): requires lg2(E) extra bits.
- Not Recently Used (NRU).
- Random.
Virtual vs. physical addresses: the memory system works with physical addresses, but it takes time to translate a virtual address to a physical one. So most L1 caches are virtually indexed but physically tagged.

Multi-Level Caches
- Options: separate data and instruction caches, or a unified cache.
[Figure omitted: multi-level hierarchy with registers, L1, L2, main memory, and disk]

What About Writes?
- Multiple copies of data exist: L1, L2, main memory, disk.
- What to do when we write?
  - Write-through.
  - Write-back (need a dirty bit).
- What to do on a write miss?
- What to do on a replacement? Depends on whether it is write-through or write-back.

Intel Pentium III Cache Hierarchy
[Figure omitted: Pentium III cache hierarchy]

Cache Performance Metrics
- Miss rate: fraction of memory references not found in the cache (misses / references). Typical numbers: 3-10% for L1; can be quite small (e.g., < 1%) for L2, depending on size, etc.
- Hit time: time to deliver a line in the cache to the processor; includes the time to determine whether the line is in the
cache. Typical numbers: 1-2 clock cycles for L1; 5-20 clock cycles for L2.
- Miss penalty: additional time required because of a miss. Typically 50-200 cycles for main memory (trend: increasing!).

Writing Cache-Friendly Code
- Repeated references to variables are good (temporal locality).
- Stride-1 reference patterns are good (spatial locality).
- Examples below assume a cold cache, 4-byte words, and 4-word cache blocks.

    /* Row-wise traversal: stride-1 references, good spatial locality */
    int sumarrayrows(int a[M][N])
    {
        int i, j, sum = 0;

        for (i = 0; i < M; i++)
            for (j = 0; j < N; j++)
                sum += a[i][j];
        return sum;
    }

    /* Column-wise traversal: stride-N references, poor spatial locality */
    int sumarraycols(int a[M][N])
    {
        int i, j, sum = 0;

        for (j = 0; j < N; j++)
            for (i = 0; i < M; i++)
                sum += a[i][j];
        return sum;
    }

Detecting the Cache Parameters
How can one determine the cache parameters?
- Size of the cache
- Size of a cache block
- Hit time
- Miss penalty
- Associativity
- Number of levels in the memory hierarchy
Complicating factors:
- Prefetch support (hardware and software)
- Non-blocking caches ("hit under miss" support)
- Superscalar processors with multiple concurrent memory operations
- Victim caches, stream buffers, line reservation

The Memory Mountain
- Read throughput (read bandwidth): number of bytes read from memory per second (MB/s).
- Memory mountain: measured read throughput as a function of spatial and temporal locality.
- A compact way to characterize memory system performance.

Memory Mountain Test Function

    /* The test function */
    void test(int elems, int stride)
    {
        int i, result = 0;
        volatile int sink;

        for (i = 0; i < elems; i += stride)
            result += data[i];
        sink = result;  /* so the compiler doesn't optimize away the loop */
    }

    /* Run test(elems, stride) and return read throughput (MB/s) */
    double run(int size, int stride, double Mhz)
    {
        double cycles;
        int elems = size / sizeof(int);

        test(elems, stride);  /* warm up the ... */
        ...