Carnegie Mellon
Introduction to Computer Systems (15-213), Fall 2009
11th Lecture, Oct. 5th
Instructors: Majd Sakr and Khaled Harras

Last Time: Memory Hierarchy
Example numbers for a Core 2 Duo system:
- CPU registers
- L1 I-cache (32 KB) and L1 D-cache: throughput 16 B/cycle, latency 3 cycles
- L2 unified cache (4 MB): throughput 8 B/cycle, latency 14 cycles
- Main memory (4 GB): throughput 2 B/cycle, latency 100 cycles
- Disk (500 GB): throughput 1 B per 30 cycles, latency in the millions of cycles

Last Time: Locality
- Temporal locality: recently referenced items are likely to be referenced again in the near future.
- Spatial locality: items with nearby addresses tend to be referenced close together in time.

Today
- Cache organization
- Memory wall
- Program optimization: matrix multiplication
- Cache optimizations: cache miss analysis

Cache Memories
- Cache memories are small, fast SRAM-based memories managed automatically in hardware.
- They hold frequently accessed blocks of main memory.
- The CPU looks for data first in L1, then in L2, then in main memory.
- Typical system structure: [figure: the CPU chip holds the register file, ALU, L1 cache, and an L2 cache (data plus tags behind an SRAM port) on the bus interface; the system bus connects through the I/O bridge and the memory bus to main memory]

Inserting an L1 Cache Between the CPU and Main Memory
- The transfer unit between the CPU register file and the cache is a 4-byte word.
- The transfer unit between the cache and main memory is a 4-word block (16 bytes).
- The tiny, very fast CPU register file has room for four 4-byte words.
- The small, fast L1 cache has room for two 4-word blocks (line 0 and line 1).
- The big, slow main memory has room for many 4-word blocks.

General Cache Organization (S, E, B)
- S = 2^s sets
- E = 2^e lines per set
- B = 2^b bytes per cache block (the data); each line also carries a valid bit and a tag
- Cache size = S x E x B data bytes

Cache Read
- Locate the set using the set index bits.
- Check whether any line in that set has a matching tag.
- If such a line exists and its valid bit is set: hit.
- Locate the data starting at the block offset.
- Address of word: tag (t bits) | set index (s bits) | block offset (b bits); the data begins at this offset within the block.
Example: Direct-Mapped Cache (E = 1)
- Direct-mapped: one line per set.
- Assume a cache block size of 8 bytes, so the block offset is 3 bits.
- Address of an int: tag (t bits) | set index (0..01) | block offset (100).
- The set index selects the set; if that line is valid and its tag matches (assume yes): hit.
- The block offset (100) locates the int (4 bytes) within the block.
- If the tag does not match, the old line is evicted and replaced.

Direct-Mapped Cache Simulation
- M = 16 byte addresses (4 address bits), B = 2 bytes/block, S = 4 sets, E = 1 line/set.
- Thus t = 1, s = 2, b = 1; an address reads x | xx | x.
- Address trace (reads): 0 (0000), 1 (0001), 7 (0111), 8 (1000), 0 (0000).
- Results: miss, hit, miss, miss, miss.
- The access to 8 evicts block M[0-1] from set 0 and replaces it with M[8-9], so the final access to 0 misses again; set 3 ends up holding M[6-7].

E-way Set Associative Cache (Here E = 2)
- E = 2: two lines per set.
- Assume a cache block size of 8 bytes.
- Address of a short int: tag (t bits) | set index (01) | block offset (100).
- Find the set, then compare both lines; if one is valid and its tag matches: hit.
- The block offset locates the short int (2 bytes) within the block.
- If no line matches, one line in the set is selected for eviction and replacement.
- Replacement policies: random, least recently used (LRU).
2-Way Set Associative Cache Simulation
- M = 16 byte addresses (4 address bits), B = 2 bytes/block, S = 2 sets, E = 2 lines/set.
- Thus t = 2, s = 1, b = 1; an address reads xx | x | x.
- Address trace (reads): 0 (0000), 1 (0001), 7 (0111), 8 (1000), 0 (0000).
- Results: miss, hit, miss, miss, hit.
- The final access to 0 now hits: with two lines per set, M[0-1] (tag 00) and M[8-9] (tag 10) coexist in set 0.

Why Use Middle Bits as Index?
- High-order bit indexing: adjacent memory lines map to the same cache entry, so spatial locality is used poorly.
- Middle-order bit indexing: consecutive memory lines map to different cache lines, so the cache can hold an S x B x E-byte region of the address space at one time.

What about writes?
- Multiple copies of the data can exist: L1, L2, main memory, disk.
- What to do on a write hit?
  - Write-through: write immediately to memory.
  - Write-back: defer the write to memory until the line is replaced; requires a dirty bit (is the line different from memory or not?).
- What to do on a write miss?
  - Write-allocate: load the block into the cache and update the line there; good if more writes to the location follow.
  - No-write-allocate: write immediately to memory, bypassing the cache.
- Typical combinations: write-through with no-write-allocate, or write-back with write-allocate.

Cache Performance Metrics
- Miss rate: fraction of memory references not found in the cache = misses / accesses = 1 - hit rate.
  Typical numbers (in percentages): 3-10% for L1; can be quite small (e.g., < 1%) for L2, depending on size, etc.
- Hit time: time to deliver a …