Outline CPE 631 Caches Memory Hierarchy Four Questions for Memory Hierarchy Cache Performance Electrical and Computer Engineering University of Alabama in Huntsville Aleksandar Milenkovic milenka ece uah edu http www ece uah edu milenka 1 Reduce the miss rate 2 Reduce the miss penalty or 3 Reduce the time to hit in the cache AM 2 LaCASA Processor DRAM Latency Gap Solution The Memory Hierarchy MH Processor 2x 1 5 year 1000 User sees as much memory as is available in cheapest technology and access it at the speed offered by the fastest technology CPU Performance Levels in Memory Hierarchy Lower Upper 100 Processor Processor Memory Performance Gap grows 50 year Control Memory 2x 10 years 10 Datapath DRAM AM LaCASA Speed Capacity Cost bit 2000 1999 1998 1997 1996 1995 1994 1993 1992 1991 Time 1990 1989 1988 1987 1986 1985 1984 1983 1982 1981 1 1980 AM 3 LaCASA Fastest Smallest Highest Slowest Biggest Lowest 4 1 Generations of Microprocessors Why hierarchy works Time of a full cache miss in instructions executed 1st Alpha 340 ns 5 0 ns 68 clks x 2 or 136 2nd Alpha 266 ns 3 3 ns 80 clks x 4 or 320 3rd Alpha 180 ns 1 7 ns 108 clks x 6 or 648 1 2X latency x 3X clock rate x 3X Instr clock 5X Probability of reference Address space AM AM 5 LaCASA Levels of the Memory Hierarchy Bl Y CPU Registers Hit time Miss Penalty Cache LaCASA Hit Rate the fraction of memory access found in the upper level Hit Time time to access the upper level RAM access time Time to determine hit miss Main Memory M Bytes 100ns 300ns 1 MByte Miss data needs to be retrieved from the lower level Bl Y AM 10s 100s K Bytes 1 10 ns 10 MByte Hit data appears in some block in the upper level Bl X Registers 100s Bytes 1s ns From Processor 6 Capacity Access Time Cost Lower Level Memory To Processor Bl X Temporal locality recently accessed items are likely to be accessed in the near future Keep them close to the processor Spatial locality items whose addresses are near one another tend to be referenced close together in time Move blocks consisted of contiguous words to the upper level LaCASA Cache Measures Upper Level Memory Rule of thumb Programs spend 90 of their execution time in only 10 of code Principle of locality Disk Miss rate 1 Hit Rate Miss penalty time to replace a block in the upper level time to retrieve the block from the lower level AM Average memory access time Hit time Miss rate x Miss penalty ns or clocks 7 10s G Bytes 10 ms 10 000 000 ns 0 0031 MByte Tape infinite sec min 0 0014 MByte LaCASA Upper Leve Staging Xfer Unit faster Instr Operands prog compiler Cache Blocks Memory 1 8 bytes cache cntl 8 128 bytes Pages OS 512 4K bytes Files user operator Mbytes Disk Tape Larger Lower Level 8 2 Four Questions for Memory Heir Q 1 Where can a block be placed in the upper level Block placement Q 2 How is a block found if it is in the upper level Block identification Q 3 Which block should be replaced on a miss Block replacement AM Direct Mapped Cache direct mapped fully associative set associative Random LRU Least Recently Used Q 4 What happens on a write Write strategy Therefore we only need to look in a single location in the cache for the data if it exists in the cache Block is the unit of transfer between cache and memory AM Write through vs write back Write allocate vs No write allocate 9 LaCASA In a direct mapped cache each memory address is associated with one possible block within the cache 10 LaCASA Q1 Where can a block be placed in the upper level Direct Mapped Cache cont d Memory Block 12 placed in 8 block cache Memory 0 Address 1 2 3 4 5 6 7 8 9 A B C AM D E F Fully associative direct mapped 2 way set associative S A Mapping Block Number Modulo Number Sets Full Mapped 01234567 Direct Mapped 12 mod 8 4 01234567 2 Way Assoc 12 mod 4 0 00112233 Cache 1111111111222222222233 01234567890123456789012345678901 AM Memory LaCASA 11 LaCASA Cache Index 0 1 2 3 Cache 4 byte 12 3 Direct Mapped Cache cont d INDEX specifies the cache index which row of the cache we should look in OFFSET once we have found correct block specifies which byte within the block we want TAG the remaining bits after offset and index are determined these are used to distinguish between all the memory addresses that map to the same location BLOCK ADDRESS TAG INDEX Block Address tttttttttttttttttt iiiiiiiiii oooo TAG to check if have the correct block AM INDEX to select block OFFSET to select byte within the block AM 13 LaCASA 14 LaCASA Direct Mapped Cache Example Conditions 32 bit architecture word 32bits address unit is byte 8KB direct mapped cache with 4 words blocks For a 2 N byte cache Determine the size of the Tag Index and Offset fields AM 1 KB Direct Mapped Cache 32B blocks The uppermost 32 N bits are always the Cache Tag The lowest M bits are the Byte Select Block Size 2 M 9 31 Cache Tag OFFSET specifies correct byte within block cache block contains 4 words 16 24 bytes 4 bits INDEX specifies correct row in the cache cache size is 8KB 213 bytes cache block is 24 bytes Rows in cache 1 block 1 row 213 24 29 9 bits TAG Memory address length offset index Example 0x50 4 Byte Select Ex 0x01 Ex 0x00 Stored as part of the cache state Valid Bit Cache Tag Cache Data Byte 31 0x50 Byte 63 Byte 1 Byte 0 0 Byte 33 Byte 32 1 2 32 4 9 19 tag is leftmost 19 bits AM 3 Byte 1023 LaCASA 0 Cache Index Since multiple memory addresses map to same cache index how do we tell which one is in there What if we have a block size 1 byte Result divide memory address into three fields 15 LaCASA Direct Mapped Cache Terminology Byte 992 31 16 4 Two way Set Associative Cache Disadvantage of Set Associative Cache N way set associative N entries for each Cache Index Example Two way set associative cache Valid Cache Index selects a set from the cache The two tags in the set are compared in parallel Data is selected based on the tag result Cache Tag AM Adr Tag Compare Cache Data Cache Index Cache Data Cache Block 0 Cache Block 0 Sel1 1 Mux 0 Sel0 N way Set Associative Cache v Direct Mapped Cache N direct mapped caches operates in parallel N typically 2 to 4 In a direct mapped cache Cache Block is available BEFORE Hit Miss Cache Tag N comparators vs 1 Extra MUX delay for the data Data comes AFTER Hit Miss Possible to assume a hit and continue Recover later if miss Valid Valid Cache Tag AM …
View Full Document
Unlocking...