Cache Memories
September 30, 2008
lecture-10.ppt
15-213 "The course that gives CMU its Zip!" (15-213, S'08)

Outline: Announcements; General cache mechanics; Cache performance metrics; Let's think about those numbers; Many types of caches; Hardware cache memories; Inserting an L1 cache between the CPU and main memory; General organization of a cache; Addressing caches; Example: direct-mapped cache; Accessing direct-mapped caches; Direct-mapped cache simulation; Example: set associative cache; Accessing set associative caches; 2-way associative cache simulation; Notice that middle bits are used as index; Why use middle bits as index?; Sidebar: multi-level caches; What about writes?; Software caches are more flexible; Locality examples #1-#3

Topics
- Generic cache memory organization
- Direct-mapped caches
- Set associative caches
- Impact of caches on performance

Announcements

Exam grading done
- Everyone should have gotten email with their score (out of 72); the mean was 50, the high was 70.
- A sample solution should be up on the website soon.
Getting your exam back
- Some got theirs in recitation; we are working on a plan for everyone else (worst case: recitation on Monday).
If you think we made a mistake in grading
- Please read the syllabus for details about the process for handling it.

General Cache Mechanics (from lecture-9.ppt)

[Figure: a small cache holding blocks 8, 9, 14, and 3, above a larger memory partitioned into blocks 0-15.]
- The larger, slower, cheaper memory is partitioned into "blocks".
- The smaller, faster, more expensive memory caches a subset of the blocks.
- Data is copied between levels in block-sized transfer units.

Cache Performance Metrics

Miss rate
- Fraction of memory references not found in the cache (misses / accesses) = 1 - hit rate.
- Typical numbers: 3-10% for L1; can be quite small (e.g., < 1%) for L2, depending on size, etc.
Hit time
- Time to deliver a line in the cache to the processor; includes the time to determine whether the line is in the cache.
- Typical numbers: 1-2 clock cycles for L1; 5-20 clock cycles for L2.
Miss penalty
- Additional time required because of a miss; typically 50-200 cycles for main memory (trend: increasing!).

Let's Think About Those Numbers

Huge difference between a hit and a miss
- 100x, if just L1 and main memory.
Would you believe 99% hits is twice as good as 97%?
- Consider these numbers: a cache hit time of 1 cycle and a miss penalty of 100 cycles. The average access time is:
  97% hits: 1 cycle + 0.03 * 100 cycles = 4 cycles
  99% hits: 1 cycle + 0.01 * 100 cycles = 2 cycles
This is why "miss rate" is used instead of "hit rate".

Many Types of Caches

Examples
- Hardware: L1 and L2 CPU caches, TLBs, ...
- Software: virtual memory, FS buffers, web browser caches, ...
Many common design issues
- Each cached item has a "tag" (an ID) plus contents.
- Need a mechanism to efficiently determine whether a given item is cached: combinations of indices and constraints on valid locations.
- On a miss, usually need to pick something to replace with the new item: this is called a "replacement policy".
- On writes, need to either propagate the change or mark the item as "dirty": write-through vs. write-back.
Different solutions for different caches; let's talk about CPU caches as a concrete example...

Hardware Cache Memories

- Cache memories are small, fast SRAM-based memories managed automatically in hardware; they hold frequently accessed blocks of main memory.
- The CPU looks first for data in L1, then in main memory.
- Typical system structure: the CPU chip holds the register file, ALU, and L1 cache, and talks through a bus interface over the system bus to main memory.

Inserting an L1 Cache Between the CPU and Main Memory

[Figure: the register file, above a two-line L1 cache (line 0, line 1), above main memory blocks ... block 10 (a b c d), block 21 (p q r s), block 30 (w x y z), ...]
- The tiny, very fast CPU register file has room for four 4-byte words.
- The small, fast L1 cache has room for two 4-word blocks.
- The big, slow main memory has room for many 4-word blocks.
- The transfer unit between the CPU register file and the cache is a 4-byte word; the transfer unit between the cache and main memory is a 4-word block (16 bytes).

General Organization of a Cache

[Figure: the cache drawn as sets 0 through S-1; each set holds E lines; each line has a valid bit, a tag, and data bytes 0 through B-1.]
- A cache is an array of S = 2^s sets.
- Each set contains one or more lines (E lines per set).
- Each line holds a block of data: B = 2^b bytes per cache block, plus 1 valid bit and t tag bits per line.
- Cache size: C = B x E x S data bytes.

Addressing Caches

An m-bit address A is divided into three fields (bits m-1 down to 0):
  <tag> (t bits) | <set index> (s bits) | <block offset> (b bits)
- The word at address A is in the cache if the tag bits in one of the valid lines in set <set index> match <tag>.
- The word contents begin at offset <block offset> bytes from the beginning of the block.
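The average-access-time arithmetic from the "Let's think about those numbers" slide (1-cycle hit, 100-cycle miss penalty) can be checked with a short script. This is an illustrative sketch, not lecture code; the function name is mine:

```python
def avg_access_time(hit_time, miss_rate, miss_penalty):
    """Average memory access time in cycles: every access pays the
    hit time, and misses additionally pay the miss penalty."""
    return hit_time + miss_rate * miss_penalty

# The slide's example numbers: 1-cycle hit time, 100-cycle miss penalty.
print(avg_access_time(1, 0.03, 100))  # 97% hits: ~4 cycles
print(avg_access_time(1, 0.01, 100))  # 99% hits: ~2 cycles
```

Going from 97% to 99% hits halves the average access time, which is the slide's point about why miss rate, not hit rate, is the number to watch.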
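The C = B x E x S size formula and the (tag, set index, block offset) address split from the "General Organization" and "Addressing Caches" slides can be sketched as follows. The 32-bit address width and the example parameters (32-byte blocks, 256 sets) are assumptions for illustration, not numbers from the slides:

```python
def cache_params(B, E, S, m=32):
    """Given block size B = 2^b bytes, E lines per set, and S = 2^s sets,
    return (b, s, t, C) for m-bit addresses."""
    b = B.bit_length() - 1      # block offset bits, since B = 2^b
    s = S.bit_length() - 1      # set index bits, since S = 2^s
    t = m - s - b               # remaining high-order bits are the tag
    C = B * E * S               # cache size in data bytes
    return b, s, t, C

def split_address(addr, b, s):
    """Split an address into (tag, set index, block offset) fields."""
    offset = addr & ((1 << b) - 1)
    set_index = (addr >> b) & ((1 << s) - 1)
    tag = addr >> (b + s)
    return tag, set_index, offset

# Example: 32-byte blocks, direct-mapped (E = 1), 256 sets -> 8 KB cache.
b, s, t, C = cache_params(32, 1, 256)
print(b, s, t, C)  # 5 8 19 8192
```

Shifting and re-OR-ing the three fields back together reproduces the original address, which is a handy sanity check on any (t, s, b) split.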
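The lookup rule on the "Addressing Caches" slide (a hit requires a valid line in the indexed set whose tag matches) can be simulated with a minimal direct-mapped cache. The class and its names are a hypothetical sketch, not the lecture's code; data bytes are omitted to keep the tag/valid mechanics visible:

```python
class DirectMappedCache:
    """Sketch of a direct-mapped cache: one line per set, each line
    modeled as a (valid, tag) pair."""
    def __init__(self, num_sets, block_bits):
        self.s = num_sets.bit_length() - 1    # set index bits
        self.b = block_bits                   # block offset bits
        self.lines = [(False, 0)] * num_sets  # (valid, tag) per set

    def access(self, addr):
        """Return True on a hit; on a miss, install the block's tag
        (direct-mapped, so the new block replaces the set's only line)."""
        set_index = (addr >> self.b) & ((1 << self.s) - 1)
        tag = addr >> (self.b + self.s)
        valid, stored_tag = self.lines[set_index]
        if valid and stored_tag == tag:
            return True
        self.lines[set_index] = (True, tag)   # miss: fill the line
        return False

cache = DirectMappedCache(num_sets=4, block_bits=4)  # 4 sets, 16-byte blocks
print(cache.access(0x00))  # cold miss -> False
print(cache.access(0x04))  # same block -> True
print(cache.access(0x40))  # same set, different tag -> False (evicts 0x00)
print(cache.access(0x00))  # block was evicted -> False
```

The last two accesses show a conflict miss: 0x00 and 0x40 map to the same set, so in a direct-mapped cache they keep evicting each other even though the cache has room elsewhere.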