Outline
- Cache Memories
- New Topic: Cache
- Inserting an L1 Cache Between the CPU and Main Memory
- General Org of a Cache Memory
- Addressing Caches
- Direct-Mapped Cache
- Accessing Direct-Mapped Caches
- Direct-Mapped Cache Simulation
- Why Use Middle Bits as Index?
- Set-Associative Caches
- Accessing Set-Associative Caches
- Multi-Level Caches
- Intel Pentium Cache Hierarchy
- Cache Performance Metrics
- Write Strategies
- Writing Cache-Friendly Code
- The Memory Mountain
- Memory Mountain Test Function
- Memory Mountain Main Routine
- Ridges of Temporal Locality
- A Slope of Spatial Locality
- Matrix-Multiplication Example
- Miss-Rate Analysis for Matrix Multiply
- Layout of C Arrays in Memory (review)
- Matrix Multiplication (ijk, jik, kij, ikj, jki, kji)
- Summary of Matrix Multiplication
- Pentium Matrix Multiply Performance
- Improving Temporal Locality by Blocking
- Blocked Matrix Multiply (bijk)
- Blocked Matrix Multiply Analysis
- Pentium Blocked Matrix Multiply Performance
- Concluding Observations

Cache Memories
cache.ppt, CS 105: Tour of the Black Holes of Computing
Topics
- Generic cache memory organization
- Direct-mapped caches
- Set-associative caches
- Impact of caches on performance

New Topic: Cache
- A buffer between the processor and memory; there are often several levels of caches
- Small but fast; old values are evicted from the cache to make room for new ones
- Capitalizes on spatial locality and temporal locality
  - Spatial locality: if a value is used, nearby values are likely to be used
  - Temporal locality: if a value is used, it is likely to be used again soon
- Cache parameters vary by system and are unknown to the programmer, hence "cache-friendly" code

Cache Memories
- Cache memories are small, fast SRAM-based memories managed automatically in hardware; they hold frequently accessed blocks of main memory
- The CPU looks for data first in L1, then in L2, then in main memory
- Typical bus structure: [Figure: the CPU chip contains the register file, ALU, L1 cache, and bus interface; a cache bus connects the chip to the L2 cache; the system bus leads through the I/O bridge to the memory bus and main memory]

Inserting an L1 Cache Between the CPU and Main Memory
- The tiny, very fast CPU register file has room for four 4-byte words
- The small, fast L1 cache has room for two 4-word blocks (lines 0 and 1)
- The big, slow main memory has room for many 4-word blocks (e.g., block 10 holds a b c d, block 21 holds p q r s, block 30 holds w x y z)
- The transfer unit between the cache and main memory is a 4-word block (16 bytes); the transfer unit between the CPU register file and the cache is a 4-byte word

General Org of a Cache Memory
- A cache is an array of S = 2^s sets
- Each set contains E lines (one or more)
- Each line holds a block of B = 2^b data bytes, plus one valid bit and t tag bits
- Cache size: C = B × E × S data bytes

Addressing Caches
- An m-bit address A is divided into three fields, from high to low: <tag> (t bits), <set index> (s bits), and <block offset> (b bits)
- The word at address A is in the cache if the tag bits of one of the valid lines in set <set index> match <tag>
- The word's contents begin <block offset> bytes from the beginning of the block

Direct-Mapped Cache
- The simplest kind of cache: characterized by exactly one line per set (E = 1)
- [Figure: sets 0 through S-1, each containing a single valid bit, tag, and cache block]
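The field extraction described on the addressing slide can be sketched in a few lines. This is an illustrative Python sketch, not code from the course; the function name `split_address` is ours, and the example parameters are taken from the simulation slide's geometry (s = 2 set-index bits, b = 1 block-offset bit).

```python
def split_address(addr, s, b):
    """Split an address into (tag, set index, block offset) fields.

    The low b bits are the block offset, the next s bits are the set
    index, and the remaining high bits are the tag.
    """
    block_offset = addr & ((1 << b) - 1)
    set_index = (addr >> b) & ((1 << s) - 1)
    tag = addr >> (b + s)
    return tag, set_index, block_offset

# Example: the 4-bit address space of the simulation slide
# (S = 4 sets -> s = 2, B = 2 bytes/block -> b = 1, t = 1 tag bit).
print(split_address(13, 2, 1))  # -> (1, 2, 1): tag 1, set 2, offset 1
```

Address 13 is 1101 in binary, so the low bit (1) is the offset, the middle bits (10) select set 2, and the high bit (1) is the tag.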
Accessing Direct-Mapped Caches
Set selection
- Use the set index bits to determine the set of interest
Line matching and word selection
- Line matching: find a valid line in the selected set with a matching tag:
  (1) the valid bit must be set, and
  (2) the tag bits in the cache line must match the tag bits in the address
- Word selection: if (1) and (2), then cache hit, and the block offset selects the starting byte (in the slide's example, an 8-byte block holds words w0 through w3, and offset 100 in binary, i.e. byte 4, selects the start of w2)

Direct-Mapped Cache Simulation
- Parameters: M = 16 byte addresses, B = 2 bytes/block, S = 4 sets, E = 1 line/set; hence t = 1, s = 2, b = 1
- Address trace (reads): 0 [0000], 1 [0001], 13 [1101], 8 [1000], 0 [0000]
- 0 [0000]: miss; set 0 loads M[0-1] with tag 0
- 1 [0001]: hit; set 0 already holds M[0-1]
- 13 [1101]: miss; set 2 loads M[12-13] with tag 1
- 8 [1000]: miss; set 0 evicts M[0-1] and loads M[8-9] with tag 1
- 0 [0000]: miss; set 0 evicts M[8-9] and reloads M[0-1] (a conflict miss)

Why Use Middle Bits as Index?
- High-order bit indexing: adjacent memory lines would map to the same cache entry, a poor use of spatial locality
- Middle-order bit indexing: consecutive memory lines map to different cache lines, so the cache can hold a contiguous C-byte region of the address space at one time
- [Figure: a 4-line cache against the 16 memory lines 0000 through 1111; with high-order indexing, lines 0000-0011 all map to set 0, while with middle-order indexing they map to sets 0, 1, 2, 3]

Set-Associative Caches
- Characterized by more than one line per set
- [Figure: set 0 with E = 2 lines, each with a valid bit and tag ...]
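The direct-mapped simulation above, and the middle-bits argument, can be checked with a short script. A minimal Python sketch, assuming a read-only direct-mapped cache with the slide's parameters (S = 4 sets, s = 2 index bits, b = 1 offset bit); the names `access`, `trace`, and `results` are ours, not from the course:

```python
S, s, b = 4, 2, 1              # 4 sets, 2 set-index bits, 1 block-offset bit

# One line per set (direct-mapped): each entry is (valid, tag).
cache = [(False, None)] * S

def access(addr):
    """Simulate one read; return 'hit' or 'miss', loading the block on a miss."""
    set_index = (addr >> b) & ((1 << s) - 1)
    tag = addr >> (b + s)
    valid, line_tag = cache[set_index]
    if valid and line_tag == tag:
        return "hit"
    cache[set_index] = (True, tag)  # load the block, evicting whatever was there
    return "miss"

# The slide's address trace: 0, 1, 13, 8, 0
trace = [0, 1, 13, 8, 0]
results = [access(a) for a in trace]
print(results)  # ['miss', 'hit', 'miss', 'miss', 'miss']

# Middle-bit indexing: consecutive block addresses land in different sets,
# so a contiguous region can fill the whole cache.
sets = [(a >> b) & ((1 << s) - 1) for a in range(0, 8, 2)]
print(sets)     # [0, 1, 2, 3]
```

Note that the final access to address 0 misses even though 0 was cached earlier: address 8 maps to the same set and evicted it, which is exactly the conflict-miss behavior the slide's trace illustrates.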