Cache Memories
Oct 11, 2001

Topics
• Generic cache memory organization
• Direct mapped caches
• Set associative caches
• Impact of caches on performance

class14.ppt
15-213
“The course that gives CMU its Zip!”

CS 213 F’01 – 2 – class14.ppt

Cache memories

Cache memories are small, fast SRAM-based memories managed automatically in hardware.
• Hold frequently accessed blocks of main memory
CPU looks first for data in L1, then in L2, then in main memory.

Typical bus structure:
[Figure labels: CPU chip, register file, ALU, L1 cache, bus interface, cache bus, L2 cache, system bus, I/O bridge, memory bus, main memory]

Inserting an L1 cache between the CPU and main memory

The big slow main memory has room for many 4-word blocks.
The small fast L1 cache has room for two 4-word blocks.
The tiny, very fast CPU register file has room for four 4-byte words.
The transfer unit between the cache and main memory is a 4-word block (16 bytes).
The transfer unit between the CPU register file and the cache is a 4-byte block.

[Figure: main memory blocks 10 (a b c d), 21 (p q r s), and 30 (w x y z); the L1 cache has line 0 and line 1]

General organization of a cache memory

Cache is an array of sets.
Each set contains one or more lines.
Each line holds a block of data.

• S = 2^s sets
• E lines per set
• B = 2^b bytes per cache block
• t tag bits per line
• 1 valid bit per line

Cache size: C = B x E x S data bytes

[Figure: sets 0, 1, ..., S-1; each line in a set holds a valid bit, a tag, and a block of bytes 0, 1, ..., B-1]

Addressing caches

Address A (bit m-1 down to bit 0):
<tag> (t bits)  <set index> (s bits)  <block offset> (b bits)

The word at address A is in the cache if the tag bits in one of the <valid> lines in set <set index> match <tag>.
The word contents begin at offset <block offset> bytes from the beginning of the block.

[Figure: sets 0, 1, ..., S-1; each line holds v, tag, and bytes 0, 1, ..., B-1]

Direct-mapped cache

Simplest kind of cache
Characterized by exactly one line per set.

[Figure: sets 0, 1, ..., S-1, each with E=1 lines per set: valid, tag, cache block]

Accessing direct-mapped caches

Set selection
• Use the set index bits to determine the set of interest.

[Figure: address fields tag (t bits), set index (s bits), block offset (b bits); set index 00001 picks the selected set]

Accessing direct-mapped caches

Line matching and word selection
• find a valid line in the selected set with a matching tag (line matching)
• then extract the word (word selection)

(1) The valid bit must be set
(2) The tag bits in the cache line must match the tag bits in the address
(3) If (1) and (2), then cache hit, and block offset selects starting byte.

[Figure: selected set (i) holds a line with valid = 1, tag 0110, and a block with bytes 0 through 7 containing w0 w1 w2 w3; the address has tag 0110, set index i, block offset 100]

Direct-mapped cache simulation

M=16 byte addresses, B=2 bytes/block, S=4 sets, E=1 entry/set
Address bits: t=1, s=2, b=1

Address trace (reads):
0 [0000]  1 [0001]  13 [1101]  8 [1000]  0 [0000]

(1) 0 [0000] (miss)
    v  tag  data
    1  0    m[1] m[0]

(2) 13 [1101] (miss)
    v  tag  data
    1  0    m[1] m[0]
    1  1    m[13] m[12]

(3) 8 [1000] (miss)
    v  tag  data
    1  1    m[9] m[8]

(4) 0 [0000] (miss)
    v  tag  data
    1  0    m[1] m[0]
    1  1    m[13] m[12]

Why use middle bits as index?

High-Order Bit Indexing
• Adjacent memory lines would map to same cache entry
• Poor use of spatial locality

Middle-Order Bit Indexing
• Consecutive memory lines map to different cache lines
• Can hold C-byte region of address space in cache at one time

[Figure: a 4-line cache (lines 00, 01, 10, 11) beside the sixteen memory lines 0000 through 1111, shown once for high-order bit indexing and once for middle-order bit indexing]

Set associative caches

Characterized by more than one line per set

[Figure: sets 0, 1, ..., S-1, each with E=2 lines per set; every line holds valid, tag, cache block]
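The direct-mapped simulation slide can be replayed in a few lines of C. The sketch below is illustrative only and is not part of the lecture: the names dm_line, dm_count_misses, and slide_trace_misses are hypothetical, and the cache tracks only valid bits and tags because hits and misses do not depend on the stored data. It splits each address into tag, set index, and block offset using the slide's parameters (t=1, s=2, b=1).

```c
#include <assert.h>
#include <stddef.h>

/* One line of a direct-mapped cache (E = 1): a valid bit and a tag.
   The data block is left out because only hits and misses are counted. */
struct dm_line {
    int valid;
    unsigned tag;
};

/* Hypothetical helper: replays a read trace through a direct-mapped cache
   with 2^s sets and 2^b bytes per block and returns the number of misses.
   lines must point to 2^s entries; the cache is cleared first (cold start). */
static int dm_count_misses(struct dm_line *lines, unsigned s, unsigned b,
                           const unsigned *trace, size_t n)
{
    assert(s + b < 8 * sizeof(unsigned));
    unsigned num_sets = 1u << s;
    for (unsigned i = 0; i < num_sets; i++)
        lines[i].valid = 0;

    int misses = 0;
    for (size_t k = 0; k < n; k++) {
        unsigned addr = trace[k];
        unsigned set = (addr >> b) & (num_sets - 1); /* middle s bits */
        unsigned tag = addr >> (s + b);              /* remaining high bits */
        if (!(lines[set].valid && lines[set].tag == tag)) {
            misses++;
            lines[set].valid = 1; /* fetch the block, replacing the old one */
            lines[set].tag = tag;
        }
    }
    return misses;
}

/* The slide's configuration: S = 4 sets (s = 2), B = 2 bytes/block (b = 1),
   and the read trace 0, 1, 13, 8, 0. */
static int slide_trace_misses(void)
{
    struct dm_line lines[4];
    const unsigned trace[] = {0, 1, 13, 8, 0};
    return dm_count_misses(lines, 2, 1, trace, 5);
}
```

Under these assumptions slide_trace_misses() reports 4 misses in 5 reads. Only the read of address 1 hits, because it falls in the same 2-byte block as address 0. Addresses 0 and 8 both map to set 0 with different tags, so they evict each other, which is why the final read of address 0 misses again.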
Accessing set associative caches

Set selection
• identical to direct-mapped cache

[Figure: address fields tag (t bits), set index (s bits), block offset (b bits); set index 00001 picks the selected set, which holds two lines of valid, tag, cache block]

Accessing set associative caches

Line matching and word selection
• must compare the tag in each valid line in the selected set.

(1) The valid bit must be set.
(2) The tag bits in one of the cache lines must match the tag bits in the address
(3) If (1) and (2), then cache hit, and block offset selects starting byte.

[Figure: selected set (i) holds two valid lines, one with tag 1001 and one with tag 0110 whose block (bytes 0 through 7) contains w0 w1 w2 w3; the address has tag 0110, set index i, block offset 100]

Multi-level caches

Options: separate data and instruction caches, or a unified cache

                  size          speed   $/Mbyte    line size
regs              200 B         3 ns               8 B
L1 Icache/Dcache  8-64 KB       3 ns               32 B
L2 Cache          1-4MB SRAM    6 ns    $100/MB    32 B
Memory            128 MB DRAM   60 ns   $1.50/MB   8 KB
disk              30 GB         8 ms    $0.05/MB

[Figure labels: Processor (regs, TLB, L1 Icache, L1 Dcache), L2 Cache, Memory, disk]

Moving away from the processor: larger, slower, cheaper; larger line size, higher associativity, more likely to write back

Intel Pentium cache hierarchy

Processor Chip
• Regs.
• L1 Data: 1 cycle latency, 16KB, 4-way assoc, Write-through, 32B lines
• L1 Instruction: 16KB, 4-way, 32B lines
• L2 Unified: 128KB--2 MB, 4-way assoc, Write-back, Write allocate, 32B lines
Main Memory
• Up to 4GB

Cache performance metrics

Miss Rate
• fraction of memory references not found in cache (misses/references)
• Typical numbers:
  3-10% for L1
  can be quite small (e.g., < 1%) for L2, depending on size, etc.

Hit Time
• time to deliver a line in the cache to the processor (includes time to determine whether the line is in the cache)
• Typical numbers:
  1 clock cycle for L1
  3-8 clock cycles for L2

Miss Penalty
• additional time required because of a miss
  – Typically 25-100 cycles for main memory

Writing cache friendly code

Repeated references to variables are good (temporal locality)
Stride-1 reference patterns are good (spatial locality)

Example
• cold cache, 4-byte words, 4-word cache blocks
int
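To make the stride-1 guideline concrete, the following C sketch counts cold misses for a single pass over an array of words at a given stride. It is an independent illustration, not the slide's own example. The function name sequential_misses is hypothetical, and the sketch assumes the array starts on a block boundary and that the cache is large enough that a block is never evicted while the pass is still touching it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of cold misses when reading n_accesses words
   at word indices 0, stride, 2*stride, ... in one forward pass.
   Assumptions (not from the slides): the array is block-aligned and no block
   is evicted while still in use, so a miss happens exactly when an access
   lands in a different block than the previous access. */
static size_t sequential_misses(size_t n_accesses, size_t stride_words,
                                size_t word_bytes, size_t words_per_block)
{
    assert(word_bytes > 0 && words_per_block > 0);
    size_t block_bytes = word_bytes * words_per_block;
    size_t misses = 0;
    size_t last_block = 0;

    for (size_t i = 0; i < n_accesses; i++) {
        size_t byte_offset = i * stride_words * word_bytes;
        size_t block = byte_offset / block_bytes;
        if (i == 0 || block != last_block) {
            misses++;          /* first touch of this block: cold miss */
            last_block = block;
        }
    }
    return misses;
}
```

With the slide's parameters (4-byte words, 4-word blocks), sequential_misses(16, 1, 4, 4) gives 4 misses for 16 reads, a 25% miss rate, since each fetched block serves four consecutive words. Raising the stride to 4 words, sequential_misses(16, 4, 4, 4), makes every read land in a new block, so all 16 reads miss.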