Caches
15-213: class18.ppt, CS 213 S'00, March 16, 2000

Topics
• Memory Hierarchy
• Locality of Reference
• Cache Design
  – Direct Mapped
  – Associative

Computer System
[Diagram: the processor and its cache sit on a memory-I/O bus along with main memory and I/O controllers for disks, the display, and the network; devices signal the processor via interrupts.]

Levels in Memory Hierarchy

              Register    Cache          Memory      Disk Memory
  size:       200 B       32 KB - 4 MB   128 MB      20 GB
  speed:      2 ns        4 ns           60 ns       8 ms
  $/Mbyte:                $100/MB        $1.50/MB    $0.05/MB
  line size:  8 B         32 B           8 KB

Moving down the hierarchy, each level is larger, slower, and cheaper. Data moves between adjacent levels in line-sized chunks: 8 B between registers and cache, 32 B between cache and memory, and 8 KB between memory and disk (virtual memory).

Alpha 21164 Chip Photo
From Microprocessor Report, 9/12/94.
Caches on chip:
• L1 data
• L1 instruction
• L2 unified
• TLB
• Branch history

Alpha 21164 Chip Caches
[Die photo with cache areas labeled: right half of L2, L1 instruction, L1 data, L2 tags, L3 control.]

Locality of Reference
Principle of Locality:
• Programs tend to reuse data and instructions near those they have used recently.
• Temporal locality: recently referenced items are likely to be referenced in the near future.
• Spatial locality: items with nearby addresses tend to be referenced close together in time.

    sum = 0;
    for (i = 0; i < n; i++)
        sum += a[i];
    *v = sum;

Locality in Example:
• Data
  – Reference array elements in succession (spatial)
• Instructions
  – Reference instructions in sequence (spatial)
  – Cycle through loop repeatedly (temporal)

Caching: The Basic Idea
Main Memory
• Stores words: A–Z in the example
Cache
• Stores a subset of the words: 4 in the example
• Organized in lines
  – Multiple words per line
  – To exploit spatial locality
Access
• A word must be in the cache for the processor to access it.
[Diagram: processor ↔ small, fast cache (holding A, B, G, H) ↔ big, slow memory (holding A, B, C, ..., Y, Z).]

Basic Idea (Cont.)
Maintaining the Cache:
• Each time the processor performs a load or store, bring the line containing the word into the cache
  – May need to evict an existing line
• Subsequent loads or stores to any word in that line are performed within the cache

Example (cache holds 2 lines, each with 2 words; initially contains A+B and G+H):
• Read C: "cache miss" — load line C+D into the cache
• Read D: "cache hit" — word already in the cache
• Read Z: "cache miss" — load line Y+Z into the cache, evicting the oldest entry

Accessing Data in Memory Hierarchy
• Between any two levels, memory is divided into lines (aka "blocks")
• Data moves between levels on demand, in line-sized chunks
• Invisible to the application programmer
  – Hardware is responsible for cache operation
• Upper-level lines are a subset of lower-level lines
[Diagram: accessing word w in line a hits at the high level; accessing word v in line b misses, so line b is brought up from the low level.]

Design Issues for Caches
Key Questions:
• Where should a line be placed in the cache? (line placement)
• How is a line found in the cache? (line identification)
• Which line should be replaced on a miss? (line replacement)
• What happens on a write? (write strategy)
Constraints:
• Design must be very simple
  – Hardware realization
  – All decision making within nanosecond time scale
• Want to optimize performance for "typical" programs
  – Do extensive benchmarking and simulations
  – Many subtle engineering tradeoffs

Direct-Mapped Caches
Simplest Design
• Each memory line has a unique cache location
Parameters
• Line (aka block) size B = 2^b
  – Number of bytes in each line
  – Typically 2X–8X word size
• Number of sets S = 2^s
  – Number of lines the cache can hold
• Total cache size = B*S = 2^(b+s) bytes
Physical Address
• Address used to reference main memory
• n bits reference N = 2^n total bytes
• Partition into fields:
  – Offset: lower b bits indicate which byte within the line
  – Set index: next s bits indicate which cache set holds the line
  – Tag: remaining t = n - s - b high bits identify the line when it is in the cache

Indexing into Direct-Mapped Cache
• Use the set index bits of the physical address to select a cache set
[Diagram: sets 0 through S-1, each holding a tag, a valid bit, and bytes 0 through B-1 of one line.]
Direct-Mapped Cache Tag Matching
Identifying the Line:
• The tag must match the high-order bits of the address
• The valid bit must be 1
• The lower bits of the address select the byte or word within the cache line

Direct Mapped Cache Simulation
N = 16 byte addresses, B = 2 bytes/line, S = 4 sets, E = 1 entry/set
Address fields: t = 1 tag bit, s = 2 set bits, b = 1 offset bit
Address trace (reads): 0 [0000], 1 [0001], 13 [1101], 8 [1000], 0 [0000]
(1) 0  [0000] (miss): set 0 loads tag 0, m[1] m[0]
    1  [0001] (hit):  same line as address 0
(2) 13 [1101] (miss): set 2 loads tag 1, m[13] m[12]
(3) 8  [1000] (miss): set 0 evicts tag 0 and loads tag 1, m[9] m[8]
(4) 0  [0000] (miss): set 0 evicts tag 1 and reloads tag 0, m[1] m[0]

Why Use Middle Bits as Index?
High-Order Bit Indexing
• Adjacent memory lines would map to the same cache entry
• Poor use of spatial locality
Middle-Order Bit Indexing
• Consecutive memory lines map to different cache lines
• Can hold an N-byte region of the address space in the cache at one time
[Diagram: with a 4-line cache, high-order indexing maps each quarter of memory to a single cache line, while middle-order indexing cycles consecutive memory lines through all four cache lines.]

Direct Mapped Cache Implementation (DECStation 3100)
• 16,384 sets, each holding a valid bit, a 16-bit tag, and 32 bits of data (one-word lines, 64 KB total)
• Address bits 31-16: tag; bits 15-2: set index; bits 1-0: byte offset
• A hit requires a tag match and valid = 1

Properties of Direct Mapped Caches
Strengths
• Minimal control hardware overhead
• Simple design
• (Relatively) easy to make fast
Weakness
• Vulnerable to thrashing
  – Two heavily used lines have the same cache index
  – Repeatedly evict one to make room for the other

Vector Product Example
Machine
• DECStation 5000
• MIPS processor with 64 KB direct-mapped cache, 16 B line size
Performance
• Good case: 24 cycles / element
• Bad case: 66 cycles / element

    float dot_prod(float x[1024], float y[1024])
    {
        float sum = 0.0;
        int i;
        for (i = 0; i < 1024; i++)
            sum += x[i]*y[i];
        return sum;
    }