15-213 "The course that gives CMU its Zip!"
Caches (class14.ppt)
October 12, 2000 (CS 213 F'00)

Topics
• Memory Hierarchy
  – Locality of Reference
• SRAM Caches
  – Direct Mapped
  – Associative Caches

Computer System
[Diagram: the processor and its cache sit on the memory-I/O bus along with main memory and several I/O controllers (disks, display, network); I/O devices signal the processor via interrupts.]

Levels in Memory Hierarchy

                 Register    Cache          Memory      Disk Memory
    size:        200 B       32 KB / 4 MB   128 MB      30 GB
    speed:       3 ns        4 ns           60 ns       8 ms
    $/Mbyte:                 $100/MB        $1.50/MB    $0.05/MB
    line size:   8 B         32 B           8 KB

Levels lower in the hierarchy are larger, slower, and cheaper. Data moves between adjacent levels in line-sized units: 8 B between registers and cache, 32 B between cache and memory (managed by the cache), and 8 KB between memory and disk (managed by virtual memory).

Alpha 21164 Chip Caches
Caches on the chip:
• L1 data
• L1 instruction
• L2 unified
• TLB
• Branch history
[Die photo: the two halves of the L2 array, the L1 instruction and data caches, the L2 tags, and the L3 control occupy much of the die.]

Locality of Reference
Principle of Locality:
• Programs tend to reuse data and instructions near those they have used recently.
• Temporal locality: recently referenced items are likely to be referenced in the near future.
• Spatial locality: items with nearby addresses tend to be referenced close together in time.

    sum = 0;
    for (i = 0; i < n; i++)
        sum += a[i];
    *v = sum;

Locality in Example:
• Data
  – Reference array elements in succession (spatial)
• Instructions
  – Reference instructions in sequence (spatial)
  – Cycle through loop repeatedly (temporal)

Caching: The Basic Idea
Main Memory
• Stores words; A through Z in the example
Cache
• Stores a subset of the words; 4 in the example
• Organized in lines
  – Multiple words
  – To exploit spatial locality
Access
• A word must be in the cache for the processor to access it
[Diagram: processor, small fast cache holding lines A|B and G|H, big slow memory holding A through Z.]

Basic Idea (Cont.)
Maintaining the Cache:
• Each time the processor performs a load or store, bring the line containing the word into the cache
  – May need to evict an existing line
• Subsequent loads or stores to any word in the line are performed within the cache
Example (cache holds 2 lines, each with 2 words):
• Initially the cache holds A|B and G|H.
• Read C: a "cache miss"; load line C+D into the cache.
• Read D: a "cache hit"; the word is already in the cache.
• Read Z: a "cache miss"; load line Y+Z into the cache, evicting the oldest entry (A|B).

Accessing Data in Memory Hierarchy
• Between any two levels, memory is divided into lines (aka "blocks")
• Data moves between levels on demand, in line-sized chunks
• Invisible to the application programmer
  – Hardware is responsible for cache operation
• Upper-level lines are a subset of lower-level lines
[Diagram: accessing word w in line a is a hit when line a is already in the high level; accessing word v in line b is a miss that copies line b up from the low level.]

Design Issues for Caches
Key Questions:
• Where should a line be placed in the cache? (line placement)
• How is a line found in the cache? (line identification)
• Which line should be replaced on a miss? (line replacement)
• What happens on a write? (write strategy)
Constraints:
• Design must be very simple
  – Hardware realization
  – All decision making within nanosecond time scales
• Want to optimize performance for "typical" programs
  – Do extensive benchmarking and simulations
  – Many subtle engineering tradeoffs

Direct-Mapped Caches
Simplest Design
• Each memory line has a unique cache location
Parameters
• Line (or block) size B = 2^b
  – Number of bytes in each line
  – Typically 2x to 8x the word size
• Number of sets S = 2^s
  – Number of lines the cache can hold
• Total cache size = B*S = 2^(b+s)
Physical Address
• Address used to reference main memory
• m bits to reference M = 2^m total bytes
• Partition into fields
  – Offset: the lower b bits indicate which byte within the line
  – Set: the next s bits indicate how to locate the line within the cache
  – Tag: the remaining t = m - s - b high bits identify the line when it is in the cache
m-bit physical address: [ tag (t bits) | set index (s bits) | offset (b bits) ]

Indexing into Direct-Mapped Cache
• Use the set index bits of the physical address to select one of the S cache sets
• Each set holds a tag, a valid bit, and bytes 0 .. B-1 of the line

Direct-Mapped Cache Tag Matching
Identifying the line within the selected set:
• Must have the tag match the high-order bits of the address (tag = ?)
• Must have Valid = 1
• The lower bits of the address select the byte or word within the cache line

Direct Mapped Cache Simulation
M = 16 byte addresses, B = 2 bytes/line, S = 4 sets, E = 1 entry/set
Address fields: t = 1 tag bit, s = 2 set bits, b = 1 offset bit
Address trace (reads): 0 [0000], 1 [0001], 13 [1101], 8 [1000], 0 [0000]
(1) 0 [0000]: miss; set 0 loads tag 0, data m[1] m[0]
    1 [0001]: hit; word m[1] is already in set 0
(2) 13 [1101]: miss; set 2 loads tag 1, data m[13] m[12]
(3) 8 [1000]: miss; set 0 evicts tag 0 and loads tag 1, data m[9] m[8]
(4) 0 [0000]: miss; set 0 evicts tag 1 and reloads tag 0, data m[1] m[0]

Why Use Middle Bits as Index?
High-Order Bit Indexing
• Adjacent memory lines would map to the same cache entry
• Poor use of spatial locality
Middle-Order Bit Indexing
• Consecutive memory lines map to different cache lines
• Can hold a C-byte region of the address space in the cache at one time
[Figure: the 16 memory lines 0000-1111 mapped into a 4-line cache; high-order indexing sends lines 0000-0011 all to the same set, while middle-order indexing spreads consecutive lines across all four sets.]

Direct Mapped Cache Implementation (DECStation 3100)
• Address split: tag = bits 31-16 (16 bits), set index = bits 15-2, byte offset = bits 1-0
• 16,384 sets; each set holds a valid bit, a 16-bit tag, and 32 bits of data
• Hit when the selected set is valid and its tag matches; the data output is the addressed word

Properties of Direct Mapped Caches
Strength
• Minimal control hardware overhead
• Simple design
• (Relatively) easy to make fast
Weakness
• Vulnerable to thrashing
• Two heavily used lines with the same cache index
• Repeatedly evict one to make room for the other

Vector Product Example
Machine
• DECStation 5000
• MIPS processor with 64 KB direct-mapped cache, 16 B line size
Performance
• Good case: 24 cycles / element
• Bad case: 66 cycles / element

    float dot_prod(float x[1024], float y[1024])
    {
        float sum = 0.0;
        int i;
        for (i = 0; i < 1024; i++)
            sum += x[i] * y[i];
        return sum;
    }

Thrashing Example
• Access one element from each array per