CS 152 Computer Architecture and Engineering
Lecture 7 - Memory Hierarchy-II

Krste Asanovic
Electrical Engineering and Computer Sciences
University of California at Berkeley
http://www.eecs.berkeley.edu/~krste
http://inst.eecs.berkeley.edu/~cs152
2/14/2008 CS152-Spring'08

Outline
• Last time in Lecture 6
• Relative Memory Cell Sizes
• Placement Policy
• Direct-Mapped Cache
• 2-Way Set-Associative Cache
• Fully Associative Cache
• Replacement Policy
• Block Size and Spatial Locality
• CPU-Cache Interaction (5-stage pipeline)
• Improving Cache Performance
• Causes for Cache Misses
• Effect of Cache Parameters on Performance
• Write Policy Choices
• Write Performance
• Reducing Write Hit Time
• Pipelining Cache Writes
• CS152 Administrivia
• Write Buffer to Reduce Read Miss Penalty
• Serial-versus-Parallel Cache and Memory Access
• Block-level Optimizations
• Set-Associative RAM-Tag Cache
• Acknowledgements

Last time in Lecture 6
• Dynamic RAM (DRAM) is the main form of main-memory storage in use today
 – Holds values on small capacitors, which need refreshing (hence "dynamic")
 – Slow multi-step access: precharge, read row, read column
• Static RAM (SRAM) is faster but more expensive
 – Used to build on-chip memory for caches
• Caches exploit two forms of predictability in memory reference streams
 – Temporal locality: the same location is likely to be accessed again soon
 – Spatial locality: neighboring locations are likely to be accessed soon
• A cache holds a small set of values in fast memory (SRAM) close to the processor
 – Need a search scheme to find values in the cache, and a replacement policy to make space for newly accessed locations

Relative Memory Cell Sizes
[Figure: die photos comparing DRAM cells on a memory chip with on-chip SRAM cells in a logic chip. From Foss, "Implementing Application-Specific Memory", ISSCC 1996.]

Placement Policy
Where can memory block 12 (of memory blocks 0-31) be placed in an 8-block cache?
• Fully associative: anywhere (any of cache blocks 0-7)
• 2-way set-associative (4 sets): anywhere in set 0 (12 mod 4)
• Direct-mapped: only into cache block 4 (12 mod 8)

Direct-Mapped Cache
The CPU address is split into tag (t bits), index (k bits), and block offset (b bits). The index selects one of 2^k lines; the stored tag is compared against the address tag and, together with the valid (V) bit, determines HIT. On a hit, the block offset selects the data word or byte.

2-Way Set-Associative Cache
The index selects a set of two {valid, tag, data block} entries; both tags are compared in parallel, and HIT is asserted if either matches with its valid bit set. The matching way supplies the data word or byte.

Fully Associative Cache
There is no index field: the address tag (t bits) is compared against every line's stored tag simultaneously, and the matching valid line supplies the data word or byte via the block offset (b bits).

Replacement Policy
In an associative cache, which block from a set should be evicted when the set becomes full?
• Random
• Least Recently Used (LRU)
 – LRU cache state must be updated on every access
 – True implementation only feasible for small sets (2-way)
 – Pseudo-LRU binary tree often used for 4-8 way
• First In, First Out (FIFO), a.k.a. Round-Robin
 – Used in highly associative caches
• Not Least Recently Used (NLRU)
 – FIFO with an exception for the most recently used block or blocks
This is a second-order effect. Why? Replacement only happens on misses.

Block Size and Spatial Locality
The block is the unit of transfer between the cache and memory. The CPU address is split into a block address (32-b bits) and an offset (b bits), so 2^b = block size, a.k.a. line size (in bytes). The figure shows a 4-word block (Word0-Word3) with b = 2.
Larger block size has distinct hardware advantages:
• less tag overhead
• exploit fast burst transfers from DRAM
• exploit fast burst transfers over wide busses
What are the disadvantages of increasing block size?
• Fewer blocks for a given capacity => more conflicts
• Can waste bandwidth

CPU-Cache Interaction (5-stage pipeline)
[Figure: 5-stage pipeline datapath. The PC addresses a primary instruction cache feeding the IR; after decode and register fetch, the ALU output addresses a primary data cache in the memory stage. Cache refill data arrives from lower levels of the memory hierarchy, and the hit? signals stall the entire CPU on a data cache miss.]
What about an instruction miss, or writes to the instruction stream?

Improving Cache Performance
Average memory access time = Hit time + Miss rate x Miss penalty
To improve performance:
• reduce the hit time
• reduce the miss rate
• reduce the miss penalty
What is the simplest design strategy? Build the biggest cache that doesn't increase hit time past 1-2 cycles (approx. 8-32 KB in modern technology). [Design issues are more complex with out-of-order superscalar processors.]

Causes for Cache Misses
• Compulsory: first reference to a block, a.k.a. cold-start misses
 – misses that would occur even with an infinite cache
• Capacity: cache is too small to hold all data needed by the program
 – misses that would occur even under a perfect replacement policy
• Conflict: misses that occur because of collisions due to the block-placement strategy
 – misses that would not occur with full associativity

Effect of Cache Parameters on Performance
• Larger cache size
 + reduces capacity and conflict misses
 - hit time will increase
• Higher associativity
 + reduces conflict misses
 - may increase hit time
• Larger block size
 + reduces compulsory and capacity (reload) misses
 - increases conflict misses and miss penalty

Write Policy Choices
• Cache hit:
 – write through: write both cache & memory
  » generally higher traffic, but simplifies cache coherence
 – write back: write cache only (memory is written only when the entry is evicted)
  » a dirty bit per block can further reduce the traffic
• Cache miss:
 – no write allocate: only write to main memory
 – write allocate (a.k.a. fetch on write): fetch the block into the cache
• Common combinations:
 – write through and no write allocate
 – write back with write allocate

Write Performance
[Figure: the direct-mapped cache datapath as before, with a write enable (WE) added to the data RAM; the tag check must complete before the data write can be enabled.]

Reducing Write Hit Time
Problem: writes take two cycles in the memory stage: one cycle for the tag check plus one cycle for the data write if hit.
Solutions:
• Design a data RAM that can perform a read and a write in one cycle; restore the old value after a tag miss
• Fully associative (CAM tag) caches: the word line is only enabled on a hit
• Pipelined writes: hold the write data for a store in a single buffer ahead of the cache; write the cache during the next store's tag check
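The average-memory-access-time formula above lends itself to a quick back-of-the-envelope comparison. A minimal sketch in Python; the cache parameters below are illustrative numbers, not figures from the lecture:

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time in cycles:
    AMAT = hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty

# Hypothetical small, fast cache: 1-cycle hit, 5% miss rate,
# 20-cycle miss penalty.
small_fast = amat(hit_time=1, miss_rate=0.05, miss_penalty=20)    # 2.0 cycles

# A larger, more associative cache might cut the miss rate to 4%
# but lengthen the hit time to 2 cycles.
bigger_slower = amat(hit_time=2, miss_rate=0.04, miss_penalty=20)  # 2.8 cycles
```

With these (made-up) numbers, the larger cache loses despite its lower miss rate, which is the point of the "simplest design strategy" on the slide: grow the cache only as long as the hit time stays at 1-2 cycles.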
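The placement-policy example (memory block 12 in an 8-block cache) can be reproduced in a few lines. This sketch assumes the set-to-block layout in the slide's diagram, where consecutive cache blocks form a set; the function name and layout are illustrative:

```python
def placement(block_number, num_cache_blocks, num_ways):
    """Return the cache block indices that may hold a memory block.
    A block maps to set (block_number mod num_sets), where
    num_sets = num_cache_blocks / num_ways. Direct-mapped is
    num_ways == 1; fully associative is num_ways == num_cache_blocks."""
    num_sets = num_cache_blocks // num_ways
    target_set = block_number % num_sets
    # Consecutive cache blocks make up a set, as in the slide's diagram.
    return [target_set * num_ways + way for way in range(num_ways)]

print(placement(12, 8, 1))  # direct-mapped: [4]        (12 mod 8)
print(placement(12, 8, 2))  # 2-way: set 0 = blocks [0, 1]  (12 mod 4)
print(placement(12, 8, 8))  # fully associative: anywhere, [0..7]
```

The three calls reproduce the three columns of the Placement Policy slide: block 4 for direct-mapped, set 0 for 2-way set-associative, and any block for fully associative.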