Contents:
• CS152 Computer Architecture and Engineering, Lecture 21: Memory Systems (recap), Caches
• Recap: The Big Picture: Where Are We Now?
• Recap: Who Cares About the Memory Hierarchy?
• Recap: Memory Hierarchy: Why Does it Work? Locality!
• Recap: Static RAM Cell
• Recap: 1-Transistor Memory Cell (DRAM)
• Recap: Classical DRAM Organization (square)
• Recap: Traditional "asynchronous" DRAM Read Timing
• Recap: "Synchronous timing": SDRAM timing for Lab 6
• The Art of Memory System Design
• Impact of Memory Hierarchy on Algorithms
• Quicksort vs. Radix as the number of keys varies: Instructions
• Quicksort vs. Radix as the number of keys varies: Instructions & Time
• Quicksort vs. Radix as the number of keys varies: Cache misses
• Example: 1 KB Direct-Mapped Cache with 32 B Blocks
• Set-Associative Cache
• Disadvantage of Set-Associative Cache
• Example: Fully Associative
• Administrative Issues
• A Summary on Sources of Cache Misses
• Design options at constant cost
• Recap: Four Questions for Caches and Memory Hierarchy
• Q1: Where can a block be placed in the upper level?
• Q2: How is a block found if it is in the upper level?
• Q3: Which block should be replaced on a miss?
• Q4: What happens on a write?
• Write Buffer for Write-Through
• Write Buffer Saturation
• RAW Hazards from the Write Buffer!
• Write-Miss Policy: Write-Allocate versus Not-Allocate
• How Do You Design a Memory System?
• Impact on Cycle Time
• Improving Cache Performance: 3 general options
• Improving Cache Performance
• 3Cs Absolute Miss Rate (SPEC92)
• 2:1 Cache Rule
• 3Cs Relative Miss Rate
• 1. Reduce Misses via Larger Block Size
• 2. Reduce Misses via Higher Associativity
• Example: Avg. Memory Access Time vs. Miss Rate
• 3. Reducing Misses via a "Victim Cache"
• 4. Reducing Misses by Hardware Prefetching
• 5. Reducing Misses by Software Prefetching Data
• 6. Reducing Misses by Compiler Optimizations
• Improving Cache Performance (Continued)
• 0. Reducing Penalty: Faster DRAM / Interface
• 1. Reducing Penalty: Read Priority over Write on Miss
• Slide 48
• 2. Reduce Penalty: Early Restart and Critical Word First
• 3. Reduce Penalty: Non-blocking Caches
• What happens on a Cache miss?
• Value of Hit Under Miss for SPEC
• 4. Reduce Penalty: Second-Level Cache
• Reducing Misses: which apply to the L2 Cache?
• L2 cache block size & A.M.A.T.
• Slide 56
• Example: Harvard Architecture
• Summary #1/2
• Summary #2/2: The Cache Design Space

4/21/03, © UCB Spring 2003, CS152 / Kubiatowicz, Lec 21

CS152 Computer Architecture and Engineering
Lecture 21: Memory Systems (recap), Caches
April 21, 2003
John Kubiatowicz (www.cs.berkeley.edu/~kubitron)
lecture slides: http://inst.eecs.berkeley.edu/~cs152/

Recap: The Big Picture: Where Are We Now?
° The Five Classic Components of a Computer: Processor (Control + Datapath), Memory, Input, Output
° Today's topics:
  • Recap of last lecture
  • Simple caching techniques
  • Many ways to improve cache performance
  • Virtual memory?

Recap: Who Cares About the Memory Hierarchy?
° The processor-DRAM memory gap (latency): "Moore's Law" vs. "Less' Law?"
  • µProc performance grows 60%/yr. (2x / 1.5 yr)
  • DRAM performance grows 9%/yr. (2x / 10 yrs)
  • The processor-memory performance gap grows 50% / year
[Figure: relative processor vs. DRAM performance over time, 1980-2000]

Recap: Memory Hierarchy: Why Does it Work? Locality!
° Temporal locality (locality in time): keep the most recently accessed data items closer to the processor
° Spatial locality (locality in space): move blocks consisting of contiguous words to the upper levels
[Figure: upper-level and lower-level memory exchanging blocks X and Y with the processor; probability of reference plotted over the address space 0 to 2^n - 1]

Recap: Static RAM Cell
° 6-transistor SRAM cell, with a word (row select) line and complementary bit / bit lines; in some designs the load transistors are replaced with pull-ups to save area
° Write:
  1. Drive the bit lines (bit = 1, bit = 0)
  2. Select the row
° Read:
  1. Precharge bit and bit to Vdd or Vdd/2 => make sure they are equal!
  2. Select the row
  3. The cell pulls one line low
  4. A sense amp on the column detects the difference between bit and bit

Recap: 1-Transistor Memory Cell (DRAM)
° Write:
  1. Drive the bit line
  2. Select the row
° Read:
  1. Precharge the bit line to Vdd/2
  2. Select the row
  3. The cell and the bit line share charge
     - Very small voltage change on the bit line
  4. Sense (fancy sense amp)
     - Can detect changes of ~1 million electrons
  5. Write: restore the value
° Refresh:
  1. Just do a dummy read to every cell
[Figure: cell with row select line, bit line, and trench capacitor]

Recap: Classical DRAM Organization (square)
° Row and column addresses together select 1 bit at a time
° The act of reading refreshes one complete row
  • Sense amps detect slight variations from Vdd/2 and amplify them
[Figure: RAM cell array with row decoder, sense amps, column selector & I/O; each intersection of a word (row) select line and a bit (data) line is a 1-T DRAM cell]

Recap: Traditional "asynchronous" DRAM Read Timing
° Every DRAM access begins with the assertion of RAS_L
° 2 ways to read, early or late OE_L relative to CAS_L:
  • Early read cycle: OE_L asserted before CAS_L
  • Late read cycle: OE_L asserted after CAS_L
[Timing diagram: 256K x 8 DRAM with RAS_L, CAS_L, WE_L, OE_L control inputs; row address then column address multiplexed on a 9-bit address bus; data out transitions from high-Z to valid after the read access time / output-enable delay, within the DRAM read cycle time]

Recap: "Synchronous timing": SDRAM timing for Lab 6
° Micron 128 Mbit DRAM (using the 2 Meg x 16 bit x 4 bank version)
  • Row (12 bits), bank (2 bits), column (9 bits)
[Timing diagram: RAS to a new bank, CAS, burst READ with CAS latency, then ending the RAS]

The Art of Memory System Design
° Workload or benchmark programs drive a memory reference stream <op, addr>, <op, addr>, <op, addr>, ... from the processor through the cache ($) to memory
  • op: i-fetch, read, write
° Goal: optimize the memory system organization to minimize the average memory access time for typical workloads

Impact of Memory Hierarchy on Algorithms
° Today CPU time is a function of (ops, cache misses)
° What does this mean for compilers, data structures, and algorithms?
  • Quicksort: fastest comparison-based sorting algorithm when keys fit in memory
  • Radix sort: also called "linear time" sort; for keys of fixed length and fixed radix, a constant number of passes over the data is sufficient, independent of the number of keys
° "The Influence of Caches on the Performance of
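The design goal stated above, minimizing the average memory access time (AMAT), is conventionally computed as hit time + miss rate x miss penalty. A minimal sketch; the cycle counts and miss rate below are illustrative assumptions, not figures from the lecture:

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average Memory Access Time, in cycles: every access pays the
    hit time, and a miss additionally pays the miss penalty."""
    return hit_time + miss_rate * miss_penalty

# Illustrative assumption: 1-cycle cache hit, 5% miss rate, 50-cycle penalty.
print(amat(1, 0.05, 50))  # 3.5 cycles per access on average
```

The formula makes the design trade-offs explicit: a larger or more associative cache lowers the miss rate but may raise the hit time, while a second-level cache attacks the miss penalty instead.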
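The algorithmic impact of the hierarchy can be made concrete with a toy direct-mapped cache model. The geometry (1 KB cache, 32 B blocks) echoes the direct-mapped example named in the outline; the row-major vs. column-major traversal experiment is our own illustration, not from the slides:

```python
# Toy direct-mapped cache: count misses for a stream of byte addresses.
# Assumed geometry: 1 KB cache with 32 B blocks => 32 lines.
BLOCK_BYTES = 32
NUM_LINES = 1024 // BLOCK_BYTES   # 32 direct-mapped lines

def count_misses(addresses):
    tags = [None] * NUM_LINES     # one tag per line; None = invalid
    misses = 0
    for addr in addresses:
        block = addr // BLOCK_BYTES
        index = block % NUM_LINES   # line selected by the block address
        tag = block // NUM_LINES
        if tags[index] != tag:      # miss: fill the line
            tags[index] = tag
            misses += 1
    return misses

# A 64x64 array of 4-byte words stored row-major (16 KB, 16x the cache).
N = 64
row_major = [(r * N + c) * 4 for r in range(N) for c in range(N)]
col_major = [(r * N + c) * 4 for c in range(N) for r in range(N)]

print(count_misses(row_major))  # 512: one miss per 32 B block (spatial locality)
print(count_misses(col_major))  # 4096: every access misses; the 256 B stride defeats the cache
```

Same operation count, an 8x difference in misses: exactly the "CPU time is a function of (ops, cache misses)" point, and the effect behind the quicksort-vs-radix comparison.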