CS 152 Computer Architecture and Engineering
Lecture 6 - Memory

Krste Asanovic
Electrical Engineering and Computer Sciences
University of California at Berkeley
http://www.eecs.berkeley.edu/~krste
http://inst.eecs.berkeley.edu/~cs152

2/10/2009, CS152 Spring '09


Last time in Lecture 5

- Control hazards (branches, interrupts) are most difficult to handle as they change which instruction should be executed next
- Speculation commonly used to reduce effect of control hazards (predict sequential fetch, predict no exceptions)
- Branch delay slots make control hazard visible to software
- Precise exceptions: stop cleanly on one instruction, all previous instructions completed, no following instructions have changed architectural state
- To implement precise exceptions in pipeline, shift faulting instructions down pipeline to "commit" point, where exceptions are handled in program order


CPU-Memory Bottleneck

[Figure: CPU connected to Memory]

Performance of high-speed computers is usually limited by memory bandwidth and latency

- Latency (time for a single access)
    Memory access time >> Processor cycle time
- Bandwidth (number of accesses per unit time)
    if fraction m of instructions access memory,
    ⇒ 1+m memory references / instruction
    ⇒ CPI = 1 requires 1+m memory refs / cycle (assuming MIPS RISC ISA)


Core Memory

- Core memory was first large-scale reliable main memory
    invented by Forrester in late 40s/early 50s at MIT for Whirlwind project
- Bits stored as magnetization polarity on small ferrite cores threaded onto 2-dimensional grid of wires
- Coincident current pulses on X and Y wires would write cell and also sense original state (destructive reads)
- Robust, non-volatile storage
- Used on space shuttle computers until recently
- Cores threaded onto wires by hand (25 billion a year at peak production)
- Core access time ~ 1 µs

[Photo: DEC PDP-8/E board, 4K words x 12 bits (1968)]


Semiconductor Memory, DRAM

- Semiconductor memory began to be competitive in early 1970s
    Intel formed to exploit market for semiconductor memory
- First commercial DRAM was Intel 1103
    1Kbit of storage on single chip
    charge on a capacitor used to hold value
- Semiconductor memory quickly replaced core in '70s


One Transistor Dynamic RAM

[Figure: 1-T DRAM cell. Schematic labels: word line, bit line, access transistor, storage capacitor (FET gate, trench, stack), VREF. Cross-section labels: TiN top electrode (VREF), Ta2O5 dielectric, W bottom electrode, poly word line, access transistor.]


DRAM Architecture

[Figure: DRAM array. N row-address bits feed the Row Address Decoder, which drives the word lines (Row 1 ... Row 2^N); M column-address bits feed the Column Decoder and Sense Amplifiers on the bit lines (Col. 1 ... Col. 2^M); each memory cell holds one bit; data (D) leaves through the column decoder.]

- Bits stored in 2-dimensional arrays on chip
- Modern chips have around 4 logical banks on each chip
    each logical bank physically implemented as many smaller arrays


DRAM Packaging

[Figure: DRAM chip with clock and control signals (7), address lines carrying multiplexed row/column address (12), and data bus (4b, 8b, 16b, 32b)]

- DIMM (Dual Inline Memory Module) contains multiple chips with clock/control/address signals connected in parallel (sometimes need buffers to drive signals to all chips)
- Data pins work together to return wide word (e.g., 64-bit data bus using 16x4-bit parts)


DRAM Operation

Three steps in read/write access to a given bank

- Row access (RAS)
    decode row address, enable addressed row (often multiple Kb in row)
    bitlines share charge with storage cell
    small change in voltage detected by sense amplifiers, which latch whole row of bits
    sense amplifiers drive bitlines full rail to recharge storage cells
- Column access (CAS)
    decode column address to select small number of sense amplifier latches (4, 8, 16, or 32 bits depending on DRAM package)
    on read, send latched bits out to chip pins
    on write, change sense amplifier latches, which then charge storage cells to required value
    can perform multiple column accesses on same row without another row access (burst mode)
- Precharge
    charges bit lines to known value, required before next row access

Each step has a latency of around 15-20ns in modern DRAMs

Various DRAM standards (DDR, RDRAM) have different ways of encoding the signals for transmission to the DRAM, but all share same core architecture


Double-Data Rate (DDR2) DRAM

[Figure: timing diagram from Micron 256Mb DDR2 SDRAM datasheet. 200MHz clock; command sequence Row, Column, Precharge, Row; Data at 400Mb/s data rate.]


Processor-DRAM Gap (latency)

[Figure: "Processor-Memory Performance Gap". Performance (log scale, 1 to 1000) vs. time (1980 to 2000). µProc (CPU): 60%/year ("Moore's Law"). DRAM: 7%/year. Gap grows 50%/year.]

Four-issue 2GHz superscalar accessing 100ns DRAM could execute 800 instructions during time for one memory access!


Typical Memory Reference Patterns

[Figure: address vs. time. Instruction fetches: n loop iterations. Stack accesses: subroutine call, argument access, subroutine return. Data accesses: vector access, scalar accesses.]


Common Predictable Patterns

Two predictable properties of memory references:

- Temporal Locality: If a location is referenced, it is likely to be referenced again in the near future.
- Spatial Locality: If a location is referenced, it is likely that locations near it will be referenced in the near future.


Memory Reference Patterns

[Figure: memory address vs. time, one dot per access, with regions of temporal locality and spatial locality marked.]

Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971)


Multilevel Memory

Strategy: Reduce average latency using small, fast memories called caches.

Caches are a mechanism to reduce memory latency based on the empirical observation that the patterns of memory references made by a processor are often highly predictable:

                                PC
        ...                     96
  loop: ADD  r2, r1, r1         100
        SUBI r3, r3, #1         104
        BNEZ r3, loop           108
        ...                     112


Memory Hierarchy

[Figure: CPU, Small, Fast Memory (RF, SRAM), and Big, Slow Memory (DRAM) connected in a chain, with labels A and B; the small, fast memory holds frequently used data.]

- capacity: Register << SRAM << DRAM (why?)
- latency: Register << SRAM << DRAM (why?)
- bandwidth: on-chip >> off-chip (why?)

On a data access:
    hit (data ∈ fast memory) ⇒ low latency access
    miss (data ∉ fast memory) ⇒ long latency access (DRAM)


Relative Memory Cell Sizes

[Figure: DRAM on memory chip compared with on-chip SRAM in logic chip. Images missing from the extracted text.]

[Foss, "Implementing Application-Specific Memory", ISSCC 1996]


Management of Memory Hierarchy

Small, fast
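

Worked check of the slide arithmetic

The numbers quoted on the CPU-Memory Bottleneck, DRAM Operation, and Processor-DRAM Gap slides can be reproduced with a few lines of Python. This is only a rough sketch, not part of the lecture: the function and class names are hypothetical, m = 0.3 is an assumed example value, the bank is assumed to start precharged, and each DRAM step is assumed to take a flat 15ns (the low end of the 15-20ns range given above).

```python
"""Back-of-envelope checks for figures quoted on the slides (illustrative only)."""


def memory_refs_per_cycle(m: float, cpi: float = 1.0) -> float:
    """Memory references needed per cycle.

    Every instruction is fetched (1 reference) and a fraction m of
    instructions also access data memory, giving 1 + m references per
    instruction; dividing by CPI converts that to references per cycle.
    """
    return (1.0 + m) / cpi


def issue_slots_per_access(issue_width: int, clock_hz: float, latency_s: float) -> int:
    """Instruction issue slots that go by during one memory access."""
    return round(issue_width * clock_hz * latency_s)


class DramBank:
    """Toy model of one DRAM bank that keeps a single row latched.

    Assumptions (not from the slides): the bank starts precharged, and
    row access, column access, and precharge each take step_ns.
    """

    def __init__(self, step_ns: float = 15.0):
        self.step_ns = step_ns
        self.open_row = None  # row currently held in the sense amplifiers

    def access(self, row: int) -> float:
        """Return the latency in ns of one column access to `row`."""
        if row == self.open_row:
            # Same row is already latched: column access only (burst mode).
            return self.step_ns
        latency = 2 * self.step_ns  # row access + column access
        if self.open_row is not None:
            # A different row is open: precharge before the new row access.
            latency += self.step_ns
        self.open_row = row
        return latency


if __name__ == "__main__":
    # m = 0.3 is an assumed example value for the fraction of loads/stores.
    print("refs/cycle at CPI = 1:", memory_refs_per_cycle(0.3))
    # Four-issue, 2GHz, 100ns DRAM: the case quoted on the slide.
    print("issue slots per DRAM access:", issue_slots_per_access(4, 2e9, 100e-9))
    bank = DramBank()
    for row in (3, 3, 7):
        print("access to row", row, "takes", bank.access(row), "ns")
```

In this toy model a second access to the already-open row costs one step, while switching rows costs three (precharge, row access, column access), which is why consecutive accesses to the same row are much cheaper than accesses scattered across rows.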