CS152 Computer Architecture and Engineering
Lecture 19: Locality and Memory Technology
April 12, 2004
John Kubiatowicz (www.cs.berkeley.edu/~kubitron)
Lecture slides: http://inst.eecs.berkeley.edu/~cs152/

4/12/04, UCB Spring 2004, CS152 / Kubiatowicz

Review: Explicit Renaming, Limits to ILP
  Explicit renaming: more physical registers than the ISA
    Separates renaming from scheduling
    Opens up lots of options for resolving RAW hazards
    Rename table: tracks current association between architectural registers and physical registers
    Potentially complicated rename table management
  Multi-issue: simple matter of accounting
    Must do dataflow analysis across multiple instructions simultaneously
    Rename table updated as if instructions happened serially
  To sustain ILP, need execution bandwidth and commit bandwidth
    To sustain an ILP of X, need at least X-way issue, X execution bandwidth for the mix, and X-way commit
  Limits to ILP
    Inherent parallelism of applications as high as 150 IPC
    Realistic limits rapidly reduce this to 4 IPC for most applications

Recall: Upper Limit to ILP: Ideal Machine
  [Figure: bar chart of instruction issues per cycle (IPC) by program (gcc, espresso, li, fpppp, doducd, tomcatv). Annotations: Integer 18 - 60, FP 75 - 150.]

Recall: More Realistic HW: Branch Impact
  Change from an infinite window to a window of 2000 instructions to examine and a maximum issue of 64 instructions per clock cycle
  [Figure: bar chart of instruction issues per cycle (IPC) by program (gcc, espresso, li, fpppp, doducd, tomcatv), one bar per predictor: Perfect; Selective predictor (pick correlating or BHT); Standard 2-bit BHT (512); Static (profile); None (no prediction). Annotations: Integer 6 - 12, FP 15 - 45.]

The Big Picture: Where are We Now?
  The Five Classic Components of a Computer
    Processor (Control, Datapath), Memory, Input, Output
  Today's Topics:
    Recap last lecture
    Locality and Memory Hierarchy
    Administrivia
    SRAM Memory Technology
    DRAM Memory Technology
    Memory Organization

Recall: Who Cares About the Memory Hierarchy?
  Processor-DRAM Memory Gap (latency)
  [Figure: performance (log scale, 1 to 1000) versus time (1980 to 2000).
    CPU ("Moore's Law"): 60%/yr (2X / 1.5 yr)
    DRAM ("Less' Law?"): 9%/yr (2X / 10 yrs)
    Processor-Memory Performance Gap: grows 50% / year]

Recall: Memory Hierarchy of a Modern Computer System
  By taking advantage of the principle of locality:
    Present the user with as much memory as is available in the cheapest technology
    Provide access at the speed offered by the fastest technology
  [Figure: Processor (Control, Datapath, Registers, On-Chip Cache), Second Level Cache (SRAM), Main Memory (DRAM), Secondary Storage (Disk), Tertiary Storage (Tape).
    Speed (ns), from the processor outward: 1s, 10s, 100s, 10,000,000s (10s ms), 10,000,000,000s (10s sec)
    Size (bytes), from the processor outward: 100s, Ks, Ms, Gs, Ts]

Impact of Memory Hierarchy on Algorithms
  Today CPU time is a function of (ops, cache misses)
  What does this mean to compilers, data structures, algorithms?
    Quicksort: fastest comparison-based sorting algorithm when keys fit in memory
    Radix sort: also called "linear time" sort. For keys of fixed length and fixed radix, a constant number of passes over the data is sufficient, independent of the number of keys
  "The Influence of Caches on the Performance of Sorting" by A. LaMarca and R. E. Ladner. Proceedings of the Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, January 1997, 370-379.
    For an Alphastation 250: 32-byte blocks, direct-mapped 2 MB L2 cache, 8-byte keys, from 4000 to 4000000 keys

Quicksort vs. Radix as vary number keys: Instructions
  [Figure: instructions per key versus job size in keys, for radix sort and quicksort.]

Quicksort vs. Radix as vary number keys: Instrs & Time
  [Figure: instructions and time versus job size in keys, for radix sort and quicksort.]

Quicksort vs. Radix as vary number keys: Cache misses
  [Figure: cache misses versus job size in keys, for radix sort and quicksort.]
  What is the proper approach to fast algorithms?

Main Memory Background
  Performance of main memory:
    Latency: cache miss penalty
      Access time: time between request and word arrives
      Cycle time: time between requests
    Bandwidth: I/O and large block miss penalty (L2)
  Cache uses SRAM: Static Random Access Memory
    No refresh (6 transistors/bit vs. 1 transistor)
    Size: DRAM/SRAM 4-8
    Cost/Cycle time: SRAM/DRAM 8-16
  Main memory is DRAM: Dynamic Random Access Memory
    Dynamic since it needs to be refreshed periodically (8 ms)
    Addresses divided into 2 halves (memory as a 2D matrix):
      RAS or Row Address Strobe
      CAS or Column Address Strobe

Random Access Memory (RAM) Technology
  Why do computer designers need to know about RAM technology?
    Processor performance is usually limited by memory bandwidth
    As IC densities increase, lots of memory will fit on the processor chip
    Tailor on-chip memory to specific needs: instruction cache, data cache, write buffer
  What makes RAM different from a bunch of flip-flops?
    Density: RAM is much denser

Static RAM Cell
  6-Transistor SRAM Cell
  [Figure: cell storing 0 and 1 on cross-coupled nodes, with a word (row select) line and complementary bit lines, bit and bit-bar. Diagram annotation: "replaced with pullup to save area".]
  Write:
    1. Drive bit lines (bit = 1, bit-bar = 0)
    2. Select row
  Read:
    1. Precharge bit and bit-bar to Vdd or Vdd/2 (make sure they are equal)
    2. Select row
    3. Cell pulls one line low
    4. Sense amp on column detects difference between bit and bit-bar

Typical SRAM Organization: 16-word x 4-bit
  [Figure: 16 x 4 array of SRAM cells. Din 3 to Din 0, with WrEn and Precharge, feed a write driver and precharger per column; an address decoder on A0 to A3 selects Word 0 through Word 15; each column ends in a sense amp driving Dout 3 to Dout 0.]
  Q: Which is longer: word line or bit line?

Logic Diagram of a Typical SRAM
  [Figure: block labeled "2^N words x M bit SRAM" with an N-bit address input A, control inputs WE_L and OE_L, and an M-bit data pin D.]
  Write Enable is usually active low (WE_L)
  Din and Dout are combined to save pins:
    A new control signal, output enable (OE_L), is needed
    WE_L is asserted (Low), OE_L is deasserted (High): D serves as the data input pin
    WE_L is deasserted (High), OE_L is asserted (Low): D is the data output pin
    Both WE_L and OE_L are asserted: result is unknown. Don't do that!
  Although could change …
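
The WE_L / OE_L rules on the last slide can be sketched as a small behavioral model. This is only an illustration under stated assumptions: the names `sram_data_pin_role` and `TinySram` are hypothetical, the "idle" case (neither signal asserted) is not covered by the slide, and raising an error stands in for the slide's "result is unknown".

```python
def sram_data_pin_role(we_l: int, oe_l: int) -> str:
    """Role of the shared D pin given the active-low WE_L and OE_L levels."""
    write_asserted = (we_l == 0)   # active low: 0 means asserted
    read_asserted = (oe_l == 0)
    if write_asserted and read_asserted:
        return "unknown"           # slide: "Result is unknown"
    if write_asserted:
        return "input"             # D serves as the data input pin
    if read_asserted:
        return "output"            # D is the data output pin
    return "idle"                  # assumption: case not described on the slide


class TinySram:
    """Hypothetical 2^N words x M bit SRAM with a combined data pin."""

    def __init__(self, n_addr_bits: int, m_data_bits: int):
        self.words = [0] * (2 ** n_addr_bits)
        self.mask = (1 << m_data_bits) - 1

    def access(self, addr: int, we_l: int, oe_l: int, d_in=None):
        role = sram_data_pin_role(we_l, oe_l)
        if role == "unknown":
            # Modeling choice: refuse the access instead of returning garbage.
            raise ValueError("WE_L and OE_L asserted together")
        if role == "input":
            if d_in is None:
                raise ValueError("write needs a value on D")
            self.words[addr] = d_in & self.mask
            return None
        if role == "output":
            return self.words[addr]
        return None                # idle: D is not driven in this model


sram = TinySram(n_addr_bits=4, m_data_bits=4)   # 16-word x 4-bit, as on the slide
sram.access(3, we_l=0, oe_l=1, d_in=0b1011)     # write word 3
value = sram.access(3, we_l=1, oe_l=0)          # read word 3 back
```

A real part's behavior when both signals are asserted depends on the device; the model only encodes the slide's advice not to do it.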

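
The claim that CPU time is a function of (ops, cache misses) can be illustrated with a minimal direct-mapped cache model. The cache parameters below are the ones quoted on the sorting slide (32-byte blocks, 2 MB direct-mapped cache, 8-byte keys); the function name `count_misses` and the two access patterns are assumptions made for illustration and are not the experiments from the LaMarca and Ladner paper.

```python
def count_misses(addresses, cache_bytes=2 * 1024 * 1024, block_bytes=32):
    """Count misses of a direct-mapped cache over a sequence of byte addresses."""
    num_lines = cache_bytes // block_bytes
    resident = [None] * num_lines          # block number held by each line
    misses = 0
    for addr in addresses:
        block = addr // block_bytes        # which memory block is touched
        index = block % num_lines          # the single line it may occupy
        if resident[index] != block:
            resident[index] = block        # evict whatever was there
            misses += 1
    return misses


KEY_BYTES = 8
NUM_ACCESSES = 4000

# Pattern 1: one sequential pass over 4000 adjacent 8-byte keys.
sequential = [i * KEY_BYTES for i in range(NUM_ACCESSES)]

# Pattern 2: the same number of accesses, alternating between two addresses
# exactly one cache size apart, so both map to the same cache line.
conflicting = [(i % 2) * 2 * 1024 * 1024 for i in range(NUM_ACCESSES)]

print(count_misses(sequential))    # 1000: four keys share each 32-byte block
print(count_misses(conflicting))   # 4000: every access evicts the other block
```

Both patterns perform the same number of memory operations, yet the second misses four times as often, which is the kind of effect an instruction count alone does not capture.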

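
The DRAM addressing note (addresses divided into 2 halves, memory as a 2D matrix) can be sketched as follows. The slide does not say which half is the row, so treating the upper half as the row address (sent with RAS) and the lower half as the column address (sent with CAS) is an assumption, as is the function name `split_dram_address`.

```python
def split_dram_address(addr: int, addr_bits: int):
    """Split an address into (row, column) halves for a square 2D array."""
    if addr_bits % 2 != 0:
        raise ValueError("this sketch assumes an even number of address bits")
    if not 0 <= addr < (1 << addr_bits):
        raise ValueError("address out of range")
    half = addr_bits // 2
    row = addr >> half                  # assumed: upper half, latched on RAS
    col = addr & ((1 << half) - 1)      # assumed: lower half, latched on CAS
    return row, col


row, col = split_dram_address(0b10110110, addr_bits=8)
print(row, col)   # 11 6, i.e. row 0b1011 and column 0b0110
```

Multiplexing the two halves over the same pins halves the number of address pins, at the cost of sending the address in two steps.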
Berkeley COMPSCI 152 - Locality and Memory Technology
