CS152 Computer Architecture and Engineering Lecture 19 Locality and Memory TechnologyReview: Explicit Renaming/Limits to ILPRecall: Upper Limit to ILP: Ideal MachineRecall: More Realistic HW: Branch ImpactThe Big Picture: Where are We Now?Recall: Who Cares About the Memory Hierarchy?Recall: Memory Hierarchy of a Modern Computer SystemImpact of Memory Hierarchy on AlgorithmsQuicksort vs. Radix as vary number keys: InstructionsQuicksort vs. Radix as vary number keys: Instrs & TimeQuicksort vs. Radix as vary number keys: Cache missesMain Memory BackgroundRandom Access Memory (RAM) TechnologyStatic RAM CellTypical SRAM Organization: 16-word x 4-bitLogic Diagram of a Typical SRAMTypical SRAM TimingProblems with SRAMAdministrative IssuesMain Memory Deep Background1-Transistor Memory Cell (DRAM)DRAM Capacitors: more capacitance in a small areaClassical DRAM Organization (square)DRAM logical organization (4 Mbit)DRAM physical organization (4 Mbit)Logic Diagram of a Typical (Asynchronous) DRAMDRAM Read TimingDRAM Write TimingKey DRAM Timing ParametersDRAM PerformanceSomething new: Structure of Tunneling Magnetic JunctionMain Memory PerformanceSlide 33Increasing Bandwidth - InterleavingSlide 35Independent Memory BanksFewer DRAMs/System over TimeFast Page Mode OperationSlide 39What does “Syncronous” RAM mean?Example: SDRAM timing for Lab6DRAMs over TimeDRAM HistoryDRAM Design GoalsToday’s Situation: DRAMSlide 46The Art of Memory System DesignA Summary on Sources of Cache MissesExample: 1 KB Direct Mapped Cache with 32 B BlocksSet Associative CacheDisadvantage of Set Associative CacheExample: Fully AssociativeSummaryCS152Computer Architecture and EngineeringLecture 19Locality and Memory TechnologyApril 12, 2004John Kubiatowicz (www.cs.berkeley.edu/~kubitron)lecture slides: http://inst.eecs.berkeley.edu/~cs152/4/12/04 ©UCB Spring 2004CS152 / Kubiatowicz Lec19.2Review: Explicit Renaming/Limits to ILP°Explicit Renaming: more physical registers than ISA. •Separates renaming from scheduling -Opens up lots of options for resolving RAW hazards•Rename table: tracks current association between architectural registers and physical registers•Potentially complicated rename table management°Multi-issue: simple matter of accounting•Must do dataflow analysis across multiple instructions simultaneously•Rename table updated as if instructions happened serially!°To sustain: need execution bandwidth+commit bandwidth•To sustain ILP of X need at least-X-way issue, > X execution bandwidth (for mix), X way commit °Limits to ILP•Inherent parallelism of applications as high as 150 IPC•Realistic limits rapidly reduce this to < 4 IPC for most applications4/12/04 ©UCB Spring 2004CS152 / Kubiatowicz Lec19.3ProgramsInstruction Issues per cycle020406080100120140160gcc espresso li fpppp doducd tomcatv54.862.617.975.2118.7150.1Integer: 18 - 60FP: 75 - 150IPCRecall: Upper Limit to ILP: Ideal Machine4/12/04 ©UCB Spring 2004CS152 / Kubiatowicz Lec19.4ProgramInstruction issues per cycle0102030405060gcc espresso li fpppp doducd tomcatv3541166158609121048156764613456674514452222941946PerfectSelective predictorStandard 2-bitStatic NoneChange from Infinite window to examine to 2000 and maximum issue of 64 instructions per clock cycleProfileBHT (512)Pick Cor. or BHTPerfect No predictionFP: 15 - 45Integer: 6 - 12IPCRecall: More Realistic HW: Branch Impact4/12/04 ©UCB Spring 2004CS152 / Kubiatowicz Lec19.5°The Five Classic Components of a Computer°Today’s Topics: •Recap last lecture•Locality and Memory Hierarchy•Administrivia•SRAM Memory Technology•DRAM Memory Technology•Memory OrganizationThe Big Picture: Where are We Now? ControlDatapathMemoryProcessorInputOutput4/12/04 ©UCB Spring 2004CS152 / Kubiatowicz Lec19.6µProc60%/yr.(2X/1.5yr)DRAM9%/yr.(2X/10 yrs)110100100019801981198319841985198619871988198919901991199219931994199519961997199819992000DRAMCPU1982Processor-MemoryPerformance Gap:(grows 50% / year)PerformanceTime“Moore’s Law”Processor-DRAM Memory Gap (latency)Recall: Who Cares About the Memory Hierarchy?“Less’ Law?”4/12/04 ©UCB Spring 2004CS152 / Kubiatowicz Lec19.7Recall: Memory Hierarchy of a Modern Computer System°By taking advantage of the principle of locality:•Present the user with as much memory as is available in the cheapest technology.•Provide access at the speed offered by the fastest technology.ControlDatapathSecondaryStorage(Disk)ProcessorRegistersMainMemory(DRAM)SecondLevelCache(SRAM)On-ChipCache1s10,000,000s (10s ms)Speed (ns): 10s 100s100s GsSize (bytes): Ks MsTertiaryStorage(Tape)10,000,000,000s (10s sec)Ts4/12/04 ©UCB Spring 2004CS152 / Kubiatowicz Lec19.8Impact of Memory Hierarchy on Algorithms°Today CPU time is a function of (ops, cache misses)°What does this mean to Compilers, Data structures, Algorithms?•Quicksort: fastest comparison based sorting algorithm when keys fit in memory•Radix sort: also called “linear time” sortFor keys of fixed length and fixed radix a constant number of passes over the data is sufficient independent of the number of keys°“The Influence of Caches on the Performance of Sorting” by A. LaMarca and R.E. Ladner. Proceedings of the Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, January, 1997, 370-379.•For Alphastation 250, 32 byte blocks, direct mapped L2 2MB cache, 8 byte keys, from 4000 to 40000004/12/04 ©UCB Spring 2004CS152 / Kubiatowicz Lec19.9Quicksort vs. Radix as vary number keys: InstructionsJob size in keysInstructions/keyRadix sortQuicksort4/12/04 ©UCB Spring 2004CS152 / Kubiatowicz Lec19.10Quicksort vs. Radix as vary number keys: Instrs & TimeTimeJob size in keysInstructionsRadix sortQuicksort4/12/04 ©UCB Spring 2004CS152 / Kubiatowicz Lec19.11Quicksort vs. Radix as vary number keys: Cache missesCache missesJob size in keysRadix sortQuicksortWhat is proper approach to fast algorithms?4/12/04 ©UCB Spring 2004CS152 / Kubiatowicz Lec19.12°Performance of Main Memory: •Latency: Cache Miss Penalty-Access Time: time between request and word arrives-Cycle Time: time between requests•Bandwidth: I/O & Large Block Miss Penalty (L2)°Cache uses SRAM : Static Random Access Memory•No refresh (6 transistors/bit vs. 1 transistor)Size: DRAM/SRAM 4-8 Cost/Cycle time: SRAM/DRAM 8-16°Main Memory is DRAM : Dynamic Random Access Memory•Dynamic since needs to be refreshed periodically (8 ms)•Addresses divided into 2 halves (Memory as a 2D
View Full Document