Lecture 2: Review of Instruction Sets, Pipelines, and Caches
Review, #1
Review, #2
Today: Quick review of everything you should have learned ℵ₀ (A countably-infinite set of computer architecture concepts)
Integrated Circuits Costs
Real World Examples
Finite State Machines
Implementation as Combinational logic + Latch
Microprogrammed Controllers
Pipelining: It's Natural!
Sequential Laundry
Pipelined Laundry: Start work ASAP
Pipelining Lessons
Computer Pipelines
A "Typical" RISC
Example: MIPS (DLX)
5 Steps of DLX Datapath: Figure 3.1, Page 130
5 Steps of DLX Datapath: Figure 3.4, Page 137
Visualizing Pipelining: Figure 3.3, Page 133
It's Not That Easy for Computers
One Memory Port/Structural Hazards: Figure 3.6, Page 142
One Memory Port/Structural Hazards: Figure 3.7, Page 143
Speed Up Equation for Pipelining
Example: Dual-port vs. Single-port
Data Hazard on R1: Figure 3.9, Page 147
Three Generic Data Hazards
Slide 27
Slide 28
CS 252 Administrivia
Slide 30
Forwarding to Avoid Data Hazard: Figure 3.10, Page 149
HW Change for Forwarding: Figure 3.20, Page 161
Data Hazard Even with Forwarding: Figure 3.12, Page 153
Data Hazard Even with Forwarding: Figure 3.13, Page 154
Software Scheduling to Avoid Load Hazards
Control Hazard on Branches: Three Stage Stall
Branch Stall Impact
Pipelined DLX Datapath: Figure 3.22, Page 163
Four Branch Hazard Alternatives
Slide 40
Delayed Branch
Evaluating Branch Alternatives
Now, Review of Memory Hierarchy
Recap: Who Cares About the Memory Hierarchy?
Levels of the Memory Hierarchy
The Principle of Locality
Memory Hierarchy: Terminology
Cache Measures
Simplest Cache: Direct Mapped
1 KB Direct Mapped Cache, 32B blocks
Two-way Set Associative Cache
Disadvantage of Set Associative Cache
4 Questions for Memory Hierarchy
Q1: Where can a block be placed in the upper level?
Q2: How is a block found if it is in the upper level?
Q3: Which block should be replaced on a miss?
Q4: What happens on a write?
Write Buffer for Write Through
Impact of Memory Hierarchy on Algorithms
Quicksort vs. Radix as vary number keys: Instructions
Quicksort vs. Radix as vary number keys: Instrs & Time
Quicksort vs. Radix as vary number keys: Cache misses
5 minute Class Break
A Modern Memory Hierarchy
Basic Issues in VM System Design
Address Map
Paging Organization
Virtual Address and a Cache
TLBs
Translation Look-Aside Buffers
Reducing Translation Time
Overlapped Cache & TLB Access
Problems With Overlapped TLB Access
Summary #1/5: Control and Pipelining
Summary #2/5: Caches
Summary #3/5: The Cache Design Space
Summary #4/5: TLB, Virtual Memory
Summary #5/5: Memory Hierarchy

JDK.F98 Slide 1
Lecture 2: Review of Instruction Sets, Pipelines, and Caches
Prof. John Kubiatowicz
Computer Science 252
Fall 1998

JDK.F98 Slide 2
Review, #1
•Technology is changing rapidly:

               Capacity         Speed
    Logic      2x in 3 years    2x in 3 years
    DRAM       4x in 3 years    2x in 10 years
    Disk       4x in 3 years    2x in 10 years
    Processor  (n.a.)           2x in 1.5 years

•What was true five years ago is not necessarily true now.
•Execution time is the REAL measure of computer performance!
–Not clock rate, not CPI
•“X is n times faster than Y” means:

$$n = \frac{\text{ExTime}(Y)}{\text{ExTime}(X)} = \frac{\text{Performance}(X)}{\text{Performance}(Y)}$$

JDK.F98 Slide 3
Review, #2
•Amdahl’s Law: (or Law of Diminishing Returns)

$$\text{Speedup}_{\text{overall}} = \frac{\text{ExTime}_{\text{old}}}{\text{ExTime}_{\text{new}}} = \frac{1}{(1 - \text{Fraction}_{\text{enhanced}}) + \dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}}$$

•CPI Law:

$$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$

•The “End to End Argument” is what RISC was ultimately about -- it is the performance of the complete system that matters, not individual components!

JDK.F98 Slide 4
Today: Quick review of everything you should have learned
ℵ₀ (A countably-infinite set of computer architecture concepts)

JDK.F98 Slide 5
Integrated Circuits Costs

$$\text{IC cost} = \frac{\text{Die cost} + \text{Testing cost} + \text{Packaging cost}}{\text{Final test yield}}$$

$$\text{Die cost} = \frac{\text{Wafer cost}}{\text{Dies per wafer} \times \text{Die yield}}$$

$$\text{Dies per wafer} = \frac{\pi \, (\text{Wafer\_diam}/2)^2}{\text{Die\_Area}} - \frac{\pi \times \text{Wafer\_diam}}{\sqrt{2 \times \text{Die\_Area}}} - \text{Test\_Die}$$

$$\text{Die yield} = \text{Wafer\_yield} \times \left(1 + \frac{\text{Defect\_Density} \times \text{Die\_area}}{\alpha}\right)^{-\alpha}$$

Die cost goes roughly with $(\text{die area})^4$

JDK.F98 Slide 6
Real World Examples

    Chip         Metal   Line   Wafer   Defect  Area   Dies/  Yield  Die
                 layers  width  cost    /cm2    mm2    wafer         Cost
    386DX        2       0.90   $900    1.0     43     360    71%    $4
    486DX2       3       0.80   $1200   1.0     81     181    54%    $12
    PowerPC 601  4       0.80   $1700   1.3     121    115    28%    $53
    HP PA 7100   3       0.80   $1300   1.0     196    66     27%    $73
    DEC Alpha    3       0.70   $1500   1.2     234    53     19%    $149
    SuperSPARC   3       0.70   $1700   1.6     256    48     13%    $272
    Pentium      3       0.80   $1500   1.5     296    40     9%     $417

– From “Estimating IC Manufacturing Costs,” by Linley Gwennap, Microprocessor Report, August 2, 1993, p. 15

JDK.F98 Slide 7
Finite State Machines:
•System state is explicit in representation
•Transitions between states represented as arrows with inputs on arcs.
•Output may be either part of state or on arcs
[Figure: “Mod 3 Machine” state diagram with states Alpha/0, Beta/1, and Delta/2 and arcs labeled with input bits 0 and 1. Example: the input 106, fed MSB first, passes through the Mod 3 machine and yields 1.]

JDK.F98 Slide 8
Implementation as Combinational logic + Latch
[Figure: “Moore Machine” and “Mealy Machine” versions of the mod 3 machine. States Alpha/0, Beta/1, Delta/2; Mealy arcs labeled input/output (0/0, 1/0, 1/1, 0/1, 0/0, 1/1). Block diagram: Input and State_old feed Combinational Logic, which drives a Latch holding State_new. Truth table with columns Input, State_old, State_new, Div.]

JDK.F98 Slide 9
Microprogrammed Controllers
•State machine in which part of state is a “micro-pc”.
–Explicit circuitry for incrementing or changing PC
•Includes a ROM with “microinstructions”.
–Controlled logic implements at least branches and jumps
[Figure: ROM (Instructions) addressed by the PC (State w/ Address); a MUX selects the next address from either PC + 1 or the Branch field; the Instruction field drives the Combinational Logic/Controlled Machine. ROM contents:]

    Addr  Instruction         Branch
    0:    forw 35             xxx
    1:    b_no_obstacles      000
    2:    back 10             xxx
    3:    rotate 90           xxx
    4:    goto                001

JDK.F98 Slide 10
Pipelining: It's Natural!
•Laundry Example
•Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold
•Washer takes 30 minutes
•Dryer takes 40 minutes
•“Folder” takes 20 minutes
[Figure: four laundry loads labeled A, B, C, D]

JDK.F98 Slide 11
Sequential Laundry
•Sequential laundry takes 6 hours for 4 loads
•If they learned pipelining, how long would laundry take?
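The laundry timings can be checked with a short sketch. It uses the stage times from the slides (wash 30, dry 40, fold 20 minutes, 4 loads); the function names and the assumption that a finished load can wait between machines for as long as needed are illustrative choices, not part of the lecture.

```python
# Stage times in minutes, taken from the laundry slides.
STAGE_MINUTES = [30, 40, 20]  # wash, dry, fold
LOADS = 4                     # Ann, Brian, Cathy, Dave


def sequential_minutes(stage_minutes, loads):
    """Each load runs through every stage before the next load starts."""
    return loads * sum(stage_minutes)


def pipelined_minutes(stage_minutes, loads):
    """One machine per stage; a load starts a stage as soon as both the
    load and the machine are free (assumes unlimited waiting space
    between stages, which is a simplification made for this sketch)."""
    machine_free_at = [0] * len(stage_minutes)
    finish = 0
    for _ in range(loads):
        ready = 0  # time at which this load is ready for its next stage
        for i, duration in enumerate(stage_minutes):
            start = max(ready, machine_free_at[i])
            ready = start + duration
            machine_free_at[i] = ready
        finish = ready
    return finish


seq = sequential_minutes(STAGE_MINUTES, LOADS)
pipe = pipelined_minutes(STAGE_MINUTES, LOADS)
print(f"sequential: {seq / 60} hours")
print(f"pipelined:  {pipe / 60} hours")
print(f"speedup:    {seq / pipe:.2f}x")
```

Under these assumptions the pipelined total is one wash, then four back-to-back dryer runs, then one fold, because the 40-minute dryer is the slowest stage and sets the rate at which loads complete.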
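[Figure: sequential laundry timeline from 6 PM to midnight. Loads A, B, C, D are listed in task order; each takes 30 + 40 + 20 minutes, one load after another.]

JDK.F98 Slide 12
Pipelined Laundry: Start work ASAP
•Pipelined laundry takes 3.5 hours for 4 loads
[Figure: pipelined laundry timeline from 6 PM to midnight. Loads A, B, C, D are listed in task order; the time axis is divided into intervals of 30, 40, 40, 40, 40, and 20 minutes.]

JDK.F98 Slide 13
Pipelining Lessons
•Pipelining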