CS252 Graduate Computer Architecture Lecture 2 Review of Instruction Sets, Pipelines, and CachesReview, #1Amdahl’s LawToday: Quick review of everything you should have learnedComputer PerformanceCycles Per Instruction (Throughput)Example: Calculating CPI bottom upExample: Branch Stall ImpactSPEC: System Performance Evaluation CooperativeSPEC First RoundIntegrated Circuits CostsA "Typical" RISCExample: MIPS ( DLX)Datapath vs ControlApproaching an ISA5 Steps of DLX Datapath Figure 3.1, Page 1305 Steps of DLX Datapath Figure 3.4, Page 134Inst. Set Processor ControllerSlide 19Visualizing Pipelining Figure 3.3, Page 133CS 252 AdministriviaPipelining is not quite that easy!One Memory Port/Structural Hazards Figure 3.6, Page 142One Memory Port/Structural Hazards Figure 3.7, Page 143Speed Up Equation for PipeliningExample: Dual-port vs. Single-portData Hazard on R1 Figure 3.9, page 147Three Generic Data HazardsSlide 29Slide 30Forwarding to Avoid Data Hazard Figure 3.10, Page 149HW Change for Forwarding Figure 3.20, Page 161Data Hazard Even with Forwarding Figure 3.12, Page 153Data Hazard Even with Forwarding Figure 3.13, Page 154Software Scheduling to Avoid Load HazardsControl Hazard on Branches Three Stage StallBranch Stall ImpactPipelined DLX Datapath Figure 3.22, page 163Four Branch Hazard AlternativesSlide 40Delayed BranchEvaluating Branch AlternativesNow, Review of Memory HierarchyRecap: Who Cares About the Memory Hierarchy?Levels of the Memory HierarchyThe Principle of LocalityMemory Hierarchy: TerminologyCache MeasuresSimplest Cache: Direct Mapped1 KB Direct Mapped Cache, 32B blocksTwo-way Set Associative CacheDisadvantage of Set Associative Cache4 Questions for Memory HierarchyQ1: Where can a block be placed in the upper level?Q2: How is a block found if it is in the upper level?Q3: Which block should be replaced on a miss?Q4: What happens on a write?Write Buffer for Write ThroughImpact of Memory Hierarchy on AlgorithmsQuicksort vs. Radix as vary number keys: InstructionsQuicksort vs. Radix as vary number keys: Instrs & TimeQuicksort vs. Radix as vary number keys: Cache missesA Modern Memory HierarchyWhat is virtual memory?Three Advantages of Virtual MemoryIssues in Virtual Memory System DesignLarge Address SpacesTranslation Look-Aside BuffersOverlapped Cache & TLB AccessProblems With Overlapped TLB AccessSummary #1/5: Control and PipeliningSummary #2/5: CachesSummary #3/5: The Cache Design SpaceSummary #4/5: TLB, Virtual MemorySummary #5/5: Memory Hierachy1/20/05CS252-S05 Lec21Prof. David CullerElectrical Engineering and Computer SciencesUniversity of California, Berkeleyhttp://www.eecs.berkeley.edu/~culler/courses/cs252-s05CS252Graduate Computer ArchitectureLecture 2 Review of Instruction Sets, Pipelines, and Caches1/20/05CS252-S05 Lec22Review, #1•Technology is changing rapidly:Capacity SpeedLogic 2x in 3 years 2x in 3 yearsDRAM 4x in 3 years 2x in 10 yearsDisk 4x in 3 years 2x in 10 years Processor ( n.a.) 2x in 1.5 years•What was true five years ago is not necessarily true now.•Execution time is the REAL measure of computer performance!–Not clock rate, not CPI•“X is n times faster than Y” means:e(Y)Performance(X)Performanc ExTime(X)ExTime(y)1/20/05CS252-S05 Lec23Amdahl’s Law enhancedenhancedenhancednewoldoverallSpeedupFraction Fraction 1 ExTimeExTime Speedup1Best you could ever hope to do: enhancedmaximumFraction - 11 Speedup enhancedenhancedenhancedoldnewSpeedupFractionFraction ExTime ExTime 11/20/05CS252-S05 Lec24Today: Quick review of everything you should have learned1/20/05CS252-S05 Lec25Computer Performance CPU time = Seconds = Instructions x Cycles x Seconds Program Program Instruction CycleCPU time = Seconds = Instructions x Cycles x Seconds Program Program Instruction Cycle Inst Count CPI Clock RateProgram XCompiler X (X)Inst. Set. X XOrganization X XTechnology Xinst countCPICycle time1/20/05CS252-S05 Lec26Cycles Per Instruction (Throughput)“Instruction Frequency”CPI = (CPU Time * Clock Rate) / Instruction Count = Cycles / Instruction Count “Average Cycles per Instruction”jnjjI CPI TimeCycle time CPU 1Count nInstructioI F where F CPI CPIjjnjjj11/20/05CS252-S05 Lec27Example: Calculating CPI bottom upTypical Mix of instruction typesin programBase Machine (Reg / Reg)Op Freq Cycles CPI(i) (% Time)ALU 50% 1 .5 (33%)Load 20% 2 .4 (27%)Store 10% 2 .2 (13%)Branch 20% 2 .4 (27%) 1.5Design guideline: Make the common case fastMIPS 1% rule: only consider adding an instruction of it is shown to add 1% performance improvement on reasonable benchmarks.Run benchmark and collect workload characterization (simulate, machine counters, or sampling)1/20/05CS252-S05 Lec28Example: Branch Stall Impact•Assume CPI = 1.0 ignoring branches (ideal)•Assume solution was stalling for 3 cycles•If 30% branch, Stall 3 cycles on 30% Op Freq Cycles CPI(i) (% Time)Other 70% 1 .7 (37%)Branch 30% 4 1.2 (63%) new CPI = 1.9•New machine is 1/1.9 = 0.52 times faster (i.e. slow!)1/20/05CS252-S05 Lec29SPEC: System Performance Evaluation Cooperative•First Round 1989–10 programs yielding a single number (“SPECmarks”)•Second Round 1992–SPECInt92 (6 integer programs) and SPECfp92 (14 floating point programs)»Compiler Flags unlimited. March 93 of DEC 4000 Model 610:spice: unix.c:/def=(sysv,has_bcopy,”bcopy(a,b,c)=memcpy(b,a,c)”wave5: /ali=(all,dcom=nat)/ag=a/ur=4/ur=200nasa7: /norecu/ag=a/ur=4/ur2=200/lc=blas•Third Round 1995–new set of programs: SPECint95 (8 integer programs) and SPECfp95 (10 floating point) –“benchmarks useful for 3 years”–Single flag setting for all programs: SPECint_base95, SPECfp_base95 •Fourth Round 2000: 26 apps–analysis and simulation programs–Compression: bzip2, gzip, –Integrated circuit layout, ray tracing, lots of others1/20/05CS252-S05 Lec210SPEC First Round•One program: 99% of time in single line of code•New front-end compiler could improve dramaticallyBenchmarkSPEC Perf0100200300400500600700800gccepressospicedoducnasa7lieqntottmatrix300fpppptomcatv1/20/05CS252-S05 Lec211 Integrated Circuits CostsDie Cost goes roughly with die area4 Test_Die Die_Area 2Wafer_diam Die_Area2m/2)(Wafer_dia wafer per Dies
View Full Document