CS252 Graduate Computer Architecture
Lecture 2: Review of Instruction Sets, Pipelines, and Caches
Prof. David Culler
Electrical Engineering and Computer Sciences
University of California, Berkeley
http://www.eecs.berkeley.edu/~culler/courses/cs252-s05
1/20/05, CS252-S05, Lec 2

Review, #1
• Technology is changing rapidly:

              Capacity         Speed
  Logic       2x in 3 years    2x in 3 years
  DRAM        4x in 3 years    2x in 10 years
  Disk        4x in 3 years    2x in 10 years
  Processor   (n.a.)           2x in 1.5 years

• What was true five years ago is not necessarily true now.
• Execution time is the REAL measure of computer performance!
  – Not clock rate, not CPI
• "X is n times faster than Y" means:

$$n = \frac{\text{ExTime}(Y)}{\text{ExTime}(X)} = \frac{\text{Performance}(X)}{\text{Performance}(Y)}$$

Amdahl's Law

$$\text{ExTime}_{new} = \text{ExTime}_{old} \times \left[ (1 - \text{Fraction}_{enhanced}) + \frac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}} \right]$$

$$\text{Speedup}_{overall} = \frac{\text{ExTime}_{old}}{\text{ExTime}_{new}} = \frac{1}{(1 - \text{Fraction}_{enhanced}) + \dfrac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}}}$$

Best you could ever hope to do:

$$\text{Speedup}_{maximum} = \frac{1}{1 - \text{Fraction}_{enhanced}}$$

Today: Quick review of everything you should have learned

Computer Performance

$$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$

What affects each factor:

                  Inst Count   CPI    Clock Rate
  Program             X
  Compiler            X        (X)
  Inst. Set           X         X
  Organization                  X         X
  Technology                              X

Cycles Per Instruction (Throughput)

"Average Cycles per Instruction":

$$\text{CPI} = \frac{\text{CPU Time} \times \text{Clock Rate}}{\text{Instruction Count}} = \frac{\text{Cycles}}{\text{Instruction Count}}$$

$$\text{CPU time} = \text{Cycle Time} \times \sum_{j=1}^{n} \text{CPI}_j \times I_j$$

"Instruction Frequency":

$$\text{CPI} = \sum_{j=1}^{n} \text{CPI}_j \times F_j \quad \text{where} \quad F_j = \frac{I_j}{\text{Instruction Count}}$$

Example: Calculating CPI bottom up

Base Machine (Reg / Reg), typical mix of instruction types in program:

  Op       Freq   Cycles   CPI(i)   (% Time)
  ALU      50%    1        .5       (33%)
  Load     20%    2        .4       (27%)
  Store    10%    2        .2       (13%)
  Branch   20%    2        .4       (27%)
  Total                    1.5

Design guideline: Make the common case fast
MIPS 1% rule: only consider adding an instruction if it is shown to add 1% performance improvement on reasonable benchmarks.
Run benchmark and collect workload characterization (simulate, machine counters, or sampling)

Example: Branch Stall Impact
• Assume CPI = 1.0 ignoring branches (ideal)
• Assume solution was stalling for 3 cycles
• If 30% branch, stall 3 cycles on 30%

  Op       Freq   Cycles   CPI(i)   (% Time)
  Other    70%    1        .7       (37%)
  Branch   30%    4        1.2      (63%)

⇒ new CPI = 1.9
• New machine is 1/1.9 ≈ 0.53 times as fast (i.e. slower!)

SPEC: System Performance Evaluation Cooperative
• First Round 1989
  – 10 programs yielding a single number ("SPECmarks")
• Second Round 1992
  – SPECInt92 (6 integer programs) and SPECfp92 (14 floating point programs)
    » Compiler flags unlimited. March 93 of DEC 4000 Model 610:
      spice: unix.c:/def=(sysv,has_bcopy,"bcopy(a,b,c)=memcpy(b,a,c)"
      wave5: /ali=(all,dcom=nat)/ag=a/ur=4/ur=200
      nasa7: /norecu/ag=a/ur=4/ur2=200/lc=blas
• Third Round 1995
  – new set of programs: SPECint95 (8 integer programs) and SPECfp95 (10 floating point)
  – "benchmarks useful for 3 years"
  – Single flag setting for all programs: SPECint_base95, SPECfp_base95
• Fourth Round 2000: 26 apps
  – analysis and simulation programs
  – Compression: bzip2, gzip
  – Integrated circuit layout, ray tracing, lots of others

SPEC First Round
• One program: 99% of time in single line of code
• New front-end compiler could improve dramatically
[Figure: bar chart, vertical axis 0 to 800, one bar per benchmark: gcc, espresso, spice, doduc, nasa7, li, eqntott, matrix300, fpppp, tomcatv]

Integrated Circuits Costs

$$\text{IC cost} = \frac{\text{Die cost} + \text{Testing cost} + \text{Packaging cost}}{\text{Final test yield}}$$

$$\text{Die cost} = \frac{\text{Wafer cost}}{\text{Dies per wafer} \times \text{Die yield}}$$

$$\text{Dies per wafer} = \frac{\pi \times (\text{Wafer\_diam}/2)^2}{\text{Die\_Area}} - \frac{\pi \times \text{Wafer\_diam}}{\sqrt{2 \times \text{Die\_Area}}} - \text{Test\_Die}$$

$$\text{Die Yield} = \text{Wafer\_yield} \times \left( 1 + \frac{\text{Defect\_Density} \times \text{Die\_area}}{\alpha} \right)^{-\alpha}$$

Die cost goes roughly with (die area)^4

A "Typical" RISC
• 32-bit fixed format instruction (3 formats)
• 32 32-bit GPR (R0 contains zero, DP take pair)
• 3-address, reg-reg arithmetic instruction
• Single address mode for load/store: base + displacement
  – no indirection
• Simple branch conditions
• Delayed branch
see: SPARC, MIPS, HP PA-Risc, DEC Alpha, IBM PowerPC, CDC 6600, CDC 7600, Cray-1, Cray-2, Cray-3

Example: MIPS (≈ DLX)

Instruction formats (bit positions, 31 is the most significant bit):

  Register-Register    Op [31:26]  Rs1 [25:21]  Rs2 [20:16]      Rd [15:11]  (unlabeled) [10:6]  Opx [5:0]
  Register-Immediate   Op [31:26]  Rs1 [25:21]  Rd [20:16]       immediate [15:0]
  Branch               Op [31:26]  Rs1 [25:21]  Rs2/Opx [20:16]  immediate [15:0]
  Jump / Call          Op [31:26]  target [25:0]

Datapath vs Control
• Datapath: Storage, FU, interconnect sufficient to perform the desired functions
  – Inputs are Control Points
  – Outputs are signals
• Controller: State machine to orchestrate operation on the data path
  – Based on desired function and signals
[Figure: Controller and Datapath blocks; the Controller drives the Datapath through Control Points, and the Datapath returns signals to the Controller]

Approaching an ISA
• Instruction Set Architecture
  – Defines set of operations, instruction format, hardware supported data types, named storage, addressing modes, sequencing
• Meaning of each instruction is described by RTL on architected registers and memory
• Given technology constraints assemble adequate datapath
  – Architected storage mapped to actual storage
  – Function units to do all the required operations
  – Possible additional storage (e.g. MAR, MBR, …)
  – Interconnect to move information among regs and FUs
• Map each instruction to sequence of RTLs
• Collate sequences into symbolic controller state transition diagram (STD)
• Lower symbolic STD to control points
• Implement controller

5 Steps of DLX Datapath
Figure 3.1, Page 130
Stages: Instruction Fetch | Instr. Decode / Reg. Fetch | Execute / Addr. Calc | Memory Access | Write Back
[Figure: unpipelined datapath. Labeled elements: Next PC, Next SEQ PC, Adder (+4), Address, Memory, Inst, RS1, RS2, RD, Imm, Reg File, Sign Extend, Zero?, MUX, ALU, Data Memory, LMD, WB Data]
RTL for a register-register instruction:
  IR <= mem[PC]; PC <= PC + 4
  Reg[IRrd] <= Reg[IRrs] op(IRop) Reg[IRrt]

5 Steps of DLX Datapath
Figure 3.4, Page 134
Stages: Instruction Fetch | Instr. Decode / Reg. Fetch | Execute / Addr. Calc | Memory Access | Write Back
Pipeline registers: IF/ID, ID/EX, EX/MEM, MEM/WB
[Figure: pipelined datapath. Same elements as Figure 3.1, with Next SEQ PC and RD carried forward through the pipeline registers]
RTL for a register-register instruction, one line per stage:
  IR <= mem[PC]; PC <= PC + 4
  A <= Reg[IRrs]; B <= Reg[IRrt]
  rslt <= A op(IRop) B
  WB <= rslt
  Reg[IRrd] <= WB
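The per-stage RTL on the pipelined datapath slide can be turned into a small simulation. What follows is a minimal sketch and not the course's own model: it handles only register-register instructions, treats the MEM stage as a pass-through, does no hazard detection or forwarding, and assumes the register file is written before it is read within a cycle. The names `run_pipeline`, `OPS`, and the tuple encoding `(op, rd, rs, rt)` are made up for illustration.

```python
import operator

# Hypothetical opcode table; the slides only say "op" selected by IRop.
OPS = {
    "add": operator.add,
    "sub": operator.sub,
    "and": operator.and_,
    "or": operator.or_,
}

def run_pipeline(program, regs):
    """Clock a reg-reg-only 5-stage pipeline until it drains.

    program: list of (op, rd, rs, rt) tuples, one per instruction word.
    regs:    list of register values, updated in place.
    Returns the number of clock cycles taken.
    """
    pc = 0  # counts instruction words, so "PC + 4" becomes pc + 1 here
    if_id = id_ex = ex_mem = mem_wb = None  # pipeline registers (None = bubble)
    cycles = 0
    while pc < len(program) or any(
        latch is not None for latch in (if_id, id_ex, ex_mem, mem_wb)
    ):
        # Stages are evaluated back to front so that each one consumes the
        # value its input latch held at the end of the previous cycle.

        # Write Back: Reg[IRrd] <= WB
        if mem_wb is not None:
            rd, wb = mem_wb
            regs[rd] = wb

        # Memory Access: WB <= rslt (nothing to do for reg-reg instructions)
        mem_wb = ex_mem

        # Execute: rslt <= A op B
        ex_mem = None
        if id_ex is not None:
            op, rd, a, b = id_ex
            ex_mem = (rd, OPS[op](a, b))

        # Decode / Reg. Fetch: A <= Reg[IRrs]; B <= Reg[IRrt]
        id_ex = None
        if if_id is not None:
            op, rd, rs, rt = if_id
            id_ex = (op, rd, regs[rs], regs[rt])

        # Instruction Fetch: IR <= mem[PC]; PC <= PC + 4
        if_id = None
        if pc < len(program):
            if_id = program[pc]
            pc += 1

        cycles += 1
    return cycles

if __name__ == "__main__":
    registers = [0, 5, 7, 0, 0, 0, 0, 0]
    code = [("add", 3, 1, 2), ("sub", 4, 2, 1), ("or", 5, 1, 2)]
    print(run_pipeline(code, registers), registers)
```

Because there is no interlock in this sketch, the instructions in a program must not read a register that one of the two preceding instructions writes; a dependent instruction issued too early would read a stale value, which is the data hazard a real pipeline has to handle.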
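As a recap of the arithmetic on the review slides, here is a short sketch that recomputes the Amdahl's Law speedup and the two CPI examples. The function names and the dictionary layout are assumptions made for illustration; the instruction mixes are the ones tabulated on the "Calculating CPI bottom up" and "Branch Stall Impact" slides.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when fraction_enhanced of the old execution time
    is sped up by a factor of speedup_enhanced."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

def cpi_from_mix(mix):
    """CPI = sum over instruction classes of CPI_j * F_j.

    mix maps a class name to (frequency, cycles)."""
    return sum(freq * cycles for freq, cycles in mix.values())

def time_shares(mix):
    """Fraction of execution time spent in each instruction class."""
    total = cpi_from_mix(mix)
    return {op: freq * cycles / total for op, (freq, cycles) in mix.items()}

# Mix from the base machine (Reg / Reg) slide.
BASE_MIX = {
    "ALU": (0.5, 1),
    "Load": (0.2, 2),
    "Store": (0.1, 2),
    "Branch": (0.2, 2),
}

# Mix from the branch stall slide: ideal CPI of 1 plus a 3-cycle stall
# on the 30% of instructions that are branches.
STALL_MIX = {
    "Other": (0.7, 1),
    "Branch": (0.3, 4),
}

if __name__ == "__main__":
    print("base CPI:", cpi_from_mix(BASE_MIX))
    print("base time shares:", time_shares(BASE_MIX))
    print("stall CPI:", cpi_from_mix(STALL_MIX))
    print("relative speed with stalls:", 1.0 / cpi_from_mix(STALL_MIX))
    print("Amdahl, 40% enhanced 10x:", amdahl_speedup(0.4, 10))
```

Letting `speedup_enhanced` grow without bound makes `amdahl_speedup` approach 1 / (1 - fraction_enhanced), the "best you could ever hope to do" bound from the Amdahl's Law slide.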