Pipelining to Superscalar
Prof. Mikko H. Lipasti
University of Wisconsin-Madison
Lecture notes based on notes by John P. Shen
Updated by Mikko Lipasti

Forecast
– Limits of pipelining
– The case for superscalar
– Instruction-level parallel machines
– Superscalar pipeline organization
– Superscalar pipeline design

Limits of Pipelining
IBM RISC experience
– Control and data dependences add 15%
– Best-case CPI of 1.15, IPC of 0.87
– Deeper pipelines (higher frequency) magnify dependence penalties
This analysis assumes 100% cache hit rates
– Hit rates approach 100% for some programs
– Many important programs have much worse hit rates
– More on this later!

Processor Performance
– In the 1980s (decade of pipelining): CPI 5.0 => 1.15
– In the 1990s (decade of superscalar): CPI 1.15 => 0.5 (best case)
– In the 2000s (decade of multicore): marginal CPI improvement
Processor performance:
Time / Program = (Instructions / Program) x (Cycles / Instruction) x (Time / Cycle)
               = code size x CPI x cycle time

Amdahl's Law
h = fraction of time in serial code
f = fraction that is vectorizable
v = speedup for f
Overall speedup:
Speedup = 1 / ((1 - f) + f / v)
[Figure: execution profile, number of processors vs. time — serial fraction h runs on one processor, parallel fraction 1 - h on all N; the vectorizable fraction f shrinks to f / v]

Revisit Amdahl's Law
Sequential bottleneck: even if v is infinite, performance is limited by the nonvectorizable portion (1 - f):
lim (v -> infinity) 1 / ((1 - f) + f / v) = 1 / (1 - f)
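The speedup expression above is easy to check numerically. The sketch below is illustrative only (plain Python; the function name is mine, not from the notes): it evaluates Amdahl's Law for a vectorizable fraction f and vector speedup v, and confirms that speedup saturates at 1 / (1 - f) as v grows.

```python
def amdahl_speedup(f, v):
    """Amdahl's Law: fraction f of the work runs v times faster;
    the remaining (1 - f) runs at the original speed."""
    return 1.0 / ((1.0 - f) + f / v)

# With f = 0.8, even v -> infinity cannot push speedup past 1/(1 - f) = 5.
for v in (2, 10, 100, 1_000_000):
    print(v, round(amdahl_speedup(0.8, v), 3))
```

The loop shows diminishing returns: going from v = 100 to v = 1,000,000 barely moves the overall speedup, because the serial 20% dominates.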
Pipelined Performance Model
g = fraction of time pipeline is filled
1 - g = fraction of time pipeline is not filled (stalled)
Speedup = 1 / ((1 - g) + g / N), where N = pipeline depth

Tyranny of Amdahl's Law [Bob Colwell]
– When g is even slightly below 100%, a big performance hit will result
– Stalled cycles are the key adversary and must be minimized as much as possible

Motivation for Superscalar [Agerwala and Cocke]
[Figure: speedup vs. vectorizability f, with the typical range of f marked]
Speedup jumps from 3 to 4.3 for N = 6, f = 0.8, but with s = 2 instead of s = 1 (scalar)

Superscalar Proposal
Moderate the tyranny of Amdahl's Law
– Ease the sequential bottleneck
– More generally applicable
– Robust (less sensitive to f)
– Revised Amdahl's Law: Speedup = 1 / ((1 - f) / s + f / v)

Limits on Instruction-Level Parallelism (ILP)
– Weiss and Smith [1984]: 1.58
– Sohi and Vajapeyam [1987]: 1.81
– Tjaden and Flynn [1970]: 1.86 (Flynn's bottleneck)
– Tjaden and Flynn [1973]: 1.96
– Uht [1986]: 2.00
– Smith et al. [1989]: 2.00
– Jouppi and Wall [1988]: 2.40
– Johnson [1991]: 2.50
– Acosta et al. [1986]: 2.79
– Wedig [1982]: 3.00
– Butler et al. [1991]: 5.8
– Melvin and Patt [1991]: 6
– Wall [1991]: 7 (Jouppi disagreed)
– Kuck et al. [1972]: 8
– Riseman and Foster [1972]: 51 (no control dependences)
– Nicolau and Fisher [1984]: 90 (Fisher's optimism)

Superscalar Proposal
Go beyond the single-instruction pipeline and achieve IPC > 1
– Dispatch multiple instructions per cycle
– Provide a more generally applicable form of concurrency (not just vectors)
– Geared toward sequential code that is hard to parallelize otherwise
– Exploit fine-grained, or instruction-level, parallelism (ILP)

Classifying ILP Machines [Jouppi, DECWRL 1991]
Baseline scalar RISC
– Issue parallelism = IP = 1
– Operation latency = OP = 1
– Peak IPC = 1
[Figure: successive instructions moving through IF, DE, EX, WB — one issued per cycle, time measured in cycles of the baseline machine]

Superpipelined: cycle time = 1/m of baseline
– Issue parallelism = IP = 1 inst / minor cycle
– Operation latency = OP = m minor cycles
– Peak IPC = m instr / major cycle (m x speedup?)
[Figure: superpipelined timing diagram]

Superscalar:
– Issue parallelism = IP = n inst / cycle
– Operation latency = OP = 1 cycle
– Peak IPC = n instr / cycle (n x speedup?)
[Figure: superscalar timing diagram — n instructions enter IF, DE, EX, WB together each cycle]

VLIW: Very Long Instruction Word
– Issue parallelism = IP = n inst / cycle
– Operation latency = OP = 1 cycle
– Peak IPC = n instr / cycle = 1 VLIW / cycle
[Figure: VLIW timing diagram]

Superpipelined-Superscalar
– Issue parallelism = IP = n inst / minor cycle
– Operation latency = OP = m minor cycles
– Peak IPC = n x m instr / major cycle
[Figure: superpipelined-superscalar timing diagram]

Superscalar vs. Superpipelined
Roughly equivalent performance
– If n = m, then both have about the same IPC
– Parallelism exposed in space vs. time
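In Jouppi's taxonomy, peak throughput per base-machine (major) cycle reduces to a simple product: issue parallelism times minor cycles per major cycle, n x m. A minimal sketch (Python; the class name and fields are mine, not from the notes):

```python
from dataclasses import dataclass

@dataclass
class ILPMachine:
    name: str
    n: int  # issue parallelism: instructions issued per (minor) cycle
    m: int  # minor cycles per major cycle of the base machine

    def peak_ipc(self):
        """Peak instructions per major cycle of the base machine."""
        return self.n * self.m

machines = [
    ILPMachine("baseline scalar RISC", n=1, m=1),
    ILPMachine("superpipelined", n=1, m=3),
    ILPMachine("superscalar", n=3, m=1),
    ILPMachine("superpipelined-superscalar", n=3, m=3),
]
for mach in machines:
    print(f"{mach.name}: peak IPC = {mach.peak_ipc()}")
```

Note this is a peak figure only: the "m x speedup?" and "n x speedup?" question marks on the slides flag that dependences and stalls keep sustained IPC well below n x m in practice.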
[Figure: superscalar vs. superpipelined instruction timing, plotted against time in cycles of the base machine; key: IFetch, Dcode, Execute, Writeback]

Superpipelining: Result Latency
Superpipelining [Jouppi, 1989] essentially describes a pipelined execution stage
[Figure: Jouppi's base machine vs. an underpipelined machine vs. a superpipelined machine]
– Underpipelined machines cannot issue instructions as fast as they are executed
– Note: the key characteristic of superpipelined machines is that results are not available to the M-1 successive instructions

Superscalar Challenges
[Figure: superscalar pipeline organization — FETCH (I-cache, branch predictor), instruction buffer, DECODE, dispatch to integer / floating-point / media / memory function units, reorder buffer, store queue, D-cache, COMMIT]
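The result-latency point can be made concrete with a toy issue model: if a result takes M minor cycles to produce, an instruction issued in minor cycle t cannot forward to instructions issued in cycles t+1 through t+M-1, so dependent instructions stall. A simplified sketch (Python; the function and its one-instruction-per-minor-cycle, in-order issue model are my assumptions, not from the notes):

```python
def issue_schedule(deps, m):
    """Return the minor cycle in which each instruction issues.

    deps[i] is the index of the producer instruction whose result
    instruction i consumes, or None if it is independent. Results
    become available m minor cycles after issue; at most one
    instruction issues per minor cycle, in program order.
    """
    issue = []
    cycle = 0
    for dep in deps:
        if dep is not None:
            # Stall until the producer's result is ready.
            cycle = max(cycle, issue[dep] + m)
        issue.append(cycle)
        cycle += 1
    return issue

# On a superpipelined machine (m = 3), a dependent chain issues one
# instruction every m minor cycles; independent code issues every cycle.
print(issue_schedule([None, 0, 1], m=3))        # -> [0, 3, 6]
print(issue_schedule([None, None, None], m=3))  # -> [0, 1, 2]
```

This is exactly the contrast the slide draws: superpipelining raises peak throughput, but dependent instructions see the full m-cycle operation latency.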