CS 378 Programming for Performance Single-Thread Performance: Review of PipeliningPipelining: It’s Natural!Sequential LaundryPipelined Laundry: Start work ASAPPipelining LessonsThe Five Stages of LoadKey Ideas Behind Instruction PipeliningPipelining the Load InstructionA More Extensive Pipelining ExampleSingle Cycle vs. Multiple Cycle vs. PipelinedBasics of PipeliningNotations for Describing PipelinesBasic TermsSpeedup & Throughput of a Linear PipelineData Hazard: SetupData Hazard: DefinitionWhy Data Hazards OccurData Dependence and HazardsMore on HazardsThe Precedence RelationExample of Precedence RelationData Hazard: Effect on CompilerData Hazard: Effect on PipeliningSolution: Interlocks and StallingOptimization: Value ForwardingExample of ForwardingForwarding & StallingCompile-Time SchedulingCode Generation Examples for BranchesControl HazardMore on Control HazardsDelayed Branches on DLXDetails of Various Branch FlavorsPipelining Multicycle OperationsSpace-Time Diagram: Multicycle OperationsFloating-Point Operations in DLXCS 378Programming for PerformanceSingle-Thread Performance: Review of PipeliningSiddhartha ChatterjeeSpring 2008Spring 2008 Siddhartha Chatterjee 2A B C DPipelining: It’s Natural!Laundry ExampleAnn, Brian, Cathy, Dave each have one load of clothes to wash, dry, and foldWasher takes 30 minutesDryer takes 40 minutes“Folder” takes 20 minutesSpring 2008 Siddhartha Chatterjee 3Sequential laundry takes 6 hours for 4 loadsIf they learned pipelining, how long would laundry take? ABCD30 40 20 30 40 20 30 40 20 30 40 206 PM7 8 91011MidnightTaskOrderTimeSequential LaundrySpring 2008 Siddhartha Chatterjee 4Pipelined laundry takes 3.5 hours for 4 loads ABCD6 PM7 8 91011MidnightTaskOrderTime30 40 40 40 40 20Pipelined Laundry: Start work ASAPSpring 2008 Siddhartha Chatterjee 5ABCD6 PM7 8 9TaskOrderTime30 40 40 40 40 20Pipelining LessonsPipelining doesn’t help latency of single task, it helps throughput of entire workloadPipeline rate limited by slowest pipeline stageMultiple tasks operating simultaneouslyPotential speedup = Number pipe stagesUnbalanced lengths of pipe stages reduces speedupTime to “fill” pipeline and time to “drain” it reduces speedupSpring 2008 Siddhartha Chatterjee 6Ifetch: Instruction FetchFetch the instruction from the Instruction MemoryReg/Dec: Registers Fetch and Instruction DecodeExec: Calculate the memory addressMem: Read the data from the Data MemoryWrB: Write the data back to the register fileCycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5Ifetch Reg/Dec Exec Mem WrBLoadThe Five Stages of LoadSpring 2008 Siddhartha Chatterjee 7The load instruction has 5 stages: Five independent functional units to work on each stage•Each functional unit is used only once!A second load can start doing Ifetch as soon as the first load finishes its Ifetch stageEach load still takes five cycles to complete•The latency of a single load is still 5 cyclesThe throughput is much higher•CPI approaches 1 •Cycle time is ~1/5th the cycle time of the single-cycle implementationInstructions start executing before previous instructions complete executionIfetch Reg/Dec Exec Mem WrBLoadKey Ideas Behind Instruction PipeliningCPI Cycle time Spring 2008 Siddhartha Chatterjee 8Pipelining the Load InstructionThe five independent pipeline stages are:Read next instruction: The Ifetch stageDecode instruction and fetch register values: The Reg/Dec stageExecute the operation: The Exec stageAccess data memory: The Mem stageWrite data to destination register: The WrB stageOne instruction enters the pipeline every cycleOne instruction comes out of the pipeline (completed) every cycleThe “effective” CPI is 7/3 (tends to 1); ~1/5 cycle timeClockCycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7Ifetch Reg/Dec Exec Mem WrB1st lwIfetch Reg/Dec Exec Mem WrB2nd lwIfetch Reg/Dec Exec Mem WrB3rd lwSpring 2008 Siddhartha Chatterjee 9End of Cycle 4: Load’s Mem, R-type’s Exec, Store’s Reg, Beq’s IfetchEnd of Cycle 5: Load’s WrB, R-type’s Mem, Store’s Exec, Beq’s RegEnd of Cycle 6: R-type’s WrB, Store’s Mem, Beq’s ExecEnd of Cycle 7: Store’s WrB, Beq’s MemClockCycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8Ifetch Reg/Dec Exec Mem WrB0: LoadIfetch Reg/Dec Exec Mem WrB4: R-typeIfetch Reg/Dec Exec Mem WrB8: StoreIfetch Reg/Dec Exec Mem WrB12: Beq (target is 1000)End ofCycle 4End ofCycle 5End ofCycle 6End ofCycle 7A More Extensive Pipelining ExampleSpring 2008 Siddhartha Chatterjee 10WrClkCycle 1Multiple Cycle Implementation:Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9Cycle 10Load Ifetch Reg Exec Mem WrIfetch Reg Exec MemLoad StorePipelined Implementation:Ifetch Reg Exec Mem WrStoreClkSingle Cycle Implementation:Load Store WasteIfetchR-typeIfetch Reg Exec Mem WrR-typeCycle 1 Cycle 2Ifetch Reg Exec MemSingle Cycle vs. Multiple Cycle vs. PipelinedSpring 2008 Siddhartha Chatterjee 11Basics of PipeliningTimeDiscrete time stepsRepresented as 1, 2, 3, …SpacePipe stages or segments (things that do processing)Represented as P, Q, R, S (or F, D, X, M, W for the DLX pipeline)OperandsInstructions or data itemsThings that flow through, and are processed by, the pipelineRepresented as a, b, c, …In drawing pipelines, we conceal the obvious fact that each operand undergoes some changes in each pipe stageSpring 2008 Siddhartha Chatterjee 12Notations for Describing Pipelines1 2 3 4 5 …P a b c d ?Q - a b c dR - - a b cS - - - a b1 2 3 4 5 …a P Q R Sb P Q R Sc P Q Rd P Q…•Space-time diagram, or Gantt chart•Reservation table by stages•Rows represent pipeline stages•Unbounded one way•Notation of HP•Reservation table by instructions•Rows represent operands•Unbounded both waysSpring 2008 Siddhartha Chatterjee 13Basic TermsFilling a pipelineFlushing or draining a pipelineStage or segment delayEach stage may have a different stage delayBeat time (= max stage delay)Number of stagesEnd-to-end latencynumber of stages × beat timeStages are separated by latches (registers)Spring 2008 Siddhartha Chatterjee 14Speedup & Throughput of a Linear PipelinettNntNTNNnNnnNTTnNnNtTtNntTnNtTtNnppseqppseq1)1( Throughput),min(1 pipeline of Speedup)CPI1( 11 pipelining with CPI)1( pipelining with timeExecution pipelining without timeExecution delay
View Full Document