Slide 1Pipelining: It’s Natural!Sequential LaundryPipelined Laundry: Start work ASAPPipelining LessonsThe Five Stages of a Load InstructionKey Ideas Behind Instruction PipeliningPipelining the Load InstructionSingle Cycle vs. Multiple Cycle vs. PipelinedBasics of PipeliningBasic TermsSpeedup & Throughput of a Linear PipelineA More Extensive Pipelining ExampleData Hazard: SetupData Hazard: DefinitionData Hazard: Effect on CompilerWhy Data Hazards OccurData Dependence and HazardsMore on HazardsData Hazard: Effect on PipeliningSolution: Interlocks and StallingOptimization: Value ForwardingExample of ForwardingForwarding & StallingCompile-Time SchedulingControl HazardMore on Control HazardsDelayed Branches on DLXPipelining Multicycle OperationsSpace-Time Diagram: Multicycle OperationsFloating-Point Operations in DLXCS 378Programming for PerformanceSingle-Thread Performance: PipeliningAdopted fromSiddhartha ChatterjeeSpring 20092A B C DPipelining: It’s Natural!Laundry ExampleAnn, Brian, Cathy, Dave each have one load of clothes to wash, dry, and foldWasher takes 30 minutesDryer takes 40 minutes“Folder” takes 20 minutes3Sequential laundry takes 6 hours for 4 loadsIf they learned pipelining, how long would laundry take? ABCD30 40 20 30 40 20 30 40 20 30 40 206 PM7 8 91011MidnightTaskOrderTimeSequential Laundry4Pipelined laundry takes 3.5 hours for 4 loads ABCD6 PM7 8 91011MidnightTaskOrderTime30 40 40 40 40 20Pipelined Laundry: Start work ASAP5ABCD6 PM7 8 9TaskOrderTime30 40 404040 20Pipelining LessonsPipelining doesn’t help latency of single task, it helps throughput of entire workloadPipeline rate limited by slowest pipeline stageMultiple tasks operating simultaneouslyPotential speedup = Number pipe stagesUnbalanced lengths of pipe stages reduces speedupTime to “fill” pipeline and time to “drain” it reduces speedup6Ifetch: Instruction fetchFetch the instruction from the instruction memoryReg/Dec: Instruction decode and register readExec: Calculate the memory addressMem: Read the data from the data memoryWrB: Write the data back to the register fileCycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5Ifetch Reg/Dec Exec Mem WrBLoadThe Five Stages of a Load Instruction7The load instruction has 5 stages: Five independent functional units to work on each stage•Each functional unit is used only once!A second load can start doing Ifetch as soon as the first load finishes its Ifetch stageEach load still takes five cycles to complete•The latency of a single load is still 5 cyclesThe throughput is much higher•CPI approaches 1 •Cycle time is ~1/5th the cycle time of the single-cycle implementationInstructions start executing before previous instructions complete executionIfetch Reg/Dec Exec Mem WrBLoadKey Ideas Behind Instruction PipeliningCPI Cycle time 8Pipelining the Load InstructionThe five independent pipeline stages are:Read next instruction: The Ifetch stageDecode instruction and fetch register values: The Reg/Dec stageExecute the operation: The Exec stageAccess data memory: The Mem stageWrite data to destination register: The WrB stageOne instruction enters the pipeline every cycleOne instruction comes out of the pipeline (completed) every cycleThe “effective” CPI is 7/3 (tends to 1)ClockCycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7Ifetch Reg/Dec Exec Mem WrB1st lwIfetch Reg/Dec Exec Mem WrB2nd lwIfetch Reg/Dec Exec Mem WrB3rd lw9WrClkCycle 1Multiple Cycle Implementation:Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9Cycle 10LoadIfetch Reg Exec Mem WrIfetch Reg Exec MemLoad StorePipelined Implementation:Ifetch Reg Exec Mem WrStoreClkSingle Cycle Implementation:Load Store WasteIfetchR-typeIfetch Reg Exec Mem WrR-typeCycle 1 Cycle 2Ifetch Reg Exec MemSingle Cycle vs. Multiple Cycle vs. Pipelined10Basics of PipeliningTimeDiscrete time stepsRepresented as 1, 2, 3, …SpacePipe stages or segments (things that do processing)Represented as F, D, X, M, W (for the DLX pipeline)OperandsInstructions or data itemsThings that flow through, and are processed by, the pipelineRepresented as a, b, c, …In drawing pipelines, we conceal the obvious fact that each operand undergoes some changes in each pipe stage11Basic TermsFilling a pipelineFlushing or draining a pipelineStage or segment delayEach stage may have a different stage delayBeat time (= max stage delay)Number of stagesEnd-to-end latencynumber of stages × beat timeStages are separated by latches (registers)12Speedup & Throughput of a Linear PipelinettNntNTNNnNnnNTTnNnNtTtNntTnNtTtNnppseqppseq1)1( Throughput),min(1 pipeline of Speedup)CPI1( 11 pipelining with CPI)1( pipelining with timeExecution pipelining without timeExecution delay Stage1 operands ofNumber 1 stages pipe ofNumber 13End of Cycle 4: Load’s Mem, R-type’s Exec, Store’s Reg, Beq’s IfetchEnd of Cycle 5: Load’s WrB, R-type’s Mem, Store’s Exec, Beq’s RegEnd of Cycle 6: R-type’s WrB, Store’s Mem, Beq’s ExecEnd of Cycle 7: Store’s WrB, Beq’s MemClockCycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8Ifetch Reg/Dec Exec Mem WrB0: LoadIfetch Reg/Dec Exec Mem WrB4: R-typeIfetch Reg/Dec Exec Mem WrB8: StoreIfetch Reg/Dec Exec Mem WrB12: Beq (target is 1000)End ofCycle 4End ofCycle 5End ofCycle 6End ofCycle 7A More Extensive Pipelining Example14Data Hazard: SetupInstruction uInstruction uD(u): domain of instruction u The set of all memory locations, registers (including implicit ones), flags, condition codes etc. that may be read by instruction uR(u): range of instruction u The set of all memory locations, registers (including implicit ones), flags, condition codes etc. that may be written by instruction uInstruction uInstruction vInstruction uInstruction vu < v is a relation that means that instructionu precedes instruction v in the original programorder (i.e., on an unpipelined machine)•The relation < is irreflexive, anti-symmetric, and transitive15Data Hazard: DefinitionGiven two instructions u and v, such that u < v, there is adata hazard between them if any of the following conditionsholds:hazard (WAW) Write-After-ite Wr)()(hazard (WAR) Read-After-ite Wr)()(hazard (RAW)
View Full Document