UT CS 378 - Programming for Performance Single-Thread Performance: Pipelining - D1711204

Home> Schools> University of Texas at Austin> Computer Science (CS) > CS 378> Programming for Performance Single-Thread Performance: Pipelining

DOC PREVIEW

UT CS 378 - Programming for Performance Single-Thread Performance: Pipelining

School name University of Texas at Austin

Course Cs 378- Declarative Programming

Pages 31

This preview shows page 1-2-14-15-30-31 out of 31 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 31 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 31 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 31 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 31 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 31 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 31 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 31 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

Slide 1Pipelining: It’s Natural!Sequential LaundryPipelined Laundry: Start work ASAPPipelining LessonsThe Five Stages of a Load InstructionKey Ideas Behind Instruction PipeliningPipelining the Load InstructionSingle Cycle vs. Multiple Cycle vs. PipelinedBasics of PipeliningBasic TermsSpeedup & Throughput of a Linear PipelineA More Extensive Pipelining ExampleData Hazard: SetupData Hazard: DefinitionData Hazard: Effect on CompilerWhy Data Hazards OccurData Dependence and HazardsMore on HazardsData Hazard: Effect on PipeliningSolution: Interlocks and StallingOptimization: Value ForwardingExample of ForwardingForwarding & StallingCompile-Time SchedulingControl HazardMore on Control HazardsDelayed Branches on DLXPipelining Multicycle OperationsSpace-Time Diagram: Multicycle OperationsFloating-Point Operations in DLXCS 378Programming for PerformanceSingle-Thread Performance: PipeliningAdopted fromSiddhartha ChatterjeeSpring 20092A B C DPipelining: It’s Natural!Laundry ExampleAnn, Brian, Cathy, Dave each have one load of clothes to wash, dry, and foldWasher takes 30 minutesDryer takes 40 minutes“Folder” takes 20 minutes3Sequential laundry takes 6 hours for 4 loadsIf they learned pipelining, how long would laundry take? ABCD30 40 20 30 40 20 30 40 20 30 40 206 PM7 8 91011MidnightTaskOrderTimeSequential Laundry4Pipelined laundry takes 3.5 hours for 4 loads ABCD6 PM7 8 91011MidnightTaskOrderTime30 40 40 40 40 20Pipelined Laundry: Start work ASAP5ABCD6 PM7 8 9TaskOrderTime30 40 404040 20Pipelining LessonsPipelining doesn’t help latency of single task, it helps throughput of entire workloadPipeline rate limited by slowest pipeline stageMultiple tasks operating simultaneouslyPotential speedup = Number pipe stagesUnbalanced lengths of pipe stages reduces speedupTime to “fill” pipeline and time to “drain” it reduces speedup6Ifetch: Instruction fetchFetch the instruction from the instruction memoryReg/Dec: Instruction decode and register readExec: Calculate the memory addressMem: Read the data from the data memoryWrB: Write the data back to the register fileCycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5Ifetch Reg/Dec Exec Mem WrBLoadThe Five Stages of a Load Instruction7The load instruction has 5 stages: Five independent functional units to work on each stage•Each functional unit is used only once!A second load can start doing Ifetch as soon as the first load finishes its Ifetch stageEach load still takes five cycles to complete•The latency of a single load is still 5 cyclesThe throughput is much higher•CPI approaches 1 •Cycle time is ~1/5th the cycle time of the single-cycle implementationInstructions start executing before previous instructions complete executionIfetch Reg/Dec Exec Mem WrBLoadKey Ideas Behind Instruction PipeliningCPI Cycle time 8Pipelining the Load InstructionThe five independent pipeline stages are:Read next instruction: The Ifetch stageDecode instruction and fetch register values: The Reg/Dec stageExecute the operation: The Exec stageAccess data memory: The Mem stageWrite data to destination register: The WrB stageOne instruction enters the pipeline every cycleOne instruction comes out of the pipeline (completed) every cycleThe “effective” CPI is 7/3 (tends to 1)ClockCycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7Ifetch Reg/Dec Exec Mem WrB1st lwIfetch Reg/Dec Exec Mem WrB2nd lwIfetch Reg/Dec Exec Mem WrB3rd lw9WrClkCycle 1Multiple Cycle Implementation:Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9Cycle 10LoadIfetch Reg Exec Mem WrIfetch Reg Exec MemLoad StorePipelined Implementation:Ifetch Reg Exec Mem WrStoreClkSingle Cycle Implementation:Load Store WasteIfetchR-typeIfetch Reg Exec Mem WrR-typeCycle 1 Cycle 2Ifetch Reg Exec MemSingle Cycle vs. Multiple Cycle vs. Pipelined10Basics of PipeliningTimeDiscrete time stepsRepresented as 1, 2, 3, …SpacePipe stages or segments (things that do processing)Represented as F, D, X, M, W (for the DLX pipeline)OperandsInstructions or data itemsThings that flow through, and are processed by, the pipelineRepresented as a, b, c, …In drawing pipelines, we conceal the obvious fact that each operand undergoes some changes in each pipe stage11Basic TermsFilling a pipelineFlushing or draining a pipelineStage or segment delayEach stage may have a different stage delayBeat time (= max stage delay)Number of stagesEnd-to-end latencynumber of stages × beat timeStages are separated by latches (registers)12Speedup & Throughput of a Linear PipelinettNntNTNNnNnnNTTnNnNtTtNntTnNtTtNnppseqppseq1)1( Throughput),min(1 pipeline of Speedup)CPI1( 11 pipelining with CPI)1( pipelining with timeExecution pipelining without timeExecution delay Stage1 operands ofNumber 1 stages pipe ofNumber 13End of Cycle 4: Load’s Mem, R-type’s Exec, Store’s Reg, Beq’s IfetchEnd of Cycle 5: Load’s WrB, R-type’s Mem, Store’s Exec, Beq’s RegEnd of Cycle 6: R-type’s WrB, Store’s Mem, Beq’s ExecEnd of Cycle 7: Store’s WrB, Beq’s MemClockCycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8Ifetch Reg/Dec Exec Mem WrB0: LoadIfetch Reg/Dec Exec Mem WrB4: R-typeIfetch Reg/Dec Exec Mem WrB8: StoreIfetch Reg/Dec Exec Mem WrB12: Beq (target is 1000)End ofCycle 4End ofCycle 5End ofCycle 6End ofCycle 7A More Extensive Pipelining Example14Data Hazard: SetupInstruction uInstruction uD(u): domain of instruction u The set of all memory locations, registers (including implicit ones), flags, condition codes etc. that may be read by instruction uR(u): range of instruction u The set of all memory locations, registers (including implicit ones), flags, condition codes etc. that may be written by instruction uInstruction uInstruction vInstruction uInstruction vu < v is a relation that means that instructionu precedes instruction v in the original programorder (i.e., on an unpipelined machine)•The relation < is irreflexive, anti-symmetric, and transitive15Data Hazard: DefinitionGiven two instructions u and v, such that u < v, there is adata hazard between them if any of the following conditionsholds:hazard (WAW) Write-After-ite Wr)()(hazard (WAR) Read-After-ite Wr)()(hazard (RAW)

View Full Document