UT CS 378 - Programming for Performance Single-Thread Performance - D1738112

Home> Schools> University of Texas at Austin> Computer Science (CS) > CS 378> Programming for Performance Single-Thread Performance

DOC PREVIEW

UT CS 378 - Programming for Performance Single-Thread Performance

School name University of Texas at Austin

Course Cs 378- Declarative Programming

Pages 36

This preview shows page 1-2-17-18-19-35-36 out of 36 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 36 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 36 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 36 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 36 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 36 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 36 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 36 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 36 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

CS 378 Programming for Performance Single-Thread Performance: Review of PipeliningPipelining: It’s Natural!Sequential LaundryPipelined Laundry: Start work ASAPPipelining LessonsThe Five Stages of LoadKey Ideas Behind Instruction PipeliningPipelining the Load InstructionA More Extensive Pipelining ExampleSingle Cycle vs. Multiple Cycle vs. PipelinedBasics of PipeliningNotations for Describing PipelinesBasic TermsSpeedup & Throughput of a Linear PipelineData Hazard: SetupData Hazard: DefinitionWhy Data Hazards OccurData Dependence and HazardsMore on HazardsThe Precedence RelationExample of Precedence RelationData Hazard: Effect on CompilerData Hazard: Effect on PipeliningSolution: Interlocks and StallingOptimization: Value ForwardingExample of ForwardingForwarding & StallingCompile-Time SchedulingCode Generation Examples for BranchesControl HazardMore on Control HazardsDelayed Branches on DLXDetails of Various Branch FlavorsPipelining Multicycle OperationsSpace-Time Diagram: Multicycle OperationsFloating-Point Operations in DLXCS 378Programming for PerformanceSingle-Thread Performance: Review of PipeliningSiddhartha ChatterjeeSpring 2008Spring 2008 Siddhartha Chatterjee 2A B C DPipelining: It’s Natural!Laundry ExampleAnn, Brian, Cathy, Dave each have one load of clothes to wash, dry, and foldWasher takes 30 minutesDryer takes 40 minutes“Folder” takes 20 minutesSpring 2008 Siddhartha Chatterjee 3Sequential laundry takes 6 hours for 4 loadsIf they learned pipelining, how long would laundry take? ABCD30 40 20 30 40 20 30 40 20 30 40 206 PM7 8 91011MidnightTaskOrderTimeSequential LaundrySpring 2008 Siddhartha Chatterjee 4Pipelined laundry takes 3.5 hours for 4 loads ABCD6 PM7 8 91011MidnightTaskOrderTime30 40 40 40 40 20Pipelined Laundry: Start work ASAPSpring 2008 Siddhartha Chatterjee 5ABCD6 PM7 8 9TaskOrderTime30 40 40 40 40 20Pipelining LessonsPipelining doesn’t help latency of single task, it helps throughput of entire workloadPipeline rate limited by slowest pipeline stageMultiple tasks operating simultaneouslyPotential speedup = Number pipe stagesUnbalanced lengths of pipe stages reduces speedupTime to “fill” pipeline and time to “drain” it reduces speedupSpring 2008 Siddhartha Chatterjee 6Ifetch: Instruction FetchFetch the instruction from the Instruction MemoryReg/Dec: Registers Fetch and Instruction DecodeExec: Calculate the memory addressMem: Read the data from the Data MemoryWrB: Write the data back to the register fileCycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5Ifetch Reg/Dec Exec Mem WrBLoadThe Five Stages of LoadSpring 2008 Siddhartha Chatterjee 7The load instruction has 5 stages: Five independent functional units to work on each stage•Each functional unit is used only once!A second load can start doing Ifetch as soon as the first load finishes its Ifetch stageEach load still takes five cycles to complete•The latency of a single load is still 5 cyclesThe throughput is much higher•CPI approaches 1 •Cycle time is ~1/5th the cycle time of the single-cycle implementationInstructions start executing before previous instructions complete executionIfetch Reg/Dec Exec Mem WrBLoadKey Ideas Behind Instruction PipeliningCPI Cycle time Spring 2008 Siddhartha Chatterjee 8Pipelining the Load InstructionThe five independent pipeline stages are:Read next instruction: The Ifetch stageDecode instruction and fetch register values: The Reg/Dec stageExecute the operation: The Exec stageAccess data memory: The Mem stageWrite data to destination register: The WrB stageOne instruction enters the pipeline every cycleOne instruction comes out of the pipeline (completed) every cycleThe “effective” CPI is 7/3 (tends to 1); ~1/5 cycle timeClockCycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7Ifetch Reg/Dec Exec Mem WrB1st lwIfetch Reg/Dec Exec Mem WrB2nd lwIfetch Reg/Dec Exec Mem WrB3rd lwSpring 2008 Siddhartha Chatterjee 9End of Cycle 4: Load’s Mem, R-type’s Exec, Store’s Reg, Beq’s IfetchEnd of Cycle 5: Load’s WrB, R-type’s Mem, Store’s Exec, Beq’s RegEnd of Cycle 6: R-type’s WrB, Store’s Mem, Beq’s ExecEnd of Cycle 7: Store’s WrB, Beq’s MemClockCycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8Ifetch Reg/Dec Exec Mem WrB0: LoadIfetch Reg/Dec Exec Mem WrB4: R-typeIfetch Reg/Dec Exec Mem WrB8: StoreIfetch Reg/Dec Exec Mem WrB12: Beq (target is 1000)End ofCycle 4End ofCycle 5End ofCycle 6End ofCycle 7A More Extensive Pipelining ExampleSpring 2008 Siddhartha Chatterjee 10WrClkCycle 1Multiple Cycle Implementation:Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9Cycle 10Load Ifetch Reg Exec Mem WrIfetch Reg Exec MemLoad StorePipelined Implementation:Ifetch Reg Exec Mem WrStoreClkSingle Cycle Implementation:Load Store WasteIfetchR-typeIfetch Reg Exec Mem WrR-typeCycle 1 Cycle 2Ifetch Reg Exec MemSingle Cycle vs. Multiple Cycle vs. PipelinedSpring 2008 Siddhartha Chatterjee 11Basics of PipeliningTimeDiscrete time stepsRepresented as 1, 2, 3, …SpacePipe stages or segments (things that do processing)Represented as P, Q, R, S (or F, D, X, M, W for the DLX pipeline)OperandsInstructions or data itemsThings that flow through, and are processed by, the pipelineRepresented as a, b, c, …In drawing pipelines, we conceal the obvious fact that each operand undergoes some changes in each pipe stageSpring 2008 Siddhartha Chatterjee 12Notations for Describing Pipelines1 2 3 4 5 …P a b c d ?Q - a b c dR - - a b cS - - - a b1 2 3 4 5 …a P Q R Sb P Q R Sc P Q Rd P Q…•Space-time diagram, or Gantt chart•Reservation table by stages•Rows represent pipeline stages•Unbounded one way•Notation of HP•Reservation table by instructions•Rows represent operands•Unbounded both waysSpring 2008 Siddhartha Chatterjee 13Basic TermsFilling a pipelineFlushing or draining a pipelineStage or segment delayEach stage may have a different stage delayBeat time (= max stage delay)Number of stagesEnd-to-end latencynumber of stages × beat timeStages are separated by latches (registers)Spring 2008 Siddhartha Chatterjee 14Speedup & Throughput of a Linear PipelinettNntNTNNnNnnNTTnNnNtTtNntTnNtTtNnppseqppseq1)1( Throughput),min(1 pipeline of Speedup)CPI1( 11 pipelining with CPI)1( pipelining with timeExecution pipelining without timeExecution delay

View Full Document