Unformatted text preview:

Pipelining I OverviewReal-World Pipelines: Car WashesLaundry exampleSequential LaundryPipelined Laundry: Start ASAPPipelining LessonsLatency and ThroughputRelationship between Latency and ThroughputComputational Example3-Way Pipelined VersionPipeline DiagramsOperating a PipelineLimitations: Nonuniform DelaysLimitations: Register OverheadCPU Performance EquationCycles Per Instruction (CPI)Comparing and Summarizing PerformanceMeansSlide Number 20Is Speed the Last Word in Performance?Revisiting the Performance EqnData DependenciesData HazardsData Dependencies in ProcessorsSEQ HardwareSEQ+ HardwareAdding Pipeline RegistersPipeline StagesSummaryPipelining ITopics Pipelining principles Pipeline overheads Pipeline registers and stagesSystems I2OverviewWhat’s wrong with the sequential (SEQ) Y86? It’s slow! Each piece of hardware is used only a small fraction of time We would like to find a way to get more performance with only a little more hardwareGeneral Principles of Pipelining Goal DifficultiesCreating a Pipelined Y86 Processor Rearranging SEQ Inserting pipeline registers Problems with data and control hazards3Real-World Pipelines: Car WashesIdea Divide process into independent stages Move objects through stages in sequence At any given times, multiple objects being processedSequential ParallelPipelined4Laundry exampleAnn, Brian, Cathy, Dave each have one load of clothes to wash, dry, and foldWasher takes 30 minutesDryer takes 30 minutes“Folder” takes 30 minutes“Stasher” takes 30 minutesto put clothes into drawersA B C DSlide courtesy of D. Patterson5Sequential LaundrySequential laundry takes 8 hours for 4 loadsIf they learned pipelining, how long would laundry take? 30TaskOrderBCDATime30 30 3030 30 3030 30 30 3030 30 30 30306 PM78910111212 AMSlide courtesy of D. Patterson6Pipelined Laundry: Start ASAPPipelined laundry takes 3.5 hours for 4 loads!TaskOrder122 AM6 PM78910111TimeBCDA3030 30 303030 30Slide courtesy of D. Patterson7Pipelining LessonsPipelining doesn’t help latencyof single task, it helps throughput of entire workloadMultiple tasks operating simultaneously using different resourcesPotential speedup = Number pipe stagesPipeline rate limited by slowestpipeline stageUnbalanced lengths of pipe stages reduces speedupTime to “fill” pipeline and time to “drain” it reduces speedupStall for Dependences6 PM7 8 9TimeBCDA3030 30 303030 30TaskOrderSlide courtesy of D. Patterson8Latency and ThroughputLatency: time to complete an operationThroughput: work completed per unit timeConsider plumbing Low latency: turn on faucet and water comes out High bandwidth: lots of water (e.g., to fill a pool)What is “High speed Internet?” Low latency: needed to interactive gaming High bandwidth: needed for downloading large files Marketing departments like to conflate latency and bandwidth…9Relationship between Latency and ThroughputLatency and bandwidth only loosely coupled Henry Ford: assembly lines increase bandwidth without reducing latencyMy factory takes 1 day to make a Model-T ford. But I can start building a new car every 10 minutes At 24 hrs/day, I can make 24 * 6 = 144 cars per day A special order for 1 green car, still takes 1 day Throughput is increased, but latency is not.Latency reduction is difficultOften, one can buy bandwidth E.g., more memory chips, more disks, more computers Big server farms (e.g., google) are high bandwidth10Computational ExampleSystem Computation requires total of 300 picoseconds Additional 20 picoseconds to save result in register Must have clock cycle of at least 320 psCombinationallogicReg300 ps 20 psClockDelay = 320 psThroughput = 3.12 GOPS113-Way Pipelined VersionSystem Divide combinational logic into 3 blocks of 100 ps each Can begin new operation as soon as previous one passes through stage A. Begin new operation every 120 ps Overall latency increases 360 ps from start to finishRegClockComb.logicARegComb.logicBRegComb.logicC100 ps 20 ps 100 ps 20 ps 100 ps 20 psDelay = 360 psThroughput = 8.33 GOPS12Pipeline DiagramsUnpipelined Cannot start new operation until previous one completes3-Way Pipelined Up to 3 operations in process simultaneouslyTimeOP1OP2OP3TimeA B CA B CA B COP1OP2OP313Operating a PipelineTimeOP1OP2OP3A B CA B CA B C0 120 240 360 480 640ClockRegClockComb.logicARegComb.logicBRegComb.logicC100 ps 20 ps 100 ps 20 ps 100 ps 20 ps239RegClockComb.logicARegComb.logicBRegComb.logicC100 ps 20 ps 100 ps 20 ps 100 ps 20 ps241RegRegReg100 ps 20 ps 100 ps 20 ps 100 ps 20 psComb.logicAComb.logicBComb.logicCClock300RegClockComb.logicARegComb.logicBRegComb.logicC100 ps 20 ps 100 ps 20 ps 100 ps 20 ps35914Limitations: Nonuniform Delays Throughput limited by slowest stage Other stages sit idle for much of the time Challenging to partition system into balanced stagesRegClockRegComb.logicBRegComb.logicC50 ps 20 ps 150 ps 20 ps 100 ps 20 psDelay = 510 psThroughput = 5.88 GOPSComb.logicATimeOP1OP2OP3A B CA B CA B C15Limitations: Register Overhead As try to deepen pipeline, overhead of loading registers becomes more significant Percentage of clock cycle spent loading register: 1-stage pipeline: 6.25%  3-stage pipeline: 16.67%  6-stage pipeline: 28.57% High speeds of modern processor designs obtained through very deep pipeliningDelay = 420 ps, Throughput = 14.29 GOPSClockRegComb.logic50 ps 20 psRegComb.logic50 ps 20 psRegComb.logic50 ps 20 psRegComb.logic50 ps 20 psRegComb.logic50 ps 20 psRegComb.logic50 ps 20 ps16CPU Performance Equation3 components to execution time:Factors affecting CPU execution time:CycleSecondsnInstructioCyclesProgramnsInstructioProgramSeconds timeCPU ∗∗==Inst. Count CPI Clock RateProgram XCompiler X (X)Inst. Set X X (X)Organization X XMicroArch X XTechnology X• Consider all three elements when optimizing• Workloads change!17Cycles Per Instruction (CPI)Depends on the instructionAverage cycles per instructionExample:RateClock n instructio of timeExecution ∗= iCPIi∑==∗=nitotiiiiICICFFCPICPI1 whereOp Freq Cycles CPI(i) %timeALU 50% 1 0.5 33%Load 20% 2 0.4 27%Store 10% 2 0.2 13%Branch 20% 2 0.4 27%CPI(total) 1.518Comparing and Summarizing PerformanceFair way to summarize performance?Capture in a single number?Example: Which of the following machines is best?Computer A Computer B Computer CProgram 1 1 10 20Program 2 1000 100 20Total Time 1001 110


View Full Document

UT CS 429H - Pipelining I

Download Pipelining I
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Pipelining I and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Pipelining I 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?