Pipelining I

Outline
  Overview
  Real-World Pipelines: Car Washes
  Laundry example
  Sequential Laundry
  Pipelined Laundry: Start ASAP
  Pipelining Lessons
  Latency and Throughput
  Relationship between Latency and Throughput
  Computational Example
  3-Way Pipelined Version
  Pipeline Diagrams
  Operating a Pipeline
  Limitations: Nonuniform Delays
  Limitations: Register Overhead
  CPU Performance Equation
  Cycles Per Instruction (CPI)
  Comparing and Summarizing Performance
  Means
  Is Speed the Last Word in Performance?
  Revisiting the Performance Eqn
  Data Dependencies
  Data Hazards
  Data Dependencies in Processors
  SEQ Hardware
  SEQ+ Hardware
  Adding Pipeline Registers
  Pipeline Stages
  Summary

Pipelining I
Systems I
Topics
  Pipelining principles
  Pipeline overheads
  Pipeline registers and stages

Overview
What’s wrong with the sequential (SEQ) Y86?
  It’s slow!
  Each piece of hardware is used only a small fraction of the time
  We would like to find a way to get more performance with only a little more hardware
General Principles of Pipelining
  Goal
  Difficulties
Creating a Pipelined Y86 Processor
  Rearranging SEQ
  Inserting pipeline registers
  Problems with data and control hazards

Real-World Pipelines: Car Washes
Idea
  Divide process into independent stages
  Move objects through stages in sequence
  At any given time, multiple objects are being processed
[Figure: sequential, parallel, and pipelined car washes]

Laundry example
Ann, Brian, Cathy, and Dave each have one load of clothes to wash, dry, and fold
  Washer takes 30 minutes
  Dryer takes 30 minutes
  “Folder” takes 30 minutes
  “Stasher” takes 30 minutes to put clothes into drawers
[Figure: loads A, B, C, D]
Slide courtesy of D. Patterson

Sequential Laundry
Sequential laundry takes 8 hours for 4 loads
If they learned pipelining, how long would laundry take?
[Figure: task order A, B, C, D versus time, 6 PM to 2 AM, in 30-minute steps]
Slide courtesy of D. Patterson

Pipelined Laundry: Start ASAP
Pipelined laundry takes 3.5 hours for 4 loads!
[Figure: task order A, B, C, D versus time, 6 PM to 2 AM, in 30-minute steps]
Slide courtesy of D. Patterson

Pipelining Lessons
  Pipelining doesn’t help the latency of a single task; it helps the throughput of the entire workload
  Multiple tasks operate simultaneously using different resources
  Potential speedup = number of pipe stages
  Pipeline rate is limited by the slowest pipeline stage
  Unbalanced lengths of pipe stages reduce speedup
  Time to “fill” the pipeline and time to “drain” it reduce speedup
  Stall for dependences
[Figure: task order A, B, C, D versus time, 6 PM to 9 PM, in 30-minute steps]
Slide courtesy of D. Patterson

Latency and Throughput
Latency: time to complete an operation
Throughput: work completed per unit time
Consider plumbing
  Low latency: turn on faucet and water comes out
  High bandwidth: lots of water (e.g., to fill a pool)
What is “high speed Internet”?
  Low latency: needed for interactive gaming
  High bandwidth: needed for downloading large files
  Marketing departments like to conflate latency and bandwidth…

Relationship between Latency and Throughput
Latency and bandwidth are only loosely coupled
  Henry Ford: assembly lines increase bandwidth without reducing latency
My factory takes 1 day to make a Model T Ford.
  But I can start building a new car every 10 minutes
  At 24 hrs/day, I can make 24 * 6 = 144 cars per day
  A special order for 1 green car still takes 1 day
  Throughput is increased, but latency is not.
Latency reduction is difficult
Often, one can buy bandwidth
  E.g., more memory chips, more disks, more computers
  Big server farms (e.g., Google) are high bandwidth

Computational Example
System
  Computation requires a total of 300 picoseconds
  Additional 20 picoseconds to save the result in a register
  Must have a clock cycle of at least 320 ps
[Figure: combinational logic (300 ps) feeding a register (20 ps), driven by a clock]
Delay = 320 ps
Throughput = 3.12 GOPS

3-Way Pipelined Version
System
  Divide combinational logic into 3 blocks of 100 ps each
  Can begin a new operation as soon as the previous one passes through stage A
    Begin a new operation every 120 ps
  Overall latency increases
    360 ps from start to finish
[Figure: comb. logic A, B, C (100 ps each), each followed by a register (20 ps), driven by a common clock]
Delay = 360 ps
Throughput = 8.33 GOPS

Pipeline Diagrams
Unpipelined
  Cannot start a new operation until the previous one completes
3-Way Pipelined
  Up to 3 operations in process simultaneously
[Figure: timelines for OP1, OP2, OP3, unpipelined and 3-way pipelined through stages A, B, C]

Operating a Pipeline
[Figure: OP1, OP2, OP3 passing through stages A, B, C; time axis marked 0, 120, 240, 360, 480, 640; snapshots of the three-stage pipeline (100 ps logic + 20 ps register per stage) at times 239, 241, 300, and 359]

Limitations: Nonuniform Delays
  Throughput is limited by the slowest stage
  Other stages sit idle for much of the time
  Challenging to partition the system into balanced stages
[Figure: comb. logic A (50 ps), B (150 ps), C (100 ps), each followed by a register (20 ps); pipeline diagram for OP1, OP2, OP3]
Delay = 510 ps
Throughput = 5.88 GOPS

Limitations: Register Overhead
  As we try to deepen the pipeline, the overhead of loading registers becomes more significant
  Percentage of clock cycle spent loading register:
    1-stage pipeline: 6.25%
    3-stage pipeline: 16.67%
    6-stage pipeline: 28.57%
  High speeds of modern processor designs
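
Worked check of the slide figures
The delay, throughput, and register-overhead numbers on the preceding slides can be reproduced with a few lines of arithmetic. The sketch below is not part of the original deck. The function name pipeline_metrics and its parameters are made up for illustration, and it assumes one operation enters the pipeline per clock cycle, with the clock period set by the slowest stage plus the 20 ps register delay.

```python
def pipeline_metrics(stage_logic_ps, reg_ps=20):
    """Return (clock_ps, latency_ps, throughput_gops, overhead_pct).

    stage_logic_ps: combinational-logic delay of each stage, in ps.
    reg_ps: time to load a pipeline register, in ps (20 ps on the slides).
    Assumes every stage is followed by a register and that the clock
    period is dictated by the slowest stage.
    """
    clock_ps = max(stage_logic_ps) + reg_ps        # slowest stage sets the clock
    latency_ps = clock_ps * len(stage_logic_ps)    # one clock per stage
    throughput_gops = 1000.0 / clock_ps            # 1 op per clock; 1/ps = 1000 GOPS
    overhead_pct = 100.0 * reg_ps / clock_ps       # share of the cycle spent in the register
    return clock_ps, latency_ps, throughput_gops, overhead_pct


# Configurations taken from the slides (the 6-stage split into 50 ps
# blocks is implied by the 28.57% figure).
configs = {
    "unpipelined": [300],
    "3-way, balanced": [100, 100, 100],
    "3-way, nonuniform": [50, 150, 100],
    "6-way, balanced": [50] * 6,
}

for name, stages in configs.items():
    clock, latency, gops, overhead = pipeline_metrics(stages)
    print(f"{name}: clock {clock} ps, latency {latency} ps, "
          f"{gops:.2f} GOPS, register overhead {overhead:.2f}%")
```

Note that the exact unpipelined throughput is 3.125 GOPS; the slide shows it truncated to 3.12.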