Unformatted text preview:

Pipelining IOverviewReal-World Pipelines: Car WashesLaundry exampleSequential LaundryPipelined Laundry: Start ASAPPipelining LessonsLatency and ThroughputRelationship between Latency and ThroughputComputational Example3-Way Pipelined VersionPipeline DiagramsOperating a PipelineLimitations: Nonuniform DelaysLimitations: Register OverheadCPU Performance EquationCycles Per Instruction (CPI)Comparing and Summarizing PerformanceMeansPowerPoint PresentationIs Speed the Last Word in Performance?Revisiting the Performance EqnData DependenciesData HazardsData Dependencies in ProcessorsSEQ HardwareSEQ+ HardwareAdding Pipeline RegistersPipeline StagesSummaryPipelining ITopicsTopicsPipelining principlesPipeline overheadsPipeline registers and stagesSystems I2OverviewWhat’s wrong with the sequential (SEQ) Y86?What’s wrong with the sequential (SEQ) Y86?It’s slow!Each piece of hardware is used only a small fraction of timeWe would like to find a way to get more performance with only a little more hardwareGeneral Principles of PipeliningGeneral Principles of PipeliningGoalDifficultiesCreating a Pipelined Y86 ProcessorCreating a Pipelined Y86 ProcessorRearranging SEQInserting pipeline registersProblems with data and control hazards3Real-World Pipelines: Car WashesIdeaIdeaDivide process into independent stagesMove objects through stages in sequenceAt any given times, multiple objects being processedSequential ParallelPipelined4Laundry exampleAnn, Brian, Cathy, Dave Ann, Brian, Cathy, Dave each have one load of clothes each have one load of clothes to wash, dry, and foldto wash, dry, and foldWasher takes 30 minutesWasher takes 30 minutesDryer takes 30 minutesDryer takes 30 minutes““Folder” takes 30 minutesFolder” takes 30 minutes““Stasher” takes 30 minutesStasher” takes 30 minutesto put clothes into drawersto put clothes into drawersA B C DSlide courtesy of D. Patterson5Sequential LaundrySequential laundry takes 8 hours for 4 loadsSequential laundry takes 8 hours for 4 loadsIf they learned pipelining, how long would laundry take? If they learned pipelining, how long would laundry take? 30TaskOrderBCDATime3030 3030 30 3030 3030 3030 3030 30306 PM78910111212 AMSlide courtesy of D. Patterson6Pipelined Laundry: Start ASAPPipelined laundry takes 3.5 hours for 4 loads!Pipelined laundry takes 3.5 hours for 4 loads! TaskOrder122 AM6 PM78910111TimeBCDA303030 30303030Slide courtesy of D. Patterson7Pipelining LessonsPipelining doesn’t help Pipelining doesn’t help latencylatency of single task, it helps of single task, it helps throughputthroughput of entire workload of entire workloadMultipleMultiple tasks operating tasks operating simultaneously using simultaneously using different resourcesdifferent resourcesPotential speedup = Potential speedup = Number Number pipe stagespipe stagesPipeline rate limited by Pipeline rate limited by slowestslowest pipeline stagepipeline stageUnbalanced lengths of pipe Unbalanced lengths of pipe stages reduces speedupstages reduces speedupTime to “Time to “fillfill” pipeline and time ” pipeline and time to “to “draindrain” it reduces speedup” it reduces speedupStall for DependencesStall for Dependences6 PM7 8 9TimeBCDA303030 30303030TaskOrderSlide courtesy of D. Patterson8Latency and ThroughputLatency: time to complete an operationLatency: time to complete an operationThroughput: work completed per unit timeThroughput: work completed per unit timeConsider plumbingConsider plumbingLow latency: turn on faucet and water comes outHigh bandwidth: lots of water (e.g., to fill a pool)What is “High speed Internet?”What is “High speed Internet?”Low latency: needed to interactive gamingHigh bandwidth: needed for downloading large filesMarketing departments like to conflate latency and bandwidth…9Relationship between Latency and ThroughputLatency and bandwidth only loosely coupledLatency and bandwidth only loosely coupledHenry Ford: assembly lines increase bandwidth without reducing latencyMy factory takes 1 day to make a Model-T ford.My factory takes 1 day to make a Model-T ford.But I can start building a new car every 10 minutesAt 24 hrs/day, I can make 24 * 6 = 144 cars per dayA special order for 1 green car, still takes 1 dayThroughput is increased, but latency is not.Latency reduction is difficultLatency reduction is difficultOften, one can buy bandwidthOften, one can buy bandwidthE.g., more memory chips, more disks, more computersBig server farms (e.g., google) are high bandwidth10Computational ExampleSystemSystemComputation requires total of 300 picosecondsAdditional 20 picoseconds to save result in registerMust have clock cycle of at least 320 psCombinationallogicReg300 ps 20 psClockDelay = 320 psThroughput = 3.12 GOPS113-Way Pipelined VersionSystemSystemDivide combinational logic into 3 blocks of 100 ps eachCan begin new operation as soon as previous one passes through stage A.Begin new operation every 120 psOverall latency increases360 ps from start to finishRegClockComb.logicARegComb.logicBRegComb.logicC100 ps 20 ps 100 ps 20 ps 100 ps 20 psDelay = 360 psThroughput = 8.33 GOPS12Pipeline DiagramsUnpipelinedUnpipelinedCannot start new operation until previous one completes3-Way Pipelined3-Way PipelinedUp to 3 operations in process simultaneouslyTimeOP1OP2OP3TimeA B CA B CA B COP1OP2OP313Operating a PipelineTimeOP1OP2OP3A B CA B CA B C0 120 240 360 480 640ClockRegClockComb.logicARegComb.logicBRegComb.logicC100 ps 20 ps 100 ps 20 ps 100 ps 20 ps239RegClockComb.logicARegComb.logicBRegComb.logicC100 ps 20 ps 100 ps 20 ps 100 ps 20 ps241RegRegReg100 ps 20 ps 100 ps 20 ps 100 ps 20 psComb.logicAComb.logicBComb.logicCClock300RegClockComb.logicARegComb.logicBRegComb.logicC100 ps 20 ps 100 ps 20 ps 100 ps 20 ps35914Limitations: Nonuniform DelaysThroughput limited by slowest stageOther stages sit idle for much of the timeChallenging to partition system into balanced stagesRegClockRegComb.logicBRegComb.logicC50 ps 20 ps 150 ps 20 ps 100 ps 20 psDelay = 510 psThroughput = 5.88 GOPSComb.logicATimeOP1OP2OP3A B CA B CA B C15Limitations: Register OverheadAs try to deepen pipeline, overhead of loading registers becomes more significantPercentage of clock cycle spent loading register:1-stage pipeline: 6.25% 3-stage pipeline: 16.67% 6-stage pipeline: 28.57%High speeds of modern processor designs


View Full Document

UT CS 429H - Pipelining

Download Pipelining
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Pipelining and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Pipelining 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?