A relevant questionThe slow wayLaundry PipeliningPipelining LessonsPipelining is not just MultiprocessingPipeliningInstruction execution reviewSingle-cycle datapath diagramExample: Instruction Fetch (IF)Instruction Decode (ID)Execute (EX)Memory (MEM)Writeback (WB)A bunch of lazy functional unitsPutting those slackers to workDecoding and fetching togetherExecuting, decoding and fetchingMaking Pipelining WorkBreak datapath into 5 stagesPipelining LoadsPipelining PerformancePipeline Datapath: Resource RequirementsPipelining other instruction typesImportant ObservationA solution: Insert NOP stagesSummaryJanuary 14, 2019 ©2003 Craig Zilles (derived from slides by Howard Huang and David Patterson)1A relevant questionAssuming you’ve got:—One washer (takes 30 minutes)—One drier (takes 40 minutes)—One “folder” (takes 20 minutes)It takes 90 minutes to wash, dry, and fold 1 load of laundry.—How long does 4 loads take?January 14, 2019 Pipelining 2The slow wayIf each load is done sequentially it takes 6 hours30 40 20 30 40 20 30 40 20 30 40 206 PM7 8 91011MidnightTimeJanuary 14, 2019 Pipelining 3Laundry PipeliningStart each load as soon as possible—Overlap loadsPipelined laundry takes 3.5 hours6 PM7 8 91011MidnightTime30 40 40 40 40 20January 14, 2019 Pipelining 4Pipelining LessonsPipelining doesn’t help latency of single load, it helps throughput of entire workloadPipeline rate limited by slowest pipeline stageMultiple tasks operating simultaneously using different resourcesPotential speedup = Number pipe stagesUnbalanced lengths of pipe stages reduces speedupTime to “fill” pipeline and time to “drain” it reduces speedup6 PM7 8 9Time30 40 40 40 40 20March 17, 2003 Pipelining 5Pipelining is not just MultiprocessingPipelining does involve parallel processing, but in a specific way.Both multiprocessing and pipelining relate to the processing of multiple “things” using multiple “functional units” —Multiprocessing implies each thing is processed entirely by a single functional unit•e.g., multiple lanes at the supermarket—In pipelining, each thing is broken into a sequence of pieces, where each piece is handled by a different (specialized) functional unit.•Supermarket analogy?Pipelining and multiprocessing are not mutually exclusive—Modern processors do both, with multiple pipelines (e.g., superscalar)January 14, 2019 Pipelining 6PipeliningPipelining is a general-purpose efficiency technique—It is not specific to processorsPipelining is used in:—Assembly lines—Bucket brigades—Fast food restaurantsPipelining is used in other CS disciplines:—Networking—Server software architectureUseful to increase throughput in the presence of long latency—More on that later…January 14, 2019 Pipelining 7Instruction execution reviewExecuting a MIPS instruction can take up to five steps.However, as we saw, not all instructions need all five steps.Step NameDescriptionInstruction FetchIF Read an instruction from memory.Instruction DecodeID Read source registers and generate control signals.Execute EX Compute an R-type result or a branch outcome.Memory MEM Read or write the data memory.Writeback WB Store a result in the destination register.InstructionSteps requiredbeq IF ID EXR-type IF ID EX WBsw IF ID EX MEMlw IF ID EX MEM WBJanuary 14, 2019 Pipelining 8Single-cycle datapath diagram4Shiftleft 2PCAddAdd0Mux1PCSrcReadaddressWriteaddressWritedataDatamemoryReaddataMemWriteMemRead1Mux0MemToRegReadaddressInstructionmemoryInstruction[31-0]I [15 - 0]I [25 - 21]I [20 - 16]I [15 - 11]0Mux1RegDstReadregister 1Readregister 2WriteregisterWritedataReaddata 2Readdata 1RegistersRegWriteSignextend0Mux1ALUSrcResultZeroALUALUOp2ns2ns2ns1nsHow long does it take to execute each instruction?January 14, 2019 Pipelining 10Example: Instruction Fetch (IF)ReadaddressInstructionmemoryInstruction[31-0]ReadaddressWriteaddressWritedataDatamemoryReaddataMemWriteMemRead1Mux0MemToRegSignextend0Mux1ALUSrcResultZeroALUALUOpI [15 - 0]I [25 - 21]I [20 - 16]I [15 - 11]0Mux1RegDstReadregister 1Readregister 2WriteregisterWritedataReaddata 2Readdata 1RegistersRegWriteLet’s quickly review how lw is executed in the single-cycle datapath.We’ll ignore PC incrementing and branching for now.In the Instruction Fetch (IF) step, we read the instruction memory.January 14, 2019 Pipelining 11Instruction Decode (ID)ReadaddressInstructionmemoryInstruction[31-0]ReadaddressWriteaddressWritedataDatamemoryReaddataMemWriteMemRead1Mux0MemToRegSignextend0Mux1ALUSrcResultZeroALUALUOpI [15 - 0]I [25 - 21]I [20 - 16]I [15 - 11]0Mux1RegDstReadregister 1Readregister 2WriteregisterWritedataReaddata 2Readdata 1RegistersRegWriteThe Instruction Decode (ID) step reads the source register from the register file.January 14, 2019 Pipelining 12Execute (EX)ReadaddressInstructionmemoryInstruction[31-0]ReadaddressWriteaddressWritedataDatamemoryReaddataMemWriteMemRead1Mux0MemToRegSignextend0Mux1ALUSrcResultZeroALUALUOpI [15 - 0]I [25 - 21]I [20 - 16]I [15 - 11]0Mux1RegDstReadregister 1Readregister 2WriteregisterWritedataReaddata 2Readdata 1RegistersRegWriteThe third step, Execute (EX), computes the effective memory address from the source register and the instruction’s constant field.January 14, 2019 Pipelining 13Memory (MEM)ReadaddressInstructionmemoryInstruction[31-0]ReadaddressWriteaddressWritedataDatamemoryReaddataMemWriteMemRead1Mux0MemToRegSignextend0Mux1ALUSrcResultZeroALUALUOpI [15 - 0]I [25 - 21]I [20 - 16]I [15 - 11]0Mux1RegDstReadregister 1Readregister 2WriteregisterWritedataReaddata 2Readdata 1RegistersRegWriteThe Memory (MEM) step involves reading the data memory, from the address computed by the ALU.January 14, 2019 Pipelining 14Writeback (WB)ReadaddressInstructionmemoryInstruction[31-0]ReadaddressWriteaddressWritedataDatamemoryReaddataMemWriteMemRead1Mux0MemToRegSignextend0Mux1ALUSrcResultZeroALUALUOpI [15 - 0]I [25 - 21]I [20 - 16]I [15 - 11]0Mux1RegDstReadregister 1Readregister 2WriteregisterWritedataReaddata 2Readdata 1RegistersRegWriteFinally, in the Writeback (WB) step, the memory value is stored into the destination register.January 14, 2019 Pipelining 15A bunch of lazy functional unitsNotice that each execution step uses a different functional unit. In other words, the main units are idle for most of the 8ns cycle!—The instruction RAM is used for just 2ns at the start of the cycle.—Registers are read once in ID
View Full Document