Example from section
Pipelining lessons
Pipelining not just Multiprocessing
Pipelining MIPS
Instruction Fetch (IF)
Instruction Decode (ID)
Execute (EX)
Memory (MEM)
Writeback (WB)
Decoding and fetching together
Executing, decoding and fetching
Break datapath into 5 stages
Pipelining Performance
Pipeline Datapath: Resource Requirements
Pipelining other instruction types
Important Observation
A solution: Insert NOP stages
Summary

Example from section

Assembling a sandwich:
- ORD (8 seconds)
- TOS (0 or 10 seconds)
- ADD (0 to 10 seconds)
- PAY (5 seconds)

A single sandwich takes between 13 and 33 seconds.

We can assemble sandwiches every 10 seconds with pipelining:

Time (s):    0-10  10-20  20-30  30-40  40-50  50-60
Sandwich 1:  ORD   TOS    ADD    PAY
Sandwich 2:        ORD    TOS    ADD    PAY
Sandwich 3:               ORD    TOS    ADD    PAY

Pipelining lessons

Pipelining can increase throughput (#sandwiches per hour), but…
1. Every sandwich must use all stages
   - prevents clashes in the pipeline
2. Every stage must take the same amount of time
   - limited by the slowest stage (in this example, 10 seconds)

These two factors increase the latency (time per sandwich): a pipelined sandwich always takes 4 × 10 = 40 seconds, compared with 13 to 33 seconds without pipelining.

For an optimal k-stage pipeline:
1. every stage does useful work
2. stage lengths are balanced

Under these conditions, we nearly achieve the optimal speedup: k
- "nearly" because there is still the fill and drain time

Pipelining not just Multiprocessing

Pipelining does involve parallel processing, but in a specific way.

Both multiprocessing and pipelining relate to the processing of multiple "things" using multiple "functional units":
- In multiprocessing, each thing is processed entirely by a single functional unit
  • e.g. multiple lanes at the supermarket
- In pipelining, each thing is broken into a sequence of pieces, where each piece is handled by a different (specialized) functional unit
  • e.g. checker vs. bagger

Pipelining and multiprocessing are not mutually exclusive:
- Modern processors do both, with multiple pipelines (e.g. superscalar)

Pipelining is a general-purpose efficiency technique; it is used elsewhere in CS:
- Networking, I/O devices, server software architecture

Pipelining MIPS

Executing a MIPS instruction can take up to five stages.

Step name           Abbr.  Description
Instruction Fetch   IF     Read an instruction from memory
Instruction Decode  ID     Read source registers and generate control signals
Execute             EX     Compute an R-type result or a branch outcome
Memory              MEM    Read or write the data memory
Writeback           WB     Store a result in the destination register

Not all instructions need all five stages…

Instruction  Steps required
beq          IF ID EX
R-type       IF ID EX WB
sw           IF ID EX MEM
lw           IF ID EX MEM WB

…but a single-cycle datapath must accommodate all 5 stages in one clock cycle.

Instruction Fetch (IF)

[Figure: single-cycle MIPS datapath (instruction memory, register file, ALU, data memory)]

While IF is executing, the rest of the datapath is sitting idle…

Instruction Decode (ID)

[Figure: single-cycle MIPS datapath]

Then while ID is executing, the IF-related portion becomes idle…

Execute (EX)

[Figure: single-cycle MIPS datapath]

…and so on for the EX portion…

Memory (MEM)

[Figure: single-cycle MIPS datapath]

…the MEM portion…

Writeback (WB)

[Figure: single-cycle MIPS datapath]

…and the WB portion.
- What about the "clash" with the ID stage over the register file?

Answer: The register file is written on the positive clock edge, but read later in the clock cycle. Hence, there is no clash.

Decoding and fetching together

Why don't we go ahead and fetch the next instruction while we're decoding the first one?

[Figure: single-cycle MIPS datapath, annotated "Fetch 2nd" and "Decode 1st instruction"]

Executing, decoding and fetching

Similarly, once the first instruction enters its Execute stage, we can go ahead and decode the second instruction.

But now the instruction memory is free again, so we can fetch the third instruction!

[Figure: single-cycle MIPS datapath, annotated "Fetch 3rd", "Decode 2nd" and "Execute 1st"]

Break datapath into 5 stages

Each stage has its own functional units.

When the pipeline is full, the datapath is simultaneously working on 5 instructions!

[Figure: datapath divided into stages labeled IF, ID, EX, MEM and WB, holding instructions from newest to oldest]

Pipelining Performance

Execution time on an ideal pipeline:
- time to fill the pipeline + one cycle per instruction
- How long for N instructions? k - 1 + N cycles, where k = pipeline depth

Alternate way of arriving at this formula: k cycles for the first instruction, plus 1 for each of the remaining N - 1, giving k + (N - 1) = k - 1 + N.
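The cycle count above can be checked with a short sketch. The function names are hypothetical, and the speedup figure assumes the unpipelined baseline spends k stage-times on every instruction (every item uses all stages, and the stages are balanced), which is the ideal case described in the pipelining lessons rather than a measurement of any real machine.

```python
def pipeline_cycles(n_instructions: int, depth: int) -> int:
    """Cycles for N instructions on an ideal k-stage pipeline: k - 1 + N."""
    if n_instructions < 1 or depth < 1:
        raise ValueError("need at least one instruction and one stage")
    # k - 1 cycles to fill the pipeline, then one instruction completes per cycle.
    return depth - 1 + n_instructions


def unpipelined_cycles(n_instructions: int, depth: int) -> int:
    """Assumed baseline: each instruction occupies all k stages back to back."""
    return n_instructions * depth


def speedup(n_instructions: int, depth: int) -> float:
    """Ratio of the assumed unpipelined time to the ideal pipelined time."""
    return unpipelined_cycles(n_instructions, depth) / pipeline_cycles(
        n_instructions, depth
    )


if __name__ == "__main__":
    # Sandwich example: 3 sandwiches, 4 stages of 10 seconds each.
    print(pipeline_cycles(3, 4) * 10, "seconds for 3 sandwiches")
    # The speedup approaches k = 5 as N grows, but never quite reaches it.
    for n in (1, 10, 100, 10_000):
        print(n, round(speedup(n, 5), 3))
```

With 3 sandwiches and 4 stages the sketch gives 6 cycles, or 60 seconds at 10 seconds per stage, which matches the sandwich timeline. For a fixed depth the speedup stays below k because of the k - 1 fill cycles.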