CMSC 611: Advanced Computer Architecture
Pipelining

Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides
Some material adapted from Hennessy & Patterson / © 2003 Elsevier Science

Sequential Laundry
• Washer takes 30 min, dryer takes 40 min, folding takes 20 min
• Sequential laundry takes 6 hours for 4 loads
• If they learned pipelining, how long would laundry take?
[Figure (slide: Dave Patterson): loads A, B, C, D in task order against time from 6 PM to midnight; each load runs 30 + 40 + 20 min, one after another]

Pipelined Laundry
• Pipelining means start work as soon as possible
• Pipelined laundry takes 3.5 hours for 4 loads
[Figure (slide: Dave Patterson): loads A, B, C, D in task order against time from 6 PM to midnight; overlapped timeline with segments of 30, 40, 40, 40, 40, 20 min]

Pipelining Lessons
• Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload
• Pipeline rate is limited by the slowest pipeline stage
• Multiple tasks operate simultaneously using different resources
• Potential speedup = number of pipe stages
• Unbalanced lengths of pipe stages reduce speedup
• Time to "fill" the pipeline and time to "drain" it reduce speedup
• Stall for dependencies
[Figure (slide: Dave Patterson): pipelined loads A, B, C, D against time from 6 PM to 9 PM; segments of 30, 40, 40, 40, 40, 20 min]

MIPS Instruction Set
• RISC is characterized by the following features that simplify implementation:
– All ALU operations apply only to registers
– Memory is affected only by load and store
– Instructions follow very few formats and typically are of the same size
Instruction layouts (bit 31 on the left, bit 0 on the right):
R-type: op (6 bits) | rs (5 bits) | rt (5 bits) | rd (5 bits) | shamt (5 bits) | funct (6 bits)
I-type: op (6 bits) | rs (5 bits) | rt (5 bits) | immediate (16 bits)
J-type: op (6 bits) | target address (26 bits)

MIPS Instruction Formats
• R-type (register)
– Most operations
  • add $t1, $s3, $s4 # $t1 = $s3 + $s4
– rd, rs, rt are all registers
– op is always 0; funct gives the actual function

MIPS Instruction Formats
• I-type (immediate)
– ALU with one immediate operand
  • addi $t1, $s2, 32 # $t1 = $s2 + 32
– Load, store within ±2^15 bytes of a register
  • lw $t0, 32($s2) # $t0 = Mem[$s2 + 32], i.e. *(32 + $s2)
– Load immediate values
  • lui $t0, 255 # $t0 = (255 << 16)
  • li $t0, 255 # $t0 = 255 (pseudo-instruction)

MIPS Instruction Formats
• I-type (immediate)
– PC-relative conditional branch
– ±2^15 instructions from the PC after the branch instruction
  • beq $s1, $s2, L1 # goto L1 if ($s1 == $s2)
  • bne $s1, $s2, L1 # goto L1 if ($s1 != $s2)

MIPS Instruction Formats
• J-type (jump)
– Unconditional jump
  • j L1 # goto L1
– Address is concatenated to the top bits of the PC
  • Fixed addressing within 2^26 instructions

Single-cycle Execution
[Figure: Dave Patterson]

Multi-Cycle Implementation of MIPS
1. Instruction fetch cycle (IF)
   IR ← Mem[PC]; NPC ← PC + 4
2. Instruction decode/register fetch cycle (ID)
   A ← Regs[IR6..10]; B ← Regs[IR11..15]; Imm ← ((IR16)^16 ## IR16..31)
   (In this notation IR bit 0 is the most significant bit, so IR6..10 is rs and IR11..15 is rt.)
3. Execution/effective address cycle (EX)
   Memory ref: ALUOutput ← A + Imm;
   Reg-Reg ALU: ALUOutput ← A func B;
   Reg-Imm ALU: ALUOutput ← A op Imm;
   Branch: ALUOutput ← NPC + Imm; Cond ← (A op 0)
4. Memory access/branch completion cycle (MEM)
   Memory ref: LMD ← Mem[ALUOutput] or Mem[ALUOutput] ← B;
   Branch: if (cond) PC ← ALUOutput;
5. Write-back cycle (WB)
   Reg-Reg ALU: Regs[IR16..20] ← ALUOutput;
   Reg-Imm ALU: Regs[IR11..15] ← ALUOutput;
   Load: Regs[IR11..15] ← LMD;
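
The five cycles above can be sketched as a small Python interpreter. This is a minimal sketch under stated assumptions, not the course's implementation: the helper names `fields`, `sign_extend16`, and `step`, the dictionary-based instruction and data memories, and the restriction to add, addi, lw, sw, and beq are all chosen here for illustration. It uses conventional bit numbering (bit 31 most significant). For beq it follows the MIPS convention of comparing two registers and scaling the word offset by 4, whereas the register-transfer description above writes NPC + Imm and a comparison against 0.

```python
# Sketch of the IF/ID/EX/MEM/WB steps for a handful of MIPS instructions.
# All helper names and the dict-based memories are illustrative assumptions.

OP_RTYPE, OP_BEQ, OP_ADDI, OP_LW, OP_SW = 0x00, 0x04, 0x08, 0x23, 0x2B
FUNCT_ADD = 0x20
MASK32 = 0xFFFFFFFF


def sign_extend16(imm):
    """Sign-extend a 16-bit field to a Python int."""
    return imm - 0x10000 if imm & 0x8000 else imm


def fields(ir):
    """Split a 32-bit instruction word into its R/I/J-type fields."""
    return {
        "op": (ir >> 26) & 0x3F,
        "rs": (ir >> 21) & 0x1F,
        "rt": (ir >> 16) & 0x1F,
        "rd": (ir >> 11) & 0x1F,
        "shamt": (ir >> 6) & 0x1F,
        "funct": ir & 0x3F,
        "imm": sign_extend16(ir & 0xFFFF),
        "target": ir & 0x03FFFFFF,
    }


def step(pc, regs, imem, dmem):
    """Run one instruction through the five steps; return the next PC."""
    # IF: fetch the instruction and compute the sequential next PC.
    ir = imem[pc]
    npc = pc + 4

    # ID: decode and read both source registers.
    f = fields(ir)
    op, a, b, imm = f["op"], regs[f["rs"]], regs[f["rt"]], f["imm"]

    # EX: ALU operation, effective address, or branch target.
    cond = False
    if op == OP_RTYPE and f["funct"] == FUNCT_ADD:
        alu_out = (a + b) & MASK32
    elif op in (OP_ADDI, OP_LW, OP_SW):
        alu_out = (a + imm) & MASK32
    elif op == OP_BEQ:
        alu_out = npc + (imm << 2)  # assumption: word offset scaled by 4
        cond = a == b               # assumption: beq compares rs with rt
    else:
        raise NotImplementedError(f"unsupported instruction {ir:#010x}")

    # MEM: load, store, or complete a taken branch.
    lmd = None
    if op == OP_LW:
        lmd = dmem[alu_out]
    elif op == OP_SW:
        dmem[alu_out] = b
    elif cond:
        npc = alu_out

    # WB: write the result to rd (R-type) or rt (immediate ALU, load).
    if op == OP_RTYPE:
        regs[f["rd"]] = alu_out
    elif op == OP_ADDI:
        regs[f["rt"]] = alu_out
    elif op == OP_LW:
        regs[f["rt"]] = lmd
    regs[0] = 0  # register $0 always reads as zero
    return npc
```

For example, with `$s2` (register 18) holding 100 and `dmem = {132: 7}`, stepping `addi $t1, $s2, 32` (encoded as 0x22490020) writes 132 to `$t1`, and stepping `lw $t0, 32($s2)` (0x8E480020) then loads 7 into `$t0`.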
" $ %Multi-cycle ExecutionMulti-cycle Execution#Figure: Dave PattersonStages of InstructionStages of InstructionExecutionExecution• The load instruction is the longest• All instructions follows at most the following five steps:– Ifetch: Instruction Fetch• Fetch the instruction from the Instruction Memory and update PC– Reg/Dec: Registers Fetch and Instruction Decode– Exec: Calculate the memory address– Mem: Read the data from the Data Memory– WB: Write the data back to the register fileCycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5Ifetch Reg/Dec Exec Mem WBLoadSlide: Dave PattersonIFetch Dec Exec Mem WBIFetch Dec Exec Mem WBIFetch Dec Exec Mem WBIFetch Dec Exec Mem WBIFetch Dec Exec Mem WBIFetch Dec Exec Mem WBProgram FlowTime Pipelining improves performance by increasing instruction throughputInstruction PipeliningInstruction Pipelining• Start handling next instruction while the currentinstruction is in progress• Feasible when different devices at different stages ! Time between instructionspipelined=Time between instructionsnonpipelinedNumber of pipe stagesIdeal and upper bound for speedup is number of stages in the pipelineInstructionfetchReg ALUDataaccessReg8 nsInstructionfetchReg ALUDataaccessReg8 nsInstructionfetch8 nsTimelw $1, 100($0)lw $2, 200($0)lw $3, 300($0)2 4 6 8 10 12 14 16 182 4 6 8 10 1214...Programexecutionorder(in instructions)InstructionfetchReg ALUDataaccessRegTimelw $1, 100($0)lw $2, 200($0)lw $3, 300($0)2 nsInstructionfetchReg ALUDataaccessReg2 nsInstructionfetchReg ALUDataaccessReg2 ns 2 ns 2 ns 2 ns 2 nsProgramexecutionorder(in instructions)Time between first& fourthinstructions is 3 ! 2= 6 nsTime between first& fourthinstructions is 3 ! 
8= 24 nsExample of InstructionExample of InstructionPipeliningPipeliningSingle CycleSingle CycleClkLoad Store WasteCycle 1 Cycle 2Figure: Dave Patterson• Cycle time long enough for longest instruction• Shorter instructions waste time• No overlapMultiple CycleMultiple CycleFigure: Dave Patterson• Cycle time long enough for longest stage• Shorter stages waste time• Shorter instructions can take fewer cycles• No overlapCycle 1Ifetch Reg Exec Mem WrCycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 Cycle 10Ifetch Reg Exec MemLoad StoreIfetchR-typeClkPipelinePipelineFigure: Dave Patterson• Cycle time long enough for longest stage• Shorter stages waste time• No additional benefit from shorter instructions• Overlap instruction executionCycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 Cycle 10ClkLoad Ifetch Reg Exec Mem WrIfetch Reg Exec Mem WrStoreIfetch Reg Exec Mem WrR-typePipeline PerformancePipeline Performance• Pipeline increases the instruction throughput– not execution time of an individual instruction• An individual instruction can be slower:– Additional pipeline
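
The laundry and lw timings quoted in these slides can be checked with a few lines of Python. This is a rough timing model, not something from the slides: the function names are hypothetical, and it assumes identical tasks, no stalls, and (for the lw case) a clocked pipeline in which every one of the five stages takes the full 2 ns cycle.

```python
# Timing model for the laundry and lw examples; function names are illustrative.

def sequential_time(stage_times, n_tasks):
    """Total time when each task finishes before the next one starts."""
    return n_tasks * sum(stage_times)


def pipelined_time(stage_times, n_tasks):
    """Total time when tasks overlap and the slowest stage sets the rate."""
    first_task = sum(stage_times)  # the first task passes through every stage
    return first_task + (n_tasks - 1) * max(stage_times)


def lw_speedup(n_instr, single_cycle_ns=8, stage_ns=2, n_stages=5):
    """Speedup of the clocked five-stage pipeline over 8 ns single-cycle execution."""
    nonpipelined = n_instr * single_cycle_ns
    pipelined = (n_stages + n_instr - 1) * stage_ns  # fill time plus one cycle per instruction
    return nonpipelined / pipelined


laundry = [30, 40, 20]                    # washer, dryer, folding (minutes)
print(sequential_time(laundry, 4) / 60)   # 6.0 hours
print(pipelined_time(laundry, 4) / 60)    # 3.5 hours
print(lw_speedup(3))                      # three loads: 24 ns versus 14 ns
print(lw_speedup(1_000_000))              # approaches 8 / 2 = 4, not 5
```

Under these assumptions the long-run speedup approaches 4 rather than the ideal 5, because the clock must run at the 2 ns of the slowest stage. A single instruction then takes 5 × 2 = 10 ns in the pipeline instead of 8 ns, which is one way an individual instruction can be slower.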


UMBC CMSC 611 - Pipelining