CS 152 Computer Architecture and Engineering
Lecture 13 - Out-of-Order Issue, Register Renaming, & Branch Prediction

• Last time in Lecture 12
• Out-of-Order Issue
• Overcoming the Lack of Register Names
• Instruction-level Parallelism via Renaming
• Register Renaming
• Dataflow execution
• Renaming & Out-of-order Issue: An example
• Data-Driven Execution
• Simplifying Allocation/Deallocation
• IBM 360/91 Floating-Point Unit (R. M. Tomasulo, 1967)
• Effectiveness?
• Precise Interrupts
• Effect on Interrupts: Out-of-order Completion
• Exception Handling (In-Order Five-Stage Pipeline)
• Phases of Instruction Execution
• In-Order Commit for Precise Exceptions
• Extensions for Precise Exceptions
• Rollback and Renaming
• Renaming Table
• CS152 Administrivia
• Control Flow Penalty
• MIPS Branches and Jumps
• Branch Penalties in Modern Pipelines
• Reducing Control Flow Penalty
• Branch Prediction
• Static Branch Prediction
• Dynamic Branch Prediction: learning based on past behavior
• Branch Prediction Bits
• Branch History Table
• Acknowledgements

Krste Asanovic
Electrical Engineering and Computer Sciences
University of California at Berkeley
http://www.eecs.berkeley.edu/~krste
http://inst.eecs.berkeley.edu/~cs152

3/12/2009 CS152-Spring’09

Last time in Lecture 12
• Pipelining is complicated by multiple and/or variable latency functional units
• Out-of-order and/or pipelined execution requires tracking of dependencies
  – RAW
  – WAR
  – WAW
• Dynamic issue logic can support out-of-order execution to improve performance
  – Last time, looked at simple scoreboard to track out-of-order completion
• Hardware register renaming can further improve performance by removing hazards.

Out-of-Order Issue
• Issue stage buffer holds multiple instructions waiting to issue.
• Decode adds next instruction to buffer if there is space and the instruction does not cause a WAR or WAW hazard.
  – Note: WAR possible again because issue is out-of-order (WAR not possible with in-order
    issue and latching of input operands at functional unit)
• Any instruction in buffer whose RAW hazards are satisfied can be issued (for now at most one dispatch per cycle). On a write back (WB), new instructions may get enabled.

[Figure: pipeline diagram with stages IF, ID, Issue, functional units ALU, Mem, Fadd, Fmul, and WB]

Overcoming the Lack of Register Names
Floating Point pipelines often cannot be kept filled with small number of registers.
IBM 360 had only 4 floating-point registers
Can a microarchitecture use more registers than specified by the ISA without loss of ISA compatibility?
Robert Tomasulo of IBM suggested an ingenious solution in 1967 using on-the-fly register renaming

Instruction-level Parallelism via Renaming

                              latency
1  LD     F2,  34(R2)         1
2  LD     F4,  45(R3)         long
3  MULTD  F6,  F4, F2         3
4  SUBD   F8,  F2, F2         1
5  DIVD   F4’, F2, F8         4
6  ADDD   F10, F6, F4’        1

In-order:     1 (2,1) . . . . . . 2 3 4 4 3 5 . . . 5 6 6
Out-of-order: 1 (2,1) 4 4 5 . . . 2 (3,5) 3 6 6

[Figure: graph of instructions 1 to 6, with an X marking one edge]

Any antidependence can be eliminated by renaming. (renaming requires additional storage)
Can it be done in hardware? yes!

Register Renaming
• Decode does register renaming and adds instructions to the issue stage reorder buffer (ROB)
  – renaming makes WAR or WAW hazards impossible
• Any instruction in ROB whose RAW hazards have been satisfied can be dispatched.
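As a rough illustration of how renaming removes the antidependence in the six-instruction example above, the following Python sketch gives every destination a fresh tag. The function name `rename`, the tag names t1, t2, ..., the unbounded tag supply, and the tuple encoding of instructions (load offsets omitted) are all assumptions made for this sketch; they are not taken from the lecture.

```python
from itertools import count

def rename(program):
    """Give each destination a fresh tag; sources read the current mapping.

    `program` is a list of (op, dest, src1, src2) tuples using ISA register
    names. This encoding is hypothetical and chosen only for the sketch.
    """
    fresh = (f"t{i}" for i in count(1))  # unbounded tag supply (assumption)
    table = {}                           # ISA register -> most recent tag
    renamed = []
    for op, dest, src1, src2 in program:
        # Look up sources BEFORE remapping the destination, so an instruction
        # that reads and writes the same register still reads the old value.
        s1 = table.get(src1, src1)
        s2 = table.get(src2, src2)
        tag = next(fresh)
        table[dest] = tag
        renamed.append((op, tag, s1, s2))
    return renamed

# The example from the slide; base registers R2 and R3 are never renamed here.
program = [
    ("LD",    "F2",  "R2", None),
    ("LD",    "F4",  "R3", None),
    ("MULTD", "F6",  "F4", "F2"),
    ("SUBD",  "F8",  "F2", "F2"),
    ("DIVD",  "F4",  "F2", "F8"),
    ("ADDD",  "F10", "F6", "F4"),
]
for ins in rename(program):
    print(ins)
```

In the renamed program, instruction 5 writes t5 while instruction 3 still reads t2 (the old F4), so the WAR hazard between them is gone and only true RAW dependences remain.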
Dispatching any ROB instruction whose RAW hazards are satisfied gives out-of-order or dataflow execution.

[Figure: pipeline diagram with stages IF, ID, Issue, functional units ALU, Mem, Fadd, Fmul, and WB]

Dataflow execution
Instruction slot is candidate for execution when:
• It holds a valid instruction (“use” bit is set)
• It has not already started execution (“exec” bit is clear)
• Both operands are available (p1 and p2 are set)

[Figure: reorder buffer with slots t1, t2, ..., tn and columns Ins#, use, exec, op, p1, src1, p2, src2; ptr2 points to the next slot to deallocate, ptr1 to the next available slot]

Renaming & Out-of-order Issue
An example

1 LD    F2,  34(R2)
2 LD    F4,  45(R3)
3 MULTD F6,  F4, F2
4 SUBD  F8,  F2, F2
5 DIVD  F4,  F2, F8
6 ADDD  F10, F6, F4

• When are tags in sources replaced by data?
  Whenever an FU produces data
• When can a name be reused?
  Whenever an instruction completes

[Figure: renaming table (registers F1 to F8, each holding data or a tag ti plus presence bit p) and reorder buffer (slots t1 to t5; columns Ins#, use, exec, op, p1, src1, p2, src2), stepped through the example]

Data-Driven Execution

[Figure: renaming table & reg file feeding the reorder buffer (slots t1, t2, ..., tn; columns Ins#, use, exec, op, p1, src1, p2, src2), which dispatches to a Load Unit, two FUs, and a Store Unit; results return as < t, result >]

• Instruction template (i.e., tag t) is allocated by the Decode stage, which also associates tag with register in regfile
• When an instruction completes, its tag is deallocated

Replacing the tag by its value is an expensive operation

Simplifying Allocation/Deallocation
Instruction buffer is managed circularly
• “exec” bit is set when instruction begins execution
• When an instruction completes its “use” bit is marked free
• ptr2 is incremented only if the “use” bit is marked free

[Figure: reorder buffer with slots t1, t2, ..., tn and columns Ins#, use, exec, op, p1, src1, p2, src2; ptr2 points to the next slot to deallocate, ptr1 to the next available slot]

IBM 360/91 Floating-Point Unit
R. M. Tomasulo, 1967

[Figure: instructions are distributed as instruction templates by functional units; load buffers 1 to 6 (from memory), Floating Point Reg 1 to 4, Adder 1 to 3, Mult 1 to 2, and store buffers (to memory); every entry holds p and tag/data fields; results return on the common bus as < tag, result >]

Common bus ensures that data is made available immediately to all the instructions waiting for it.
Match tag, if equal, copy value & set presence “p”.

Effectiveness?
Renaming and out-of-order execution was first implemented in 1969 in IBM 360/91 but did not show up in the subsequent models until mid-Nineties.
Why?
Reasons
1. Effective on a very small class of programs
2. Memory latency a much bigger problem
3. Exceptions not precise!
One more problem
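The tag-matching step on the common bus described for the IBM 360/91 above ("match tag, if equal, copy value & set presence p") can be sketched in a few lines of Python. This is a minimal model under stated assumptions: the `Operand` class, the `broadcast` and `ready` functions, and the example tags and values are hypothetical names introduced here, and the sketch ignores timing and bus arbitration.

```python
from dataclasses import dataclass

@dataclass
class Operand:
    p: bool        # presence bit: True once `value` holds real data
    value: object  # a tag (str) while p is False, the data once p is True

def broadcast(slots, tag, result):
    """Deliver <tag, result> to every waiting operand, as on a common bus."""
    for slot in slots:
        for operand in slot:
            # Match tag; if equal, copy value and set the presence bit.
            if not operand.p and operand.value == tag:
                operand.value = result
                operand.p = True

def ready(slot):
    """A slot may begin execution once both operands are present."""
    return all(operand.p for operand in slot)

# Two waiting instructions, each a pair of source operands (hypothetical state).
mul = [Operand(False, "t2"), Operand(True, 1.5)]
div = [Operand(True, 1.5), Operand(False, "t4")]
slots = [mul, div]

broadcast(slots, "t2", 4.0)
print(ready(mul), ready(div))  # prints: True False
```

Checking the presence bit before comparing tags keeps a data value from being mistaken for a tag. Every waiting operand is compared against each broadcast, which hints at why the slides call replacing a tag by its value an expensive operation.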