Lecture 18: Core Design

• Today: basics of implementing a correct ooo core: register renaming, commit, LSQ, issue queue


The Alpha 21264 Out-of-Order Implementation

[Figure: pipeline block diagram. Recoverable labels and contents follow.]

Blocks shown: Branch prediction and instr fetch; Instr Fetch Queue; Decode & Rename; Reorder Buffer (ROB); Issue Queue (IQ); ALU, ALU, ALU; Register File (P1-P64)

Fetched instructions:
  R1 ← R1+R2
  R2 ← R1+R3
  BEQZ R2
  R3 ← R1+R2
  R1 ← R3+R2

Reorder Buffer (ROB): Instr 1, Instr 2, Instr 3, Instr 4, Instr 5, Instr 6

Issue Queue (IQ), after renaming:
  P33 ← P1+P2
  P34 ← P33+P3
  BEQZ P34
  P35 ← P33+P34
  P36 ← P35+P34

Results written to regfile and tags broadcast to IQ

Speculative Reg Map: R1 → P36, R2 → P34
Committed Reg Map: R1 → P1, R2 → P2


Rename

     Before renaming       After renaming
  A  lr1 ← lr2 + lr3       pr7  ← pr2 + pr3
  B  lr2 ← lr4 + lr5       pr8  ← pr4 + pr5
  C  lr6 ← lr1 + lr3       pr9  ← pr7 + pr3
  D  lr6 ← lr1 + lr2       pr10 ← pr7 + pr8

  Dependences before renaming: RAR lr3, RAW lr1, WAR lr2, WAW lr6
  Dependences after renaming:  RAR pr3, RAW pr7, WAR x, WAW x

  Schedule before renaming: A ; BC ; D
  Schedule after renaming:  AB ; CD


Commit Example

Assume a processor with 6 logical regs and 10 physical regs

     Original              Renamed                Map    Old / New
  A  lr1 ← lr2 + lr3       pr7  ← pr2 + pr3       lr1    pr1 / pr7
  B  lr2 ← lr4 + lr5       pr8  ← pr4 + pr5       lr2    pr2 / pr8
  C  lr6 ← lr1 + lr3       pr9  ← pr7 + pr3       lr6    pr6 / pr9
  D  lr6 ← lr1 + lr2       pr10 ← pr7 + pr8       lr6    pr9 / pr10
  E  lr3 ← lr6 + lr2       pr1  ← pr10 + pr8      lr3    pr3 / pr1
  F  lr4 ← lr3 + lr4       pr2  ← pr1 + pr4       lr4    pr4 / pr2


Out-of-Order Loads/Stores

  Ld  R1 ← [R2]
  Ld  R3 ← [R4]
  St  R5 → [R6]
  Ld  R7 ← [R8]
  Ld  R9 ← [R10]


Memory Dependence Checking

[Figure: sequence of loads and stores with addresses, as extracted: Ld 0x abcdef; Ld; St; Ld; Ld 0x abcdef; St 0x abcd00; Ld 0x abc000; Ld 0x abcd00]

• The issue queue checks for register dependences and executes instructions as soon as registers are ready
• Loads/stores access memory as well; must check for RAW, WAW, and WAR hazards for memory as well
• Hence, first check for register dependences to compute effective addresses; then check for memory dependences


Memory Dependence Checking

• Load and store addresses are maintained in program order in the Load/Store Queue (LSQ)
• Loads can issue if they are guaranteed to not have true dependences with earlier stores
• Stores can issue only if we are ready to modify memory (cannot recover if an earlier instr raises an exception)


The Alpha 21264 Out-of-Order Implementation

[Figure: the same pipeline block diagram, extended with an LSQ, an ALU for address computation, and the D-Cache. Recoverable labels and contents follow.]

Fetched instructions:
  R1 ← R1+R2
  R2 ← R1+R3
  BEQZ R2
  R3 ← R1+R2
  R1 ← R3+R2
  LD R4 ← 8[R3]
  ST R4 → 8[R1]

Reorder Buffer (ROB): Instr 1, Instr 2, Instr 3, Instr 4, Instr 5, Instr 6, Instr 7

Issue Queue (IQ), after renaming:
  P33 ← P1+P2
  P34 ← P33+P3
  BEQZ P34
  P35 ← P33+P34
  P36 ← P35+P34
  P37 ← 8[P35]
  P37 → 8[P36]

LSQ:
  P37 ← [P35 + 8]
  P37 → [P36 + 8]

Results written to regfile and tags broadcast to IQ

Committed Reg Map: R1 → P1, R2 → P2
Speculative Reg Map: R1 → P36, R2 → P34


Speculative Issue

• Instr I1 leaves the issue queue at start of cycle 6; the instr then reads operands from the regfile, wires are traversed, instruction executes, result is available at end of cycle 8
• If operand availability is broadcast to issue queue in cycle 9, dependent instruction leaves in cycle 10
• This causes a 4-cycle gap between successive instrs
• Hence, if we know that the instruction takes a cycle to execute, the operand is broadcast to the issue queue in cycle 6 and the dependent instr leaves issue queue in cycle 7; the input operand is correctly bypassed at the FU


Load Hit Speculation

• The previous optimization assumes that we know the exact latency for every operation
• This is true for all ops except loads (cache hit or miss?)
• Assume hit and schedule accordingly; on a cache miss, must squash all speculatively issued instructions; an instruction therefore sits in the queue until load hits are determined


Register Rename Logic

[Figure: rename logic block diagram. Labels: Map Table; Dependence Check Logic; Mux; Logical Source Regs; Logical Dest Regs; Logical Source Reg; Physical Source Regs; Physical Dest Regs; Free Pool]


Map Table – RAM

[Figure: RAM map table. Labels: Phys reg id (7-bit fields); Num entries = Num logical regs; Shadow copies (shift register)]


Map Table – CAM

[Figure: CAM map table. Labels: Logical reg id (5 bits); 1-bit valid; 1-bit; Num entries = Num phys regs; Shadow copies]


Wakeup Logic

[Figure: wakeup logic. Each issue-window entry holds operand tags tagL and tagR with ready bits rdyL and rdyR; the broadcast tags tag1 … tagIW are compared (=) against each operand tag and the comparator outputs are ORed to set the ready bits]


Selection Logic

[Figure: selection logic over the issue window, built from arbiter cells. Signals: req, grant, anyreq, enable]

• For multiple FUs, will need sequential selectors


Structure Complexities

• Critical structures: register map tables, issue queue, LSQ, register file, register bypass
• Cycle time is heavily influenced by: window size (physical register size), issue width (#FUs)
• Conflict between the desire to increase IPC and clock speed


ILP Limits

Wall 1993
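The renaming in the Commit Example slide can be replayed with a small Python sketch. This is a toy model, not the 21264's rename hardware: the class name Renamer, the function run_commit_example, and the FIFO free list are all made up for illustration. It assumes each logical register lr_i starts out mapped to pr_i, and that whenever the free list is empty the oldest instruction has already completed, so it can commit and release its old mapping.

```python
from collections import deque


class Renamer:
    """Toy rename stage: map table, free list, and in-order commit."""

    def __init__(self, num_logical=6, num_physical=10):
        # Assumption: lr_i starts out mapped to pr_i; the rest are free.
        self.map = {f"lr{i}": f"pr{i}" for i in range(1, num_logical + 1)}
        self.free = deque(
            f"pr{i}" for i in range(num_logical + 1, num_physical + 1)
        )
        self.rob = deque()  # old mapping of each dest, in program order

    def rename(self, dest, src1, src2):
        if not self.free:
            raise RuntimeError("rename stalls: no free physical register")
        # Sources are looked up before the destination is remapped.
        p1, p2 = self.map[src1], self.map[src2]
        new = self.free.popleft()
        self.rob.append(self.map[dest])  # remember the old mapping
        self.map[dest] = new
        return new, p1, p2

    def commit(self):
        # Committing the oldest instruction frees the OLD mapping of its
        # destination; no later instruction can still read that register.
        self.free.append(self.rob.popleft())


def run_commit_example():
    r = Renamer()
    program = [
        ("A", "lr1", "lr2", "lr3"),
        ("B", "lr2", "lr4", "lr5"),
        ("C", "lr6", "lr1", "lr3"),
        ("D", "lr6", "lr1", "lr2"),
        ("E", "lr3", "lr6", "lr2"),
        ("F", "lr4", "lr3", "lr4"),
    ]
    out = []
    for label, dest, s1, s2 in program:
        if not r.free:
            r.commit()  # assumption: the oldest instruction has completed
        out.append((label, *r.rename(dest, s1, s2)))
    return out


for label, d, a, b in run_commit_example():
    print(f"{label}: {d} <- {a} + {b}")
```

Under these assumptions E can only be renamed after A commits and releases pr1, and F only after B commits and releases pr2, which matches the Old / New column of the slide.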
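The load-issue rule from the Memory Dependence Checking slides (a load may issue only if it is guaranteed to have no true dependence on an earlier store) can be sketched as a scan over the LSQ. This is a minimal, conservative sketch: the names LSQEntry and load_can_issue are hypothetical, addresses are compared for exact equality (ignoring access size and partial overlap), and a load that matches an earlier store simply waits. The queue contents below are made up and only loosely reuse the addresses from the slide.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LSQEntry:
    op: str               # "Ld" or "St"
    addr: Optional[int]   # None until the effective address is computed


def load_can_issue(lsq: List[LSQEntry], idx: int) -> bool:
    """Conservative check for the load at position idx (program order)."""
    entry = lsq[idx]
    if entry.op != "Ld":
        raise ValueError("entry is not a load")
    if entry.addr is None:
        return False  # own address not yet computed
    for older in lsq[:idx]:
        if older.op != "St":
            continue  # earlier loads never block a load (RAR is harmless)
        if older.addr is None:
            return False  # unknown store address: cannot rule out a RAW
        if older.addr == entry.addr:
            return False  # true dependence on an earlier store: wait
    return True


# Hypothetical queue contents, oldest first.
lsq = [
    LSQEntry("Ld", 0xABCDEF),
    LSQEntry("Ld", None),
    LSQEntry("St", 0xABCD00),
    LSQEntry("Ld", 0xABC000),
    LSQEntry("Ld", 0xABCD00),
]

for i, e in enumerate(lsq):
    if e.op == "Ld":
        print(i, load_can_issue(lsq, i))
```

With the store's address known, the load to 0xabc000 can bypass it while the load to 0xabcd00 must wait. If the store's address were still unknown, this conservative policy would hold back both later loads.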