Slide 1: Lecture 18: Core Design

• Today: basics of implementing a correct ooo core: register renaming, commit, LSQ, issue queue

Slide 2: The Alpha 21264 Out-of-Order Implementation

[Figure: pipeline diagram. Blocks: Branch prediction and instr fetch; Instr Fetch Queue; Decode & Rename; Reorder Buffer (ROB); Issue Queue (IQ); ALU, ALU, ALU; Register File P1-P64. Caption: Results written to regfile and tags broadcast to IQ]

Fetched instrs      Renamed instrs in Issue Queue (IQ)
R1 ← R1+R2          P33 ← P1+P2
R2 ← R1+R3          P34 ← P33+P3
BEQZ R2             BEQZ P34
R3 ← R1+R2          P35 ← P33+P34
R1 ← R3+R2          P36 ← P35+P34

Reorder Buffer (ROB): Instr 1, Instr 2, Instr 3, Instr 4, Instr 5, Instr 6
Speculative Reg Map: R1 → P36, R2 → P34
Committed Reg Map: R1 → P1, R2 → P2

Slide 3: Rename

Before renaming             After renaming
A  lr1 ← lr2 + lr3          pr7 ← pr2 + pr3
B  lr2 ← lr4 + lr5          pr8 ← pr4 + pr5
C  lr6 ← lr1 + lr3          pr9 ← pr7 + pr3
D  lr6 ← lr1 + lr2          pr10 ← pr7 + pr8

RAR lr3                     RAR pr3
RAW lr1                     RAW pr7
WAR lr2                     WAR x
WAW lr6                     WAW x

Schedule: A ; BC ; D        Schedule: AB ; CD

Slide 4: Commit Example

Assume a processor with 6 logical regs and 10 physical regs

Original                    Renamed                Map  Old / New
A  lr1 ← lr2 + lr3          pr7 ← pr2 + pr3        lr1  pr1 / pr7
B  lr2 ← lr4 + lr5          pr8 ← pr4 + pr5        lr2  pr2 / pr8
C  lr6 ← lr1 + lr3          pr9 ← pr7 + pr3        lr6  pr6 / pr9
D  lr6 ← lr1 + lr2          pr10 ← pr7 + pr8       lr6  pr9 / pr10
E  lr3 ← lr6 + lr2          pr1 ← pr10 + pr8       lr3  pr3 / pr1
F  lr4 ← lr3 + lr4          pr2 ← pr1 + pr4        lr4  pr4 / pr2

Slide 5: Out-of-Order Loads/Stores

Ld  R1 ← [R2]
Ld  R3 ← [R4]
St  R5 → [R6]
Ld  R7 ← [R8]
Ld  R9 ← [R10]

Slide 6: Memory Dependence Checking

Ld  0xabcdef
Ld  0xabcdef
St  0xabcd00
Ld  0xabc000
Ld  0xabcd00

• The issue queue checks for register dependences and executes instructions as soon as registers are ready
• Loads/stores access memory as well – must check for RAW, WAW, and WAR hazards for memory as well
• Hence, first check for register dependences to compute effective addresses; then check for memory dependences

Slide 7: Memory Dependence Checking

Ld  0xabcdef
Ld  0xabcdef
St  0xabcd00
Ld  0xabc000
Ld  0xabcd00

• Load and store addresses are maintained in program order in the Load/Store Queue (LSQ)
• Loads can issue if they are guaranteed to not have true dependences with earlier stores
• Stores can issue only if we are ready to modify memory (can not recover if an earlier instr raises an exception)

Slide 8: The Alpha 21264 Out-of-Order Implementation

[Figure: pipeline diagram. Blocks: Branch prediction and instr fetch; Instr Fetch Queue; Decode & Rename; Reorder Buffer (ROB); Issue Queue (IQ); ALU, ALU, ALU; Register File P1-P64; LSQ; ALU; D-Cache. Caption: Results written to regfile and tags broadcast to IQ]

Fetched instrs      Renamed instrs in Issue Queue (IQ)
R1 ← R1+R2          P33 ← P1+P2
R2 ← R1+R3          P34 ← P33+P3
BEQZ R2             BEQZ P34
R3 ← R1+R2          P35 ← P33+P34
R1 ← R3+R2          P36 ← P35+P34
LD R4 ← 8[R3]       P37 ← 8[P35]
ST R4 → 8[R1]       P37 → 8[P36]

Reorder Buffer (ROB): Instr 1, Instr 2, Instr 3, Instr 4, Instr 5, Instr 6, Instr 7
LSQ: P37 ← [P35 + 8]
     P37 → [P36 + 8]
Committed Reg Map: R1 → P1, R2 → P2
Speculative Reg Map: R1 → P36, R2 → P34

Slide 9: Speculative Issue

• Instr I1 leaves the issue queue at start of cycle 6; the instr then reads operands from the regfile, wires are traversed, instruction executes, result is available at end of cycle 8
• If operand availability is broadcast to issue queue in cycle 9, dependent instruction leaves in cycle 10
• This causes a 4-cycle gap between successive instrs
• Hence, if we know that the instruction takes a cycle to execute, the operand is broadcast to the issue queue in cycle 6 and the dependent instr leaves issue queue in cycle 7; the input operand is correctly bypassed at the FU

Slide 10: Load Hit Speculation

• The previous optimization assumes that we know the exact latency for every operation
• This is true for all ops except loads (cache hit or miss?)
• Assume hit and schedule accordingly; on a cache miss, must squash all speculatively issued instructions; an instruction therefore sits in the queue until load hits are determined

Slide 11: Register Rename Logic

[Figure labels: Map Table; Dependence Check Logic; Mux; Logical Source Regs; Logical Dest Regs; Logical Source Reg; Physical Source Regs; Physical Dest Regs; Free Pool]

Slide 12: Map Table – RAM

[Figure labels: Phys reg id; Num entries = Num logical regs; Shadow copies (shift register); entries of 7-bits each]

Slide 13: Map Table – CAM

[Figure labels: Logical reg id; Num entries = Num phys regs; Shadow copies; 5-bits; 1-bit valid; 1-bit]

Slide 14: Wakeup Logic

[Figure labels: rdyL, tagL, tagR, rdyR for each issue queue entry; comparators (=) and OR gates; broadcast tags tag1 … tagIW]

Slide 15: Selection Logic

[Figure labels: Issue window; req; grant; anyreq; enable; Arbiter cell]

• For multiple FUs, will need sequential selectors

Slide 16: Structure Complexities

• Critical structures: register map tables, issue queue, LSQ, register file, register bypass
• Cycle time is heavily influenced by: window size (physical register size), issue width (#FUs)
• Conflict between the desire to increase IPC and clock speed

Slide 17: ILP Limits Wall 1993
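
The Rename and Commit Example slides can be traced with a small simulation. What follows is a minimal sketch, not the Alpha 21264 mechanism: the class name Renamer, its rename and commit methods, the FIFO free list, and the initial mapping of lrN to prN are all assumptions made for illustration. It also assumes that A commits before E is renamed and B commits before F is renamed, since with 6 logical and 10 physical registers only four physical registers are free at the start.

```python
from collections import deque


class Renamer:
    """Toy rename stage: a speculative map table plus a FIFO free list (hypothetical)."""

    def __init__(self, num_logical, num_physical):
        # Assumption: at reset lrN maps to prN; the remaining physical regs are free.
        self.map = {f"lr{i}": f"pr{i}" for i in range(1, num_logical + 1)}
        self.free = deque(f"pr{i}" for i in range(num_logical + 1, num_physical + 1))
        self.rob = deque()  # in-flight (dest, old phys, new phys), in program order
        self.log = []       # every (dest, old phys, new phys), kept for inspection

    def rename(self, dest, src1, src2):
        if not self.free:
            raise RuntimeError("no free physical register: rename must stall")
        # Look up sources before updating the map, so "lr4 <- lr3 + lr4"
        # reads the old mapping of lr4.
        p1, p2 = self.map[src1], self.map[src2]
        old, new = self.map[dest], self.free.popleft()
        self.map[dest] = new
        self.rob.append((dest, old, new))
        self.log.append((dest, old, new))
        return f"{new} <- {p1} + {p2}"

    def commit(self):
        # Committing the oldest instruction releases the OLD mapping of its
        # destination: no later instruction can still name that register.
        dest, old, new = self.rob.popleft()
        self.free.append(old)
        return old


program = [
    ("lr1", "lr2", "lr3"),  # A
    ("lr2", "lr4", "lr5"),  # B
    ("lr6", "lr1", "lr3"),  # C
    ("lr6", "lr1", "lr2"),  # D
    ("lr3", "lr6", "lr2"),  # E
    ("lr4", "lr3", "lr4"),  # F
]

r = Renamer(num_logical=6, num_physical=10)
renamed = [r.rename(*ins) for ins in program[:4]]  # A-D consume pr7..pr10
r.commit()                                         # A commits, freeing pr1
renamed.append(r.rename(*program[4]))              # E takes pr1
r.commit()                                         # B commits, freeing pr2
renamed.append(r.rename(*program[5]))              # F takes pr2

for line in renamed:
    print(line)
```

Under these assumptions the printed instructions and the (old, new) pairs in r.log line up with the renamed column and the Map Old / New column of the Commit Example slide. Calling rename a fifth time before any commit raises the stall error, which is the situation the slide's register counts are chosen to expose.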


U of U CS 7810 - Core Design
