COMP 206: Computer Architecture and Implementation
Montek Singh
Wed, Oct 19, 2005
Topic: Instruction-Level Parallelism (Multiple-Issue, Speculation)

Slide titles
  Outline
  Multiple Issue
  A Superscalar Version of MIPS
  Performance of Static Superscalar
  Multiple-Issue with Dynamic Scheduling
  Performance of Dynamic Superscalar
  Limits of Superscalar
  Very Long Instruction Word (VLIW)
  Loop Unrolling in VLIW
  Limits to Multi-Issue Machines
  Hardware Support for More ILP
  Slide 13
  Hardware Support for More ILP
  Four Steps of Speculative Tomasulo Algorithm
  Result Shift Register and Reorder Buffer
  Step I: I-O Initiation, O-O Termination (RSRa)
  Step II: I-O Initiation, I-O Termination (RSRb)
  Step III: Use Re-order Buffer (ROB)
  ROB: I-O Initiation, I-O Termination
  States of Circular Buffer
  Complexity of ROB


Outline

- Multiple-Issue Architectures
  - Superscalar processors
  - VLIW (very long instruction word) processors
- Scheduling
  - Statically scheduled (using compiler techniques)
  - Dynamically scheduled (using variants of Tomasulo's algorithm)
- Reading: HP3, Sections 3.6-3.7


Multiple Issue

- Eliminating data and control stalls can achieve a CPI of 1
- Can we decrease CPI below 1?
  - Not if we issue only one instruction per clock cycle
- Multiple-issue processors allow multiple instructions to issue in a clock cycle
  - Superscalar: issue varying numbers of instructions per clock (dynamic issue)
    - Statically scheduled by compiler
    - Dynamically scheduled by hardware
  - VLIW: issue fixed number of instructions per clock (static issue)
    - Statically scheduled by compiler
- Examples
  - Superscalar: IBM PowerPC, Sun SuperSPARC, DEC Alpha, HP 8000
  - VLIW: Intel/HP Itanium


A Superscalar Version of MIPS

- Two instructions can be issued per clock cycle
  - One can be a load/store/branch/integer operation
  - The other can be any FP operation
- Need to fetch and decode 64 bits per cycle
  - Instructions paired and aligned on a 64-bit boundary
  - Integer instruction appears first
- Dynamic issue
  - First instruction issues if independent and satisfies other criteria
  - Second instruction issues only if the first one does, and is independent and satisfies similar criteria
- Limitation
  - One-cycle delay for loads and branches now turns into a three-instruction delay!
  - … because instructions are now squeezed closer together


Performance of Static Superscalar

Original loop:

    LOOP: LD    F0, 0(R1)
          ADDD  F4, F0, F2
          SD    0(R1), F4
          SUBI  R1, R1, 8
          BNEZ  R1, LOOP

Unrolled and scheduled for the two-issue machine (one row per clock cycle):

    Cycle  Integer instruction        FP instruction
    1      LOOP: LD   F0, 0(R1)
    2            LD   F6, -8(R1)
    3            LD   F10, -16(R1)    ADDD F4, F0, F2
    4            LD   F14, -24(R1)    ADDD F8, F6, F2
    5            LD   F18, -32(R1)    ADDD F12, F10, F2
    6            SD   0(R1), F4       ADDD F16, F14, F2
    7            SD   -8(R1), F8      ADDD F20, F18, F2
    8            SD   -16(R1), F12
    9            SUBI R1, R1, 40
    10           SD   16(R1), F16
    11           BNEZ R1, LOOP
    12           SD   8(R1), F20

- Loop unrolled five times and scheduled statically
- 6 cycles per element in original scheduled code
- 2.4 cycles per element in superscalar code (2.5x)
  - Loop unrolling gets us from 6 to 3.5 cycles per element (1.7x)
  - Superscalar execution gets us from 3.5 to 2.4 cycles per element (1.5x)


Multiple-Issue with Dynamic Scheduling

- Extend Tomasulo's algorithm
  - Support issuing 2 instr/cycle: 1 integer, 1 FP
- Simple approach: separate Tomasulo control for integer and FP units
  - One set of reservation stations for the integer unit and one for the FP unit
- How to do instruction issue with two instructions and keep in-order instruction issue for Tomasulo?
  - Issue logic runs in one-half clock cycle
  - Can do two in-order issues in one clock cycle


Performance of Dynamic Superscalar

(Issues, Executes and Writes result are clock-cycle numbers.)

    Iter. no.  Instruction        Issues  Executes  Writes result
    1          L.D   F0,0(R1)     1       2         4
    1          ADD.D F4,F0,F2     1       5         8
    1          S.D   0(R1),F4     2       9
    1          SUBI  R1,R1,8      3       4         5
    1          BNEZ  R1,LOOP      4       5
    2          L.D   F0,0(R1)     5       6         8
    2          ADD.D F4,F0,F2     5       9         12
    2          S.D
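
The speedup figures quoted on the "Performance of Static Superscalar" slide can be checked with a few lines of arithmetic. The sketch below is only an illustration, not part of the original slides: the function names are made up for this example, and the 12-cycle total for the superscalar loop is an assumption read off the schedule (twelve issue slots, one per clock cycle, covering five elements).

```python
# Illustrative check of the cycles-per-element figures on the
# "Performance of Static Superscalar" slide. Names are hypothetical.

def cycles_per_element(total_cycles, elements):
    """Average clock cycles spent on each array element."""
    return total_cycles / elements

def speedup(before, after):
    """Speedup factor when cycles per element drop from `before` to `after`."""
    return before / after

# Figures stated on the slide.
ORIGINAL = 6.0   # original scheduled loop, cycles per element
UNROLLED = 3.5   # unrolled, single-issue, cycles per element

# Assumption: the unrolled two-issue schedule takes 12 cycles per
# loop iteration and each iteration handles 5 elements.
SUPERSCALAR = cycles_per_element(12, 5)

unrolling_gain = speedup(ORIGINAL, UNROLLED)       # slide rounds this to 1.7x
issue_gain = speedup(UNROLLED, SUPERSCALAR)        # slide rounds this to 1.5x
total_gain = speedup(ORIGINAL, SUPERSCALAR)        # slide quotes 2.5x

if __name__ == "__main__":
    print(f"superscalar cycles/element: {SUPERSCALAR:.2f}")
    print(f"unrolling speedup:          {unrolling_gain:.2f}")
    print(f"dual-issue speedup:         {issue_gain:.2f}")
    print(f"combined speedup:           {total_gain:.2f}")
```

The two partial speedups multiply to the combined one, which is why the slide can attribute roughly 1.7x to unrolling and roughly 1.5x to dual issue.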