Slide titles:
  COMP 206: Computer Architecture and Implementation
  Outline
  Multiple Issue
  A Superscalar Version of MIPS
  Performance of Static Superscalar
  Multiple-Issue with Dynamic Scheduling
  Performance of Dynamic Superscalar
  Limits of Superscalar
  Very Long Instruction Word (VLIW)
  Loop Unrolling in VLIW
  Limits to Multi-Issue Machines
  Hardware Support for More ILP
  Slide 13
  Hardware support for More ILP
  Four Steps of Speculative Tomasulo Algorithm
  Result Shift Register and Reorder Buffer
  Step I: I-O Initiation, O-O Termination (RSRa)
  Step II: I-O Initiation, I-O Termination (RSRb)
  Step III: Use Re-order Buffer (ROB)
  ROB: I-O Initiation, I-O Termination
  States of Circular Buffer
  Complexity of ROB

Slide 1: COMP 206: Computer Architecture and Implementation
  Montek Singh
  Wed, Oct 19, 2005
  Topic: Instruction-Level Parallelism (Multiple-Issue, Speculation)

Slide 2: Outline
  Multiple-Issue Architectures
    Superscalar processors
    VLIW (very long instruction word) processors
  Scheduling
    Statically scheduled (using compiler techniques)
    Dynamically scheduled (using variants of Tomasulo’s alg.)
  Reading: HP3, Sections 3.6-3.7

Slide 3: Multiple Issue
  Eliminating data and control stalls can achieve CPI of 1
  Can we decrease CPI below 1?
    Not if we issue only one instruction per clock cycle
  Multiple-issue processors allow multiple instructions to issue in a clock cycle
    Superscalar: issue varying numbers of instructions per clock (dynamic issue)
      Statically scheduled by compiler
      Dynamically scheduled by hardware
    VLIW: issue fixed number of instructions per clock (static issue)
      Statically scheduled by compiler
  Examples
    Superscalar: IBM PowerPC, Sun SuperSPARC, DEC Alpha, HP 8000
    VLIW: Intel/HP Itanium

Slide 4: A Superscalar Version of MIPS
  Two instructions can be issued per clock cycle
    One can be load/store/branch/integer operation
    Other can be any FP operation
  Need to fetch and decode 64 bits per cycle
    Instructions paired and aligned on 64-bit boundary
    Integer instruction appears first
  Dynamic issue
    First instruction issues if independent and satisfies other criteria
    Second instruction issues only if first one does, and is independent and satisfies similar criteria
  Limitation
    One-cycle delay for loads and branches now turns into three-instruction delay!
    … because instructions are now squeezed closer together

Slide 5: Performance of Static Superscalar
  Original loop:

    LOOP: LD   F0, 0(R1)
          ADDD F4, F0, F2
          SD   0(R1), F4
          SUBI R1, R1, 8
          BNEZ R1, LOOP

  Unrolled and scheduled loop (one row per issue; integer instruction first, FP instruction second):

          Integer instruction     FP instruction
    LOOP: LD   F0, 0(R1)
          LD   F6, -8(R1)
          LD   F10, -16(R1)       ADDD F4, F0, F2
          LD   F14, -24(R1)       ADDD F8, F6, F2
          LD   F18, -32(R1)       ADDD F12, F10, F2
          SD   0(R1), F4          ADDD F16, F14, F2
          SD   -8(R1), F8         ADDD F20, F18, F2
          SD   -16(R1), F12
          SUBI R1, R1, 40
          SD   16(R1), F16
          BNEZ R1, LOOP
          SD   8(R1), F20

  Loop unrolled five times and scheduled statically
    6 cycles per element in original scheduled code
    2.4 cycles per element in superscalar code (2.5x)
  Loop unrolling gets us from 6 to 3.5 cycles per element (1.7x)
  Superscalar execution from 3.5 to 2.4 cycles per element (1.5x)

Slide 6: Multiple-Issue with Dynamic Scheduling
  Extend Tomasulo’s algorithm
    support issuing 2 instr/cycle: 1 integer, 1 FP
  Simple approach: separate Tomasulo Control for Integer and FP units:
    one set of reservation stations for Integer unit and one for FP unit
  How to do instruction issue with two instructions and keep in-order instruction issue for Tomasulo?
    issue logic runs in one-half clock cycle
    can do two in-order issues in one clock cycle

Slide 7: Performance of Dynamic Superscalar
  (entries are clock-cycle numbers)

    Iter. no.   Instructions       Issues   Executes   Writes result
    1           L.D F0,0(R1)       1        2          4
    1           ADD.D F4,F0,F2     1        5          8
    1           S.D 0(R1),F4       2        9
    1           SUBI R1,R1,8       3        4          5
    1           BNEZ R1,LOOP       4        5
    2           L.D F0,0(R1)       5        6          8
    2           ADD.D F4,F0,F2     5        9          12
    2           S.D
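
The cycles-per-element figures quoted on the "Performance of Static Superscalar" slide can be checked with a short calculation. The sketch below is not part of the lecture. It assumes that each row of the unrolled schedule occupies one clock cycle and that the opcode sets listed are the only ones that matter for this loop. The names SCHEDULE, check_pairing and cycles_per_element are made up for illustration.

```python
# Unrolled, statically scheduled loop from the slide.
# Assumption: each tuple is one clock cycle,
# (integer-slot instruction, FP-slot instruction or None).
SCHEDULE = [
    ("LD F0, 0(R1)", None),
    ("LD F6, -8(R1)", None),
    ("LD F10, -16(R1)", "ADDD F4, F0, F2"),
    ("LD F14, -24(R1)", "ADDD F8, F6, F2"),
    ("LD F18, -32(R1)", "ADDD F12, F10, F2"),
    ("SD 0(R1), F4", "ADDD F16, F14, F2"),
    ("SD -8(R1), F8", "ADDD F20, F18, F2"),
    ("SD -16(R1), F12", None),
    ("SUBI R1, R1, 40", None),
    ("SD 16(R1), F16", None),
    ("BNEZ R1, LOOP", None),
    ("SD 8(R1), F20", None),
]

# Opcodes allowed in each issue slot (only those used by this loop).
INTEGER_SLOT_OPS = {"LD", "SD", "SUBI", "BNEZ"}  # load/store/branch/integer
FP_SLOT_OPS = {"ADDD"}                           # any FP operation


def opcode(instruction):
    """Return the mnemonic, i.e. the first whitespace-separated token."""
    return instruction.split()[0]


def check_pairing(schedule):
    """True if every row obeys the slide's rule: integer first, FP second."""
    for int_instr, fp_instr in schedule:
        if opcode(int_instr) not in INTEGER_SLOT_OPS:
            return False
        if fp_instr is not None and opcode(fp_instr) not in FP_SLOT_OPS:
            return False
    return True


def cycles_per_element(schedule, elements):
    """Cycles per array element, assuming one cycle per schedule row."""
    return len(schedule) / elements


ELEMENTS_PER_ITERATION = 5  # loop unrolled five times
ORIGINAL_CPE = 6.0          # quoted on the slide: original scheduled code
UNROLLED_CPE = 3.5          # quoted on the slide: unrolled, single issue

superscalar_cpe = cycles_per_element(SCHEDULE, ELEMENTS_PER_ITERATION)

print("pairing rule holds:", check_pairing(SCHEDULE))
print("superscalar cycles/element:", superscalar_cpe)
print("speedup over original:", round(ORIGINAL_CPE / superscalar_cpe, 1))
print("speedup from unrolling:", round(ORIGINAL_CPE / UNROLLED_CPE, 1))
print("speedup from dual issue:", round(UNROLLED_CPE / superscalar_cpe, 1))
```

Under these assumptions the schedule has 12 rows for 5 elements, which is where the 2.4 cycles per element comes from. The three ratios reproduce the 2.5x, 1.7x and 1.5x figures after rounding to one decimal place.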


UNC-Chapel Hill COMP 206 - Instruction-Level Parallelism
