COMP 206: Computer Architecture and Implementation
Montek Singh
Wed., Oct. 29, 2003
Topic: Software Approaches for ILP (Compiler Techniques), contd.

Slide 2: Outline
- Motivation
- Compiler scheduling
- Loop unrolling
- Software pipelining
- Static branch prediction
- VLIW
Reading: HP3, Sections 4.1-4.5

Slide 3: Review: Instruction-Level Parallelism (ILP)
- Pipelining is most effective when there is parallelism among instructions
  - instrs u and v are parallel if neither P(u,v) nor P(v,u) holds
- Problem: parallelism within a basic block is limited
  - a branch frequency of 15% implies about 6 instructions per basic block
  - these instructions are likely to depend on each other
  - need to look beyond basic blocks
- Solution: exploit loop-level parallelism, i.e., parallelism across loop iterations
  - to convert loop-level parallelism into ILP, need to "unroll" the loop
    - dynamically, by the hardware
    - statically, by the compiler
    - using vector instructions (the same operation is applied to all the vector elements)

Slide 4: Motivating Example for Loop Unrolling

    for (i = 1000; i > 0; i--)
        x[i] = x[i] + s;

Assumptions:
- Scalar s is in register F2
- Array x starts at memory address 0
- 1-cycle branch delay
- No structural hazards

    LOOP: L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    0(R1), F4
          DADDUI R1, R1, -8
          BNEZ   R1, LOOP
          NOP

[Pipeline diagram, cycles 1-15: stalls between the dependent instructions delay the next iteration's L.D to cycle 11.]
Result: 10 cycles per iteration.

Slide 5: How Far Can We Get With Scheduling?

    LOOP: L.D    F0, 0(R1)
          DADDUI R1, R1, -8
          ADD.D  F4, F0, F2
          nop
          BNEZ   R1, LOOP
          S.D    8(R1), F4

Pipeline diagram for the scheduled loop (one instruction issued per cycle):

            1  2  3  4  5  6  7  8  9  10 11
    L.D     F  D  X  M  W
    DADDUI     F  D  X  M  W
    ADD.D         F  D  E  E  E  E  M  W
    nop              F  D
    BNEZ                F  D  X  M  W
    S.D                    F  D  X  M  W

Result: 6 cycles per iteration.
Note the change in the S.D instruction, from 0(R1) to 8(R1), compensating for the DADDUI now executing before it; this is a non-trivial change!

Slide 6: Observations on Scheduled Code
- 3 out of 5 instructions involve FP work
- The other two constitute loop overhead
- Could we improve performance by unrolling the loop?
  - assume the number of loop iterations is a multiple of 4, and unroll the loop body four times
  - in real life, must also handle loop counts that are not multiples of 4

Slide 7: Unrolling: Take 1

    LOOP: L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    0(R1), F4
          DADDUI R1, R1, -8
          L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    0(R1), F4
          DADDUI R1, R1, -8
          L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    0(R1), F4
          DADDUI R1, R1, -8
          L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    0(R1), F4
          DADDUI R1, R1, -8
          BNEZ   R1, LOOP
          NOP

- Even though we have gotten rid of the control dependences, we have data dependences through R1
- We could remove these data dependences by observing that R1 is decremented by 8 each time:
  - adjust the address specifiers
  - delete the first three DADDUIs
  - change the constant in the fourth DADDUI to -32
- These are non-trivial inferences for a compiler to make

Slide 8: Unrolling: Take 2
- Performance is now limited by the WAR dependences on F0
- These are name dependences
  - the instructions are not in a producer-consumer relation
  - they are simply using the same registers, but no actual data flows between them
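The distinction between a true data dependence and a name dependence can be sketched in C (the function names and the two-element example are mine, not from the slides; a real compiler performs this renaming on registers, not variables):

```c
#include <stddef.h>

/* With a single temporary t (like reusing F0 in every unrolled copy),
   each new write of t must wait for the previous read of t: a WAR
   name dependence that serializes the statements. */
double sum_reuse(const double *x, double s) {
    double t;
    t = x[0] + s;     /* writes t */
    double a = t;     /* reads t: the next write of t must wait (WAR) */
    t = x[1] + s;     /* rewrites t */
    double b = t;
    return a + b;
}

/* After renaming t into t0 and t1 (like using F0, F6, ... in the
   unrolled loop), the two additions share no storage and are fully
   independent, so they can be scheduled in parallel. */
double sum_renamed(const double *x, double s) {
    double t0 = x[0] + s;
    double t1 = x[1] + s;  /* no dependence on t0 */
    return t0 + t1;
}
```

Both functions compute the same value; only the renamed version exposes the two additions as independent work to the scheduler.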
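The full transformation the slides carry out on MIPS assembly — unroll the body four times, fold the four decrements into one adjusted set of address specifiers, and give each copy its own temporaries to break the name dependences on F0 — can be sketched in C (function names are mine; the cleanup loop handles counts that are not multiples of 4, as the slides note is needed in real life):

```c
#include <stddef.h>

/* Rolled form of the lecture's loop:
       for (i = 1000; i > 0; i--) x[i] = x[i] + s;
   (indexed from 0 here for simplicity). */
void add_scalar(double *x, size_t n, double s) {
    for (size_t i = 0; i < n; i++)
        x[i] = x[i] + s;
}

/* Unrolled 4x: one loop-count update per four elements (the single
   DADDUI by -32), offsets i..i+3 standing in for the adjusted address
   specifiers, and four temporaries t0..t3 standing in for the distinct
   registers that remove the WAR name dependences on F0. */
void add_scalar_unrolled4(double *x, size_t n, double s) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        double t0 = x[i]     + s;   /* four independent adds: */
        double t1 = x[i + 1] + s;   /* no name dependences,   */
        double t2 = x[i + 2] + s;   /* so they may overlap    */
        double t3 = x[i + 3] + s;
        x[i]     = t0;
        x[i + 1] = t1;
        x[i + 2] = t2;
        x[i + 3] = t3;
    }
    for (; i < n; i++)              /* cleanup: n not a multiple of 4 */
        x[i] = x[i] + s;
}
```

Both versions produce identical results; the unrolled one simply trades code size for fewer loop-overhead instructions per element and more independent work per iteration.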