COMP 206: Computer Architecture and Implementation
Montek Singh
Wed., Oct. 29, 2003
Topic: Software Approaches for ILP (Compiler Techniques), contd.

Slide 2: Outline
- Motivation
- Compiler scheduling
- Loop unrolling
- Software pipelining
- Static branch prediction
- VLIW
Reading: HP3, Sections 4.1-4.5

Slide 3: Review: Instruction-Level Parallelism (ILP)
- Pipelining is most effective when there is parallelism among instructions
  - instrs u and v are parallel if neither P(u,v) nor P(v,u) holds
- Problem: parallelism within a basic block is limited
  - a branch frequency of 15% implies about 6 instructions per basic block
  - these instructions are likely to depend on each other
  - need to look beyond basic blocks
- Solution: exploit loop-level parallelism, i.e., parallelism across loop iterations
  - to convert loop-level parallelism into ILP, need to "unroll" the loop
    - dynamically, by the hardware
    - statically, by the compiler
    - using vector instructions (the same operation is applied to all the vector elements)

Slide 4: Motivating Example for Loop Unrolling

    for (i = 1000; i > 0; i--)
        x[i] = x[i] + s;

Assumptions:
- Scalar s is in register F2
- Array x starts at memory address 0
- 1-cycle branch delay
- No structural hazards

    LOOP: L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    0(R1), F4
          DADDUI R1, R1, -8
          BNEZ   R1, LOOP
          NOP

[Pipeline diagram, cycles 1-15: stalls between the dependent instructions delay the next iteration's L.D to cycle 11.]
Result: 10 cycles per iteration.

Slide 5: How Far Can We Get With Scheduling?

    LOOP: L.D    F0, 0(R1)
          DADDUI R1, R1, -8
          ADD.D  F4, F0, F2
          nop
          BNEZ   R1, LOOP
          S.D    8(R1), F4

Pipeline diagram for the scheduled loop (one instruction issued per cycle):

            1  2  3  4  5  6  7  8  9  10 11
    L.D     F  D  X  M  W
    DADDUI     F  D  X  M  W
    ADD.D         F  D  E  E  E  E  M  W
    nop              F  D
    BNEZ                F  D  X  M  W
    S.D                    F  D  X  M  W

Result: 6 cycles per iteration.
Note the change in the S.D instruction, from 0(R1) to 8(R1), compensating for the DADDUI now executing before it; this is a non-trivial change!

Slide 6: Observations on Scheduled Code
- 3 out of 5 instructions involve FP work
- The other two constitute loop overhead
- Could we improve performance by unrolling the loop?
  - assume the number of loop iterations is a multiple of 4, and unroll the loop body four times
  - in real life, must also handle loop counts that are not multiples of 4

Slide 7: Unrolling: Take 1

    LOOP: L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    0(R1), F4
          DADDUI R1, R1, -8
          L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    0(R1), F4
          DADDUI R1, R1, -8
          L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    0(R1), F4
          DADDUI R1, R1, -8
          L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    0(R1), F4
          DADDUI R1, R1, -8
          BNEZ   R1, LOOP
          NOP

- Even though we have gotten rid of the control dependences, we have data dependences through R1
- We could remove these data dependences by observing that R1 is decremented by 8 each time:
  - adjust the address specifiers
  - delete the first three DADDUIs
  - change the constant in the fourth DADDUI to -32
- These are non-trivial inferences for a compiler to make

Slide 8: Unrolling: Take 2
- Performance is now limited by the WAR dependences on F0
- These are name dependences
  - the instructions are not in a producer-consumer relation
  - they are simply using the same registers, but no actual data flows between them
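The distinction between a true data dependence and a name dependence can be sketched in C (the function names and the two-element example are mine, not from the slides; a real compiler performs this renaming on registers, not variables):

```c
#include <stddef.h>

/* With a single temporary t (like reusing F0 in every unrolled copy),
   each new write of t must wait for the previous read of t: a WAR
   name dependence that serializes the statements. */
double sum_reuse(const double *x, double s) {
    double t;
    t = x[0] + s;     /* writes t */
    double a = t;     /* reads t: the next write of t must wait (WAR) */
    t = x[1] + s;     /* rewrites t */
    double b = t;
    return a + b;
}

/* After renaming t into t0 and t1 (like using F0, F6, ... in the
   unrolled loop), the two additions share no storage and are fully
   independent, so they can be scheduled in parallel. */
double sum_renamed(const double *x, double s) {
    double t0 = x[0] + s;
    double t1 = x[1] + s;  /* no dependence on t0 */
    return t0 + t1;
}
```

Both functions compute the same value; only the renamed version exposes the two additions as independent work to the scheduler.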
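The full transformation the slides carry out on MIPS assembly — unroll the body four times, fold the four decrements into one adjusted set of address specifiers, and give each copy its own temporaries to break the name dependences on F0 — can be sketched in C (function names are mine; the cleanup loop handles counts that are not multiples of 4, as the slides note is needed in real life):

```c
#include <stddef.h>

/* Rolled form of the lecture's loop:
       for (i = 1000; i > 0; i--) x[i] = x[i] + s;
   (indexed from 0 here for simplicity). */
void add_scalar(double *x, size_t n, double s) {
    for (size_t i = 0; i < n; i++)
        x[i] = x[i] + s;
}

/* Unrolled 4x: one loop-count update per four elements (the single
   DADDUI by -32), offsets i..i+3 standing in for the adjusted address
   specifiers, and four temporaries t0..t3 standing in for the distinct
   registers that remove the WAR name dependences on F0. */
void add_scalar_unrolled4(double *x, size_t n, double s) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        double t0 = x[i]     + s;   /* four independent adds: */
        double t1 = x[i + 1] + s;   /* no name dependences,   */
        double t2 = x[i + 2] + s;   /* so they may overlap    */
        double t3 = x[i + 3] + s;
        x[i]     = t0;
        x[i + 1] = t1;
        x[i + 2] = t2;
        x[i + 3] = t3;
    }
    for (; i < n; i++)              /* cleanup: n not a multiple of 4 */
        x[i] = x[i] + s;
}
```

Both versions produce identical results; the unrolled one simply trades code size for fewer loop-overhead instructions per element and more independent work per iteration.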