UMD CMSC 411 - Lecture 8 Instruction Level Parallelism

This preview shows page 1-2 out of 6 pages.


CMSC 411 Computer Systems Architecture
Lecture 8: Instruction Level Parallelism
Alan Sussman
[email protected]

• HW #2, on pipelining, due Thursday
• Continue reading Chapter 2 of H&P
• First exam scheduled for Thursday, March 5
  – on Units 1-3

CMSC 411 - 8 (from Patterson)

Outline
• ILP
• Compiler techniques to increase ILP
• Loop Unrolling
• Static Branch Prediction
• Dynamic Branch Prediction
• Overcoming Data Hazards with Dynamic Scheduling
• (Start) Tomasulo Algorithm
• Conclusion

Control Dependencies
• Every instruction is control dependent on some set of branches, and, in general, these control dependencies must be preserved to preserve program order

    if p1 {
        S1;
    };
    if p2 {
        S2;
    }

• S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1.

Control Dependence Ignored
• Control dependence need not be preserved
  – we are willing to execute instructions that should not have been executed, thereby violating the control dependences, if we can do so without affecting the correctness of the program
• Instead, 2 properties critical to program correctness are
  1) exception behavior and
  2) data flow

Exception Behavior
• Preserving exception behavior means any changes in instruction execution order must not change how exceptions are raised in the program (no new exceptions)
• Example:

    DADDU R2,R3,R4
    BEQZ  R2,L1
    LW    R1,0(R2)
    L1:

  – (Assume branches not delayed)
• Problem with moving LW before BEQZ?

Data Flow
• Data flow: actual flow of data values among instructions that produce results and those that consume them
  – branches make flow dynamic, determine which instruction is supplier of data
• Example:

    DADDU R1,R2,R3
    BEQZ  R4,L
    DSUBU R1,R5,R6
    L: …
    OR    R7,R1,R8

• OR depends on DADDU or DSUBU? Must preserve data flow on execution

Software Techniques - Example
• This code adds a scalar to a vector:

    for (i=1000; i>0; i=i-1)
        x[i] = x[i] + s;

• Assume following latencies for all examples
  – Ignore delayed branch in these examples

    Instruction         Instruction          Latency      Stalls between
    producing result    using result         in cycles    in cycles
    FP ALU op           Another FP ALU op    4            3
    FP ALU op           Store double         3            2
    Load double         FP ALU op            1            1
    Load double         Store double         1            0
    Integer op          Integer op           1            0

FP Loop: Where are the Hazards?
• First translate into MIPS code:
  – To simplify, assume 8 is lowest address

    for (i=1000; i>0; i=i-1)
        x[i] = x[i] + s;

    Loop: L.D    F0,0(R1)   ;F0=vector element
          ADD.D  F4,F0,F2   ;add scalar from F2
          S.D    0(R1),F4   ;store result
          DADDUI R1,R1,-8   ;decrement pointer 8B (DW)
          BNEZ   R1,Loop    ;branch R1!=zero

FP Loop Showing Stalls

    for (i=1000; i>0; i=i-1)
        x[i] = x[i] + s;

    1 Loop: L.D    F0,0(R1)   ;F0=vector element
    2       stall
    3       ADD.D  F4,F0,F2   ;add scalar in F2
    4       stall
    5       stall
    6       S.D    0(R1),F4   ;store result
    7       DADDUI R1,R1,-8   ;decrement pointer 8B (DW)
    8       stall             ;assumes can't forward to branch
    9       BNEZ   R1,Loop    ;branch R1!=zero

  (plus branch delay!)
• 9 clock cycles: Rewrite code to minimize stalls?

Revised FP Loop Minimizing Stalls

    1 Loop: L.D    F0,0(R1)
    2       DADDUI R1,R1,-8
    3       ADD.D  F4,F0,F2
    4       stall
    5       stall
    6       S.D    8(R1),F4   ;altered offset when moving DADDUI
    7       BNEZ   R1,Loop

• Swap DADDUI and S.D by changing the address of S.D
• 7 clock cycles, but just 3 for execution (L.D, ADD.D, S.D) and 4 for loop overhead. How to make it faster?

Unroll Loop Four Times (straightforward way)

    1  Loop: L.D    F0,0(R1)
    3        ADD.D  F4,F0,F2      ;after 1 cycle stall
    6        S.D    0(R1),F4      ;after 2 cycles stall; drop DADDUI & BNEZ
    7        L.D    F6,-8(R1)
    9        ADD.D  F8,F6,F2
    12       S.D    -8(R1),F8     ;drop DADDUI & BNEZ
    13       L.D    F10,-16(R1)
    15       ADD.D  F12,F10,F2
    18       S.D    -16(R1),F12   ;drop DADDUI & BNEZ
    19       L.D    F14,-24(R1)
    21       ADD.D  F16,F14,F2
    24       S.D    -24(R1),F16
    25       DADDUI R1,R1,#-32    ;alter to 4*8
    27       BNEZ   R1,Loop

• 27 clock cycles, or 6.75 per iteration (Assumes R1 is multiple of 4)
• Rewrite loop to minimize stalls?

Unrolled Loop Detail
• Do not usually know upper bound of loop
• Suppose it is n, and we would like to unroll the loop to make k copies of the body
• Instead of a single unrolled loop, we generate a pair of consecutive loops:
  – 1st executes (n mod k) times and has a body that is the original loop
  – 2nd is the unrolled body surrounded by an outer loop that iterates (n/k) times
• For large values of n, most of the execution time will be spent in the unrolled loop

Unrolled Loop That Minimizes Stalls

    1 Loop: L.D    F0,0(R1)
    2       L.D    F6,-8(R1)
    3       L.D    F10,-16(R1)
    4       L.D    F14,-24(R1)
    5       ADD.D  F4,F0,F2
    6       ADD.D  F8,F6,F2
    7       ADD.D  F12,F10,F2
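
The cycle counts quoted on the slides (9 for the original loop, 7 for the rescheduled loop, 27 for the straightforward four-way unrolling) can be checked with a small model. The sketch below is only an illustration under stated assumptions: a single-issue, in-order pipeline, at most one relevant producer per instruction, and stall counts taken from the latency table plus the one-cycle "can't forward to branch" stall. The names `struct instr`, `count_cycles`, and the three arrays are hypothetical and not part of the lecture.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_INSTRS 64

/* One instruction of a straight-line schedule (hypothetical model).
   producer: index of the earlier instruction whose result is needed,
             or -1 when no dependence matters for stalls.
   stalls:   stall cycles required between that producer and this
             instruction (e.g. 1 for load double -> FP ALU op). */
struct instr {
    const char *text;
    int producer;
    int stalls;
};

/* Returns the cycle in which the last instruction issues.
   Cycles are numbered from 1, one instruction issues per cycle,
   and an instruction waits until its producer's stalls have passed. */
int count_cycles(const struct instr *code, size_t n)
{
    int issue[MAX_INSTRS];
    int cycle = 0;

    assert(n <= MAX_INSTRS);
    for (size_t i = 0; i < n; i++) {
        int earliest = cycle + 1;            /* next slot in program order */
        int p = code[i].producer;
        if (p >= 0) {
            assert((size_t)p < i);           /* producer must come earlier */
            int ready = issue[p] + 1 + code[i].stalls;
            if (ready > earliest)
                earliest = ready;
        }
        issue[i] = earliest;
        cycle = earliest;
    }
    return cycle;
}

/* Original loop body. */
const struct instr original_loop[] = {
    {"L.D    F0,0(R1)", -1, 0},
    {"ADD.D  F4,F0,F2",  0, 1},   /* load double -> FP ALU op */
    {"S.D    0(R1),F4",  1, 2},   /* FP ALU op -> store double */
    {"DADDUI R1,R1,-8", -1, 0},
    {"BNEZ   R1,Loop",   3, 1},   /* assumes can't forward to branch */
};

/* Rescheduled loop body: DADDUI moved up, S.D offset changed to 8. */
const struct instr scheduled_loop[] = {
    {"L.D    F0,0(R1)", -1, 0},
    {"DADDUI R1,R1,-8", -1, 0},
    {"ADD.D  F4,F0,F2",  0, 1},
    {"S.D    8(R1),F4",  2, 2},
    {"BNEZ   R1,Loop",   1, 1},
};

/* Four-way unrolled body in the straightforward order. */
const struct instr unrolled_loop[] = {
    {"L.D    F0,0(R1)",     -1, 0},
    {"ADD.D  F4,F0,F2",      0, 1},
    {"S.D    0(R1),F4",      1, 2},
    {"L.D    F6,-8(R1)",    -1, 0},
    {"ADD.D  F8,F6,F2",      3, 1},
    {"S.D    -8(R1),F8",     4, 2},
    {"L.D    F10,-16(R1)",  -1, 0},
    {"ADD.D  F12,F10,F2",    6, 1},
    {"S.D    -16(R1),F12",   7, 2},
    {"L.D    F14,-24(R1)",  -1, 0},
    {"ADD.D  F16,F14,F2",    9, 1},
    {"S.D    -24(R1),F16",  10, 2},
    {"DADDUI R1,R1,#-32",   -1, 0},
    {"BNEZ   R1,Loop",      12, 1},
};
```

Under these assumptions, `count_cycles(original_loop, 5)` gives 9, `count_cycles(scheduled_loop, 5)` gives 7, and `count_cycles(unrolled_loop, 14)` gives 27, which is 27/4 = 6.75 cycles per original iteration. The model tracks only one producer per instruction, so it would need extending for code where two different dependences can each cause stalls.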
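
The pair of consecutive loops described under "Unrolled Loop Detail" can be sketched in C for the slides' own example, x[i] = x[i] + s, with k = 4. This is a minimal illustration, not the compiler's actual output: the function names `add_scalar` and `add_scalar_unrolled4` are hypothetical, and the array is indexed from 0 rather than from 1 as on the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Reference version: the original loop, counting down over n elements
   of a 0-based array. */
void add_scalar(double *x, size_t n, double s)
{
    for (size_t i = n; i > 0; i--)
        x[i - 1] += s;
}

/* Unrolled version with k = 4 copies of the body.
   Loop 1 executes (n mod 4) times with the original body.
   Loop 2 executes (n / 4) times with the unrolled body. */
void add_scalar_unrolled4(double *x, size_t n, double s)
{
    size_t i = n;
    size_t remainder = n % 4;

    /* 1st loop: peel off the leftover iterations. */
    for (size_t r = 0; r < remainder; r++) {
        i--;
        x[i] += s;
    }

    /* What is left is a multiple of 4, as the unrolled body needs. */
    assert(i % 4 == 0);

    /* 2nd loop: four copies of the body, one index update and one
       branch per four elements. */
    for (; i > 0; i -= 4) {
        x[i - 1] += s;
        x[i - 2] += s;
        x[i - 3] += s;
        x[i - 4] += s;
    }
}
```

For n = 10 the first loop handles 2 elements and the second loop runs twice. Both functions should leave the array in the same state for any n, including n = 0 and n smaller than 4.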

