15-745 © Seth Copen Goldstein 2000-5 115-745Software Pipelining15-745 Optimizing CompilerSpring 2006Peter LeeCopyright © Seth Copen Goldstein 2000-5(some slides borrowed from M. Voss)15-745 © Seth Copen Goldstein 2000-5 2Trace Schedulingx>10?a=b+cd=a-3 a=b+cf=a+3a=b+cx>10?d=a-3 f=a+3a=e*fd=a-3a=b+cx=x+1a=e*fd=a-3……a=b+cx=x+1d=a-315-745 © Seth Copen Goldstein 2000-5 3Software Pipelining• Software pipelining is an IS technique that reordersthe instructions in a loop.– Possibly moving instructions from one iteration tothe previous or the next iteration.– Very large improvements in running time are possible.• The first serious approach to software pipelining waspresented by Aiken & Nicolau.– Aiken’s 1988 Ph.D. thesis.– Impractical as it ignores resource hazards (focusingonly on data-dependence constraints).• But sparked a large amount of follow-on research.15-745 © Seth Copen Goldstein 2000-5 4Goal of SP• Increase distance between dependentoperations by moving destination operation to alater iterationA: a ! ld [d]B: b ! a * aC: st [d], bD: d ! d + 4Assume all have latency of 2BA C D15-745 © Seth Copen Goldstein 2000-5 5Can we decrease the latency?• Lets unrollA: a ! ld [d]B: b ! a * aC: st [d], bD: d ! d + 4A1: a ! ld [d]B1: b ! a * aC1: st [d], bD1: d ! d + 4DCBA B1A1 C1 D115-745 © Seth Copen Goldstein 2000-5 6Rename variablesA: a ! ld [d]B: b ! a * aC: st [d], bD: d1 ! d + 4A1: a1 ! ld [d1]B1: b1 ! a1 * a1C1: st [d1], b1D1: d ! d1 + 4DCBA B1A1 C1 D115-745 © Seth Copen Goldstein 2000-5 7ScheduleA: a ! ld [d]B: b ! a * aC: st [d], bD: d1 ! d + 4A1: a1 ! ld [d1]B1: b1 ! a1 * a1C1: st [d1], b1D1: d ! d1 + 4ABCDA1B1C1D1B1CA1BDAC1D115-745 © Seth Copen Goldstein 2000-5 8Unroll Some MoreA: a ! ld [d]B: b ! a * aC: st [d], bD: d1 ! d + 4A1: a1 ! ld [d1]B1: b1 ! a1 * a1C1: st [d1], b1D1: d2 ! d1 + 4A2: a2 ! ld [d2]B2: b2 ! a2 * a2C2: st [d2], b2D2: d ! d2 + 4ABCDA1B1C1D1A2B2C2D2C1B1A1DCA2BD1AC2B2D215-745 © Seth Copen Goldstein 2000-5 9Unroll Some MoreA: a ! ld [d]B: b ! a * aC: st [d], bD: d1 ! d + 4A1: a1 ! ld [d1]B1: b1 ! a1 * a1C1: st [d1], b1D1: d2 ! d1 + 4A2: a2 ! ld [d2]B2: b2 ! a2 * a2C2: st [d2], b2D2: d ! d2 + 4ABCDA1B1C1D1A2B2C2D3C2B2A2D1C1B1A1DA3CD2BAC3B3D3D2A3B3C315-745 © Seth Copen Goldstein 2000-5 10One More TimeABCDA1B1C1D1A2B2C2D3C3B3A3D2B4C2B2A2D1C1B1A1DCD3BAC4A4D4D2A3B3C3A4B4C415-745 © Seth Copen Goldstein 2000-5 11Can RearrangeABCDA1B1C1D1A2B2C2D3C3B3A3D2B4C2B2A2D1C1B1A1DCD3BAC4A4D4D2A3B3C3A4B4C415-745 © Seth Copen Goldstein 2000-5 12RearrangeABCDA1B1C1D1A2B2C2D3B3C2B2A2D1C1B1A1DD2CBAC3A3D3D2A3B3C3A: a ! ld [d]B: b ! a * aC: st [d], bD: d1 ! d + 4A1: a1 ! ld [d1]B1: b1 ! a1 * a1C1: st [d1], b1D1: d2 ! d1 + 4A2: a2 ! ld [d2]B2: b2 ! a2 * a2C2: st [d2], b2D2: d ! d2 + 415-745 © Seth Copen Goldstein 2000-5 13RearrangeABCDA1B1C1D1A2B2C2D3B3C2B2A2D1C1B1A1DD2CBAC3A3D3D2A3B3C3A: a ! ld [d]B: b ! a * aC: st [d], bD: d1 ! d + 4A1: a1 ! ld [d1]B1: b1 ! a1 * a1C1: st [d1], b1D1: d2 ! d1 + 4A2: a2 ! ld [d2]B2: b2 ! a2 * a2C2: st [d2], b2D2: d ! d2 + 415-745 © Seth Copen Goldstein 2000-5 14SP LoopA: a ! ld [d]B: b ! a * aD: d1 ! d + 4A1: a1 ! ld [d1]D1: d2 ! d1 + 4C: st [d], bB1: b1 ! a1 * a1A2: a2 ! ld [d2]D2: d ! d2 + 4B2: b2 ! a2 * a2C1: st [d1], b1D3: d2 ! d1 + 4C2: st [d2], b2D2A2B1CC2B2A2A2D1C1B1B1A1DD2CBAD2D3CPrologBodyEpilog15-745 © Seth Copen Goldstein 2000-5 15Goal of SP• Increase distance between dependentoperations by moving destination operation to alater iterationABCdependenciesin initial loopABCiteration i i+1 i+2after SP15-745 © Seth Copen Goldstein 2000-5 16ExampleAssume operating on a infinite wide machineA0A1 B0A2 B1 C0A3 B2 C1B3 C2C3A0A1 B0AiBi-1Ci-2BiCi-1Ci15-745 © Seth Copen Goldstein 2000-5 17ExampleAssume operating on a infinite wide machineA0A1 B0AiBi-1Ci-2BiCi-1CiPrologepilogloop body15-745 © Seth Copen Goldstein 2000-5 18for (i=0; i<N; i++){AiBiCi}Dealing with exit conditionsi=0if (i >= N) goto doneA0B0if (i+1 == N) goto lasti=1A1if (i+2 == N) goto epilogi=2loop:AiBi-1Ci-2i++if (i < N) goto loopepilog:BiCi-1last:cidone:15-745 © Seth Copen Goldstein 2000-5 19Loop Unrolling V. SPFor SuperScalar• Loop Unrolling reduces loop overhead• Software Pipelining reduces fill/drain• Best is if you combine themSoftware PipeliningLoop Unrolling# ofoverlappediterationsTime15-745 © Seth Copen Goldstein 2000-5 20Aiken/Nicolau SchedulingStep 1Perform scalar replacement to eliminate memoryreferences where possible.for i:=1 to N do a := j " V[i-1] b := a " f c := e " j d := f " c e := b " d f := U[i] g: V[i] := b h: W[i] := d j := X[i]for i:=1 to N do a := j " b b := a " f c := e " j d := f " c e := b " d f := U[i] g: V[i] := b h: W[i] := d j := X[i]15-745 © Seth Copen Goldstein 2000-5 21Aiken/Nicolau SchedulingStep 2Unroll the loop and compute the data-dependencegraph (DDG).DDG for rolled loop:for i:=1 to N do a := j " b b := a " f c := e " j d := f " c e := b " d f := U[i] g: V[i] := b h: W[i] := d j := X[i]abcdefghj15-745 © Seth Copen Goldstein 2000-5 22Aiken/Nicolau SchedulingStep 2, cont’dDDG for unrolled loop:for i:=1 to N do a := j " b b := a " f c := e " j d := f " c e := b " d f := U[i] g: V[i] := b h: W[i] := d j := X[i]a1b1c1d1e1j1f1g1h1a2b2a3b3g2j2f2c2d2e2h1c315-745 © Seth Copen Goldstein 2000-5 23Aiken/Nicolau SchedulingStep 3Build a tableau of iteration number vs cycle time.acfj fj fj fj fj fjbdegh a cb dg a eh b cg a d b eh g a c b d g a eh b c g d eh iteration 1 2 3 4 5 6123456789101112131415cyclea1b1c1d1e1j1f1g1h1a2b2a3b3g2j2f2c2d2e2h1c315-745 © Seth Copen Goldstein 2000-5 24Aiken/Nicolau SchedulingStep 4Find repeating patterns of instructions.acfj fj fj fj fj fjbdegh a cb dg a eh b cg a d b eh g a c b d g a eh b c g d eh iteration 1 2 3 4 5 6123456789101112131415cycle15-745 © Seth Copen Goldstein 2000-5 25Aiken/Nicolau SchedulingStep 4Find repeating patterns of instructions.acfj fj fj fj fj fjbdegh a cb dg a eh b cg a d b eh g a c b d g a eh b c g d eh iteration 1 2 3 4 5 6123456789101112131415cycle15-745 © Seth Copen Goldstein 2000-5 26Aiken/Nicolau SchedulingStep 5“Coalesce” the
View Full Document