Unformatted text preview:

CSE 5317Lecture 22: Loop optimizations8 April 2010Nate NystromUTACredits: John Cavazos, University of DelawareJeremy Johnson, Drexel UniversityTuesday, May 4, 2010Lecture Overview• Very Brief Introduction to Dependences • Loop TransformationsTuesday, May 4, 2010The Big Picture• What are our goals?• Simple goals: Make execution time as small as possible• Which leads to:• Achieve execution of many (all, in the best case) instructions in parallel• Find independent instructions Tuesday, May 4, 2010Dependences•We will concentrate on data dependences• Simple example of data dependence:S1: pi = 3.14;S2: r = 5.0;S3: area = pi * r * r;• Statement S3 cannot be moved before either S1 or S2 without compromising correct resultsTuesday, May 4, 2010DependencesFormally:•There is a data dependence from statement S1 to statement S2 (S2 depends on S1) if: 1.Both statements access the same memory location and at least one of them writes into it, and2.There is a feasible run-time execution path from S1 to S2Tuesday, May 4, 2010Load Store Classification• Dependences classified in terms of load-store order:• True dependence (“read after write” = RAW hazard)• Antidependence (WAR hazard)• Output dependence (WAW hazard)Tuesday, May 4, 2010Examples• True dependencea = 3b = a * 2• Anti dependencea = b * 3b = 8• Output dependencea = b * ca = e + fTuesday, May 4, 2010• Consider this loop:•Iteration i+1 reads the element of A value written by iteration i•S1 is a loop-carried dependenceDependence in Loopsfor (i = 0; i < N-1; i++) A[i+1] = A[i] + B[i]; // S1Tuesday, May 4, 2010Transformations• Loops (and other instructions) can be transformed as long as order of dependences does not changeTuesday, May 4, 2010Reordering Transformations• Any program transformation that changes the order of execution of the code, without adding or removing any executions of any statements• Does not eliminate dependences• But, it can change the ordering of the dependence, which will lead to incorrect behavior• Preserves a dependence if it preserves the relative execution order of the source and sink of that dependenceTuesday, May 4, 2010Loop Transformations• Compilers have always focused on loops• Higher execution counts• Repeated, related operations• Much of real work takes place in loopsTuesday, May 4, 2010Several effects to attack• Overhead• Decrease control-structure cost per iteration • Locality • Spatial locality 󲰛 use of co-resident data• Temporal locality 󲰛 reuse of same data• Parallelism• Execute independent iterations of loop in parallelTuesday, May 4, 2010Eliminating Overhead• Loop unrolling • To reduce overhead, replicate the loop body• Sources of Improvement• Less overhead per useful operation• Longer basic blocks for local optimizationfor (i = 0; i < 100; i++) { A[i] = A[i] + B[i];}for (i = 0; i < 100; i+=4) { A[i] = A[i] + B[i]; A[i+1] = A[i+1] + B[i+1]; A[i+2] = A[i+2] + B[i+2]; A[i+3] = A[i+3] + B[i+3];}Tuesday, May 4, 2010Eliminating Overhead• Loop unrolling with unknown bounds • Generate guard loopsfor (i = 0; i+4 < n; i += 4) { A[i] = A[i] + B[i]; A[i+1] = A[i+1] + B[i+1]; A[i+2] = A[i+2] + B[i+2]; A[i+3] = A[i+3] + B[i+3];} for (; i < n; i++) { A[i] = A[i] + B[i];}Tuesday, May 4, 2010Loop Unswitching• Hoist invariant control-flow out of loop nest• Replicate the loop & specialize it• No tests, branches in loop body• Longer segments of straight-line code*Tuesday, May 4, 2010Loop Unswitchingfor (...) { S0; if (c) S1; else S2; S3;}if (c) for (...) { S0; S1; S3; }else for (...) { S0; S2; S3; }Tuesday, May 4, 2010Loop Unswitchingfor (i = 0; i < N; i++) { A[i] = A[i] + B[i]; if (c) D[i] = 0;}if (c) { for (i = 0; i < N; i++) { A[i] = A[i] + B[i]; D[i] = 0; }}else { for (i = 0; i < N; i++) { A[i] = A[i] + B[i]; }}Tuesday, May 4, 2010Loop fusionfor (i = 0; i < N; i++) { C[i] = A[i] + B[i];}for (j = 0; j < N; i++) { D[j] = A[j] * C[j];}for (i = 0; i < N; i++) { C[i] = A[i] + B[i]; D[i] = A[i] * C[i];}For big arrays, A[j] may not be in the cacheA[i] will be in the cache• Two loops over same iteration space 󲰛 one loop• Safe if does not change the values used or defined by any statement in either loop (i.e., does not violate dependences)Tuesday, May 4, 2010Loop Fusion Advantages• Enhance temporal locality• Reduce control overhead• Longer blocks for local optimization & scheduling• Can convert inter-loop reuse to intra-loop reuse*Tuesday, May 4, 2010Loop distribution (fission)• Single loop with independent statements 󲰛 multiple loops• Starts by constructing statement level dependence graph• Safe to perform distribution if:• No cycles in the dependence graph• Statements forming cycle in dependence graph put in same loopTuesday, May 4, 2010Loop distribution (fission)for (i = 0; i < N; i++) { C[i] = A[i] + B[i];}for (j = 0; j < N; i++) { F[j] = D[j] * E[j];}for (i = 0; i < N; i++) { C[i] = A[i] + B[i]; F[i] = D[i] * E[i];}Tuesday, May 4, 2010Loop distribution (fission)Has the following dependence graph(1) for (i = 1; i < N; i++) {(2) A[i] = A[i] + B[i-1](3) B[i] = C[i-1]*X+C(4) C[i] = 1/B[i](5) D[i] = sqrt(C[i])(6) }Tuesday, May 4, 2010Loop distribution (fission)(1) for (i = 1; i < N; i++) {(2) A[i] = A[i] + B[i-1](3) B[i] = C[i-1]*X+C(4) C[i] = 1/B[i](5) D[i] = sqrt(C[i])(6) }for (i = 1; i < N; i++) { A[i] = A[i] + B[i-1]}for (i = 1; i < N; i++) { B[i] = C[i-1]*X+C C[i] = 1/B[i]}for (i = 1; i < N; i++) { D[i] = sqrt(C[i])}Tuesday, May 4, 201026Loop Fission Advantages• Enables other transformations ! !• e.g., vectorization• Resulting loops have smaller cache footprints • More reuse hits in the cacheTuesday, May 4, 201027Loop Interchangefor (i = 1; i < 50; i++) for (j = 0; j < 100; j++) A[i][j] = B[i][j] * C[i][j];for (j = 0; j < 100; j++) for (i = 1; i < 50; i++) A[i][j] = B[i][j] * C[i][j];Swap inner & outer loops to rearrange iteration spaceTuesday, May 4, 2010• If one loop carries all dependence relations• Swap to outermost loop and all inner loops executed in parallel• If outer loops iterates many times and inner only a few• Swap outer and inner loops to reduce startup overhead• Improves reuse by using more elements per cache line• Goal is to get as much reuse into inner loop as possible28Loop


View Full Document

UT Arlington CSE 5317 - Loop optimizations

Download Loop optimizations
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Loop optimizations and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Loop optimizations 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?