UD CISC 672 - Introduction to Optimization

Introduction to Optimization
John Cavazos, University of Delaware

Lecture Overview
- Motivation
- Loop Transformations

Why study compiler optimizations?
- Moore's Law
  - Chip density doubles every 18 months
  - Reflected in CPU performance doubling every 18 months
- Proebsting's Law
  - Compilers double CPU performance every 18 years
  - That is roughly a 4% improvement per year from optimizations!

Why study compiler optimizations? (corollary)
- 1 year of code optimization research = 1 month of hardware improvements
- No need for compiler research… just wait a few months!

Free Lunch is over
- Moore's Law: chip density doubles every 18 months
- Corollary: cores will become simpler
- Just wait a few months… your code might get slower!
- Many optimizations are now being done by hand (autotuning)

Optimizations: The Big Picture
- What are our goals?
  - Simple goal: make execution time as small as possible
- Which leads to:
  - Achieving execution of many (all, in the best case) instructions in parallel
  - Finding independent instructions

Dependences
- We will concentrate on data dependences
- A simple example of a data dependence:

      S1  PI = 3.14
      S2  R = 5.0
      S3  AREA = PI * R ** 2

- Statement S3 cannot be moved before either S1 or S2 without compromising correct results

Dependences (formally)
- There is a data dependence from S1 to S2 (S2 depends on S1) if:
  1. Both statements access the same memory location and at least one of them stores to it, and
  2. There is a feasible execution path from S1 to S2

Load-Store Classification
- Dependences are classified in terms of load-store order:
  1. True dependence (RAW hazard)
  2. Antidependence (WAR hazard)
  3. Output dependence (WAW hazard)
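To make the three dependence classes concrete, here is a minimal C sketch (not from the slides; the variable names are chosen purely for illustration). Each pair of statements is annotated with the hazard it forms, and none of the annotated pairs can be reordered without changing the result.

      #include <stdio.h>

      int main(void) {
          int a, b, c;

          /* True dependence (RAW): the second statement reads the value
             the first one wrote. */
          a = 10;        /* S1: write a */
          b = a + 1;     /* S2: read a  -> RAW on a */

          /* Antidependence (WAR): the second statement overwrites a value
             the first one still needed to read. */
          c = b + 2;     /* S3: read b  */
          b = 42;        /* S4: write b -> WAR on b */

          /* Output dependence (WAW): both statements write the same location,
             so swapping them changes which value survives. */
          a = c;         /* S5: write a */
          a = c + 1;     /* S6: write a -> WAW on a */

          printf("%d %d %d\n", a, b, c);
          return 0;
      }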
Dependence in Loops
- Let us look at two different loops:

      DO I = 1, N
      S1    A(I+1) = A(I) + B(I)
      ENDDO

      DO I = 1, N
      S1    A(I+2) = A(I) + B(I)
      ENDDO

- In both cases, statement S1 depends on itself

Transformations
- We call a transformation safe if the transformed program has the same "meaning" as the original program
- But, what is the "meaning" of a program? For our purposes, two programs are equivalent if, on the same inputs, they produce the same outputs in the same order

Loop Transformations
- Compilers have always focused on loops
  - Higher execution counts
  - Repeated, related operations
- Much of the real work takes place in loops

Several effects to attack
- Overhead
  - Decrease the control-structure cost per iteration
- Locality
  - Spatial locality ⇒ use of co-resident data
  - Temporal locality ⇒ reuse of the same data
- Parallelism
  - Execute independent iterations of the loop in parallel

Eliminating Overhead
- Loop unrolling (the oldest trick in the book): to reduce overhead, replicate the loop body
- Sources of improvement:
  - Less overhead per useful operation
  - Longer basic blocks for local optimization

      do i = 1 to 100 by 1
          a(i) = a(i) + b(i)
      end

  becomes (unroll by 4)

      do i = 1 to 100 by 4
          a(i)   = a(i)   + b(i)
          a(i+1) = a(i+1) + b(i+1)
          a(i+2) = a(i+2) + b(i+2)
          a(i+3) = a(i+3) + b(i+3)
      end

Loop Fusion
- Two loops over the same iteration space ⇒ one loop
- Safe if it does not change the values used or defined by any statement in either loop (i.e., does not violate dependences)

      do i = 1 to n
          c(i) = a(i) + b(i)
      end
      do j = 1 to n
          d(j) = a(j) * e(j)
      end

  becomes (fuse)

      do i = 1 to n
          c(i) = a(i) + b(i)
          d(i) = a(i) * e(i)
      end

- For big arrays, a(j) may no longer be in the cache when the second loop reads it; after fusion, a(i) will be found in the cache

Loop Fusion Advantages
- Enhances temporal locality
- Reduces control overhead
- Longer blocks for local optimization & scheduling
- Can convert inter-loop reuse to intra-loop reuse

Loop Fusion of Parallel Loops
- Fusion of parallel loops is legal if all dependences are loop independent, i.e., the source and target of each flow dependence map to the same loop iteration
- Each iteration of the fused loop can still execute in parallel

Loop distribution (fission)
- Single loop with independent statements ⇒ multiple loops
- Starts by constructing a statement-level dependence graph
- Safe to perform distribution if:
  - There are no cycles in the dependence graph
  - Statements forming a cycle in the dependence graph are put in the same loop

Loop distribution (fission): example
- The following loop (its statement-level dependence graph appears as a figure in the original slides, not included in this preview):

      for I = 1 to N do
          A[I] = A[I] + B[I-1]
          B[I] = C[I-1]*X + C
          C[I] = 1/B[I]
          D[I] = sqrt(C[I])
      endfor

  becomes (fission)

      for I = 1 to N do
          A[I] = A[I] + B[I-1]
      endfor
      for I = 1 to N do
          B[I] = C[I-1]*X + C
          C[I] = 1/B[I]
      endfor
      for I = 1 to N do
          D[I] = sqrt(C[I])
      endfor

Loop Fission Advantages
- Enables other transformations, e.g., vectorization
- Resulting loops have smaller cache footprints
  - More reuse hits in the cache

Loop Tiling (blocking)
- Want to exploit temporal locality in a loop nest
- (The original slides illustrate tiling with a sequence of iteration-space figures, not included in this preview)

Loop Tiling Effects
- Reduces the volume of data touched between reuses
- Works on one "tile" at a time (tile size is B by B)
- The choice of tile size is crucial
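Since the tiling figures are not part of this preview, the following is a minimal C sketch of the idea (my own illustration, not taken from the slides): a blocked matrix multiply in which the tile size B is a hypothetical tuning parameter chosen so that a B-by-B tile fits in the cache.

      #include <stdio.h>

      #define N 512
      #define B 64    /* tile size: hypothetical value, must be tuned per machine */

      static double a[N][N], b[N][N], c[N][N];

      int main(void) {
          /* Initialize the operands with arbitrary values. */
          for (int i = 0; i < N; i++)
              for (int j = 0; j < N; j++) {
                  a[i][j] = (double)(i + j);
                  b[i][j] = (double)(i - j);
              }

          /* Tiled (blocked) matrix multiply: c = a * b.
             The outer three loops walk B-by-B tiles; the inner three loops
             work entirely within one tile, so each tile is reused many times
             while it is still resident in the cache. (N is a multiple of B.) */
          for (int ii = 0; ii < N; ii += B)
              for (int jj = 0; jj < N; jj += B)
                  for (int kk = 0; kk < N; kk += B)
                      for (int i = ii; i < ii + B; i++)
                          for (int k = kk; k < kk + B; k++)
                              for (int j = jj; j < jj + B; j++)
                                  c[i][j] += a[i][k] * b[k][j];

          printf("%f\n", c[0][0]);
          return 0;
      }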
Scalar Replacement
- Register allocators never keep a subscripted value such as a(i) in a register
- We can trick the allocator by rewriting the references
- The plan:
  - Locate patterns of consistent reuse
  - Make the loads and stores use a temporary scalar variable
  - Replace the references with the temporary's name

Scalar Replacement: example

      do i = 1 to n
          do j = 1 to n
              a(i) = a(i) + b(j)
          end
      end

  becomes (scalar replacement)

      do i = 1 to n
          t = a(i)
          do j = 1 to n
              t = t + b(j)
          end
          a(i) = t
      end

- Almost any register allocator can get t into a register

Scalar Replacement Effects
- Decreases the number of loads and stores
- Keeps reused values in names that can be allocated to registers
- In essence, this exposes the reuse of a(i) to subsequent passes
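For comparison with the Fortran-style slides, here is the same transformation rendered as a small C program (a sketch, not from the slides; the function names are invented for illustration). The inner loop accumulates into the local scalar t, which almost any register allocator will keep in a register, so a[i] is loaded once and stored once per outer iteration instead of on every inner iteration.

      #include <stdio.h>

      #define N 1000

      static double a[N], b[N];

      /* Original form: a[i] is loaded and stored on every inner iteration,
         and an allocator will not keep the subscripted value in a register. */
      void sum_rows_original(void) {
          for (int i = 0; i < N; i++)
              for (int j = 0; j < N; j++)
                  a[i] = a[i] + b[j];
      }

      /* After scalar replacement: the reused value lives in the scalar t. */
      void sum_rows_scalar_replaced(void) {
          for (int i = 0; i < N; i++) {
              double t = a[i];
              for (int j = 0; j < N; j++)
                  t = t + b[j];
              a[i] = t;
          }
      }

      int main(void) {
          for (int j = 0; j < N; j++) b[j] = 1.0;
          sum_rows_scalar_replaced();
          printf("%f\n", a[0]);
          return 0;
      }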

