Rice CAAM 471 - Software Optimization

Unformatted text preview:

Software OptimizationPowerPoint PresentationSlide 3Slide 4Slide 5Slide 6Slide 7Slide 8Slide 9Slide 10Slide 11Slide 12Slide 13Slide 14Slide 15Slide 16Slide 17Slide 18Slide 19Slide 20Slide 21EECE 360 – Prof. SchamilogluThe University of New MexicoSoftware OptimizationVikas, ChaudharyMA 471EECE 360 – Prof. SchamilogluThe University of New MexicoHigh-level codeIntermediate code Object code ExecutableAssembly CodeAssemblerSteps to create an executableMA 471EECE 360 – Prof. SchamilogluThe University of New MexicoHigher OptimizationsProcedure within basic blocks.Procedure within single and nested loop structures.Entire procedure including all blocks and structures.File (inter-procedural analysis within a source file)Cross file (inter-procedural analysis across all procedures)MA 471EECE 360 – Prof. SchamilogluThe University of New MexicoCompiler OptionsThere are no strict rules about what each level of optimization means but generally O0 does one to many translations.O1 does basic block optimizations.O2 does loop optimizations.O4 does interfile optimizations.Some compilers also provide +odataprefetch to indicate that prefetch instructions should be inserted to prefetch data from memory to cache. MA 471EECE 360 – Prof. SchamilogluThe University of New MexicoIncreasing Register Pressure When too many registers are needed, compilers must store values to memory and restores values from memory. This degrades the performance. If we generate assembly code from compiler via –S and see that there is an inordinate number of load and store instructions then it is implied that compiler is generating too many spills. Use register data type carefully. MA 471EECE 360 – Prof. SchamilogluThe University of New MexicoDead Code EliminationDead code Elimination is merely the removal of code that is never used. i=0If (i!=0) deadcode(i); MA 471EECE 360 – Prof. SchamilogluThe University of New MexicoConstant Folding and Propagation Constant folding is when expressions with multiple constants are folded together and evaluated at compile time.a = 1+ 2 Will be replaced by a = 3.MA 471EECE 360 – Prof. SchamilogluThe University of New MexicoMA 471Common Subexpression eliminationCommon subexpression elimination analyzes lines of code, determines where identical subexpressions are used and creates a temporary variable to hold one instance of these values. a = b + (c + d)f = e + (c + d)EECE 360 – Prof. SchamilogluThe University of New MexicoMA 471Strength ReductionsStrength reduction means replacing expensive operations with cheaper ones.Replacing integer multiplication or division by constants with shift operations.Replacing 32-bit integer division by 64-bit floating point division.Replacing floating point multiplications by small constants with floating point additions. Replacing power function by floating point multiplications.EECE 360 – Prof. SchamilogluThe University of New MexicoMA 471Filling Branch Delay Slots Branch delay slots are the instructions after a branch that are always executed.If the compiler is used with no optimization, it will probably insert a nop into branch delay slot.EECE 360 – Prof. SchamilogluThe University of New MexicoMA 471Induction Variable Optimizationfor (i=0 ; i< n ; i +=2) ia[i] = i * k + m;Where i is induction variable.The above code can be replaced by ic = mfor (i = 0; i< n ; i += 2) { ia[i] = ic; ic = ic + k;}EECE 360 – Prof. SchamilogluThe University of New MexicoLoop Fission This technique is often used when an inner loop consists of a large number of lines and the compiler has difficulty generating code without spilling. This technique is also helpful in improving cache performance. for(i = 0; i < n; i++) Y[i] = y[i] + x[i] + x[i+m] Suppose x[i] and x[i +m] maps to same cache location. (Direct mapped cache). This will cause cache thrashing. MA 471EECE 360 – Prof. SchamilogluThe University of New MexicoLoop can be split asfor(i = 0; i < n; i++) y[i] = y[i] + x[i];for(i = 0; i < n; i++) y[i] = y[i] + x[i + m]; This technique might not be very useful when cache is n-way set associative. MA 471EECE 360 – Prof. SchamilogluThe University of New Mexico Loop Unrolling This technique reduces the effect of branches, instruction latency, and potentially the number of cache misses. Do I = 1, N Y(I) = X(I) ENDDOAfter UnrollingNEND = 4 * (N/4) Do I = I, N , 4 Y(I) = X(I) Y(I + 1) = X(I + 1) Y(I + 1) = X(I + 1)ENDDO Do I = NEND+1 , N Y(I) = X(I) ENDDO MA 471EECE 360 – Prof. SchamilogluThe University of New Mexico•Loading all the values of X before the values of Y reduces the possibility of cache thrashing.•Amount of unrolling can decrease the number of software prefetch instructions.•Excessive unrolling will cause data to be spilled from register to memory.•Unrolling increases size of object code, which might cause too many instruction cache misses. MA 471EECE 360 – Prof. SchamilogluThe University of New MexicoClock Cycles in an Unrolled LoopOriginal order CC Modified order CCLoad X(1) 1 Load X(1) 1Store Y(1) 7 Load X(2) 2Load X(2) 8 Load X(3) 3Store X(2) 14 Load X(4) 4Load X(3) 15 Store X(1) 7Store X(3) 21 Store X(2) 8Load X(4) 22 Store X(3) 9Store X(4) 28 Store X(4) 10MA 471EECE 360 – Prof. SchamilogluThe University of New Mexico Loop peeling This technique is used by compilers to handle boundary conditions. Do I = 1, N if(I .EQ. 1) then X[I] = 0 ELSEIF (I. EQ. N) THEN X(I) = N ELSE X(I) = X (I) + Y(I) ENDDO AFTER LOOP PEELING X(1) = 0 Do I = 2, N-1 X(I) = X(I) + Y(I) ENDDO X(N) = NMA 471EECE 360 – Prof. SchamilogluThe University of New Mexico Software Pipelining Software pipelining is a technique for recognizing loops such that each iteration in the software-pipelined code is made from instructions chosen from different iterations of the original loop. Iteration 0Iteration 1Iteration 2Iteration 3MA 471EECE 360 – Prof. SchamilogluThe University of New Mexico Software pipeline is an optimization that is impossible to duplicate with high level code since the multiple assembly language instruction that a single line of high level language creates are


View Full Document

Rice CAAM 471 - Software Optimization

Download Software Optimization
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Software Optimization and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Software Optimization 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?