UT CS 429H - Code Optimization I- Machine Independent Optimizations

Unformatted text preview:

Code Optimization I:Machine Independent OptimizationsTopicsTopics Machine-Independent Optimizations Code motion Reduction in strength Common subexpression sharing Tuning Identifying performance bottlenecksSystems I2Great RealityThereThereʼʼs more to performance than asymptotics more to performance than asymptoticcomplexitycomplexityConstant factors matter too!Constant factors matter too! Easily see 10:1 performance range depending on how codeis written Must optimize at multiple levels: algorithm, data representations, procedures, and loopsMust understand system to optimize performanceMust understand system to optimize performance How programs are compiled and executed How to measure program performance and identifybottlenecks How to improve performance without destroying codemodularity and generality3Optimizing CompilersProvide efficient mapping of program to machineProvide efficient mapping of program to machine register allocation code selection and ordering eliminating minor inefficienciesDonDonʼʼt (usually) improve asymptotic efficiencyt (usually) improve asymptotic efficiency up to programmer to select best overall algorithm big-O savings are (often) more important than constantfactors but constant factors also matterHave difficulty overcoming Have difficulty overcoming ““optimization blockersoptimization blockers”” potential memory aliasing potential procedure side-effects4Limitations of Optimizing CompilersOperate Under Fundamental ConstraintOperate Under Fundamental Constraint Must not cause any change in program behavior under anypossible condition Often prevents it from making optimizations when would only affectbehavior under pathological conditions.Behavior that may be obvious to the programmer can beBehavior that may be obvious to the programmer can beobfuscated by languages and coding stylesobfuscated by languages and coding styles e.g., data ranges may be more limited than variable types suggestMost analysis is performed only within proceduresMost analysis is performed only within procedures whole-program analysis is too expensive in most casesMost analysis is based only on Most analysis is based only on staticstatic information information compiler has difficulty anticipating run-time inputsWhen in doubt, the compiler must be conservativeWhen in doubt, the compiler must be conservative5Machine-Independent Optimizations Optimizations you should do regardless of processor /compilerCode MotionCode Motion Reduce frequency with which computation performed If it will always produce same result Especially moving code out of loopfor (i = 0; i < n; i++) for (j = 0; j < n; j++) a[n*i + j] = b[j];for (i = 0; i < n; i++) { int ni = n*i; for (j = 0; j < n; j++) a[ni + j] = b[j];}6Compiler-Generated Code Motion Most compilers do a good job with array code + simple loopstructuresCode Generated by GCCCode Generated by GCCfor (i = 0; i < n; i++) for (j = 0; j < n; j++) a[n*i + j] = b[j]; imull %ebx,%eax # i*n movl 8(%ebp),%edi # a leal (%edi,%eax,4),%edx # p = a+i*n (scaled by 4)# Inner Loop movl 12(%ebp),%edi # b.L40: movl (%edi,%ecx,4),%eax # b+j (scaled by 4) movl %eax,(%edx) # *p = b[j] addl $4,%edx # p++ (scaled by 4) incl %ecx # j++ cmpl %ebx,%ecx # loop if j<n jl .L40for (i = 0; i < n; i++) { int ni = n*i; int *p = a+ni; for (j = 0; j < n; j++) *p++ = b[j];}7Reduction in Strength Replace costly operation with simpler one Shift, add instead of multiply or divide16*x --> x << 4 Utility machine dependent Depends on cost of multiply or divide instruction On Pentium II or III, integer multiply only requires 4 CPU cycles Recognize sequence of productsfor (i = 0; i < n; i++) for (j = 0; j < n; j++) a[n*i + j] = b[j];int ni = 0;for (i = 0; i < n; i++) { for (j = 0; j < n; j++) a[ni + j] = b[j]; ni += n;}8Make Use of Registers Reading and writing registers much faster thanreading/writing memoryLimitationLimitation Compiler not always able to determine whether variable canbe held in register Possibility of Aliasing See example later9Machine-Independent Opts. (Cont.)Share Common Share Common SubexpressionsSubexpressions Reuse portions of expressions Compilers often not very sophisticated in exploitingarithmetic properties/* Sum neighbors of i,j */up = val[(i-1)*n + j];down = val[(i+1)*n + j];left = val[i*n + j-1];right = val[i*n + j+1];sum = up + down + left + right;int inj = i*n + j;up = val[inj - n];down = val[inj + n];left = val[inj - 1];right = val[inj + 1];sum = up + down + left + right;3 multiplications: i*n, (i–1)*n, (i+1)*n 1 multiplication: i*n leal -1(%edx),%ecx # i-1 imull %ebx,%ecx # (i-1)*n leal 1(%edx),%eax # i+1 imull %ebx,%eax # (i+1)*n imull %ebx,%edx # i*n10Time ScalesAbsolute TimeAbsolute Time Typically use nanoseconds 10–9 seconds Time scale of computer instructionsClock CyclesClock Cycles Most computers controlled by high frequency clock signal Typical Range 100 MHz» 108 cycles per second» Clock period = 10ns 2 GHz» 2 X 109 cycles per second» Clock period = 0.5ns11Example of PerformanceMeasurementLoop unrollingLoop unrolling Assume even number of elementsvoid vsum1(int n) { int i; for(i=0; i<n; i++) c[i] = a[i] + b[i];}void vsum2(int n) { int i; for(i=0; i<n; i+=2) { c[i] = a[i] + b[i]; c[i+1] = a[i+1] + b[i+1];}12Cycles Per Element Convenient way to express performance of program thatoperators on vectors or lists Length = n T = CPE*n + Overhead010020030040050060070080090010000 50 100 150 200ElementsCyclesvsum1Slope = 4.0 vsum2Slope = 3.513void lower(char *s){ int i; for (i = 0; i < strlen(s); i++) if (s[i] >= 'A' && s[i] <= 'Z') s[i] -= ('A' - 'a');}Code Motion ExampleProcedure to Convert String to Lower CaseProcedure to Convert String to Lower Case14Lower Case Conversion Performance Time quadruples when string length doubles Quadratic performancelower10.00010.0010.010.11101001000256 512 1024 2048 4096 8192 16384 32768 65536 131072 262144String LengthCPU Seconds15Convert Loop To Goto Form strlen executed every iteration strlen linear in length of string Must scan string until finds '\0' Overall performance is quadraticvoid lower(char *s){ int i = 0; if (i >= strlen(s)) goto done; loop: if (s[i] >= 'A' && s[i] <= 'Z') s[i] -= ('A' - 'a'); i++;


View Full Document

UT CS 429H - Code Optimization I- Machine Independent Optimizations

Download Code Optimization I- Machine Independent Optimizations
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Code Optimization I- Machine Independent Optimizations and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Code Optimization I- Machine Independent Optimizations 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?