Code Optimization I Nov. 25, 2008Harsh RealityOptimizing CompilersLimitations of Optimizing CompilersMachine-Independent OptimizationsCompiler-Generated Code MotionReduction in StrengthShare Common SubexpressionsOptimization Blocker #1: Procedure CallsLower Case Conversion PerformanceConvert Loop To Goto FormCalling StrlenImproving PerformanceSlide 14Optimization Blocker: Procedure CallsMemory MattersMemory AliasingRemoving AliasingUnaliased VersionOptimization Blocker: Memory AliasingMachine-Independent Opt. SummaryCode Optimization INov. 25, 2008Code Optimization INov. 25, 2008TopicsTopicsMachine-Independent OptimizationsBasic optimizationsOptimization blockersclass25.ppt15-213“The course that gives CMU its Zip!”– 2 –15-213, F’08Harsh RealityHarsh RealityThere’s more to performance than asymptotic complexityThere’s more to performance than asymptotic complexityConstant factors matter too!Constant factors matter too!Easily see 10:1 performance range depending on how code is writtenMust optimize at multiple levels: algorithm, data representations, procedures, and loopsMust understand system to optimize performanceMust understand system to optimize performanceHow programs are compiled and executedHow to measure program performance and identify bottlenecksHow to improve performance without destroying code modularity and generality– 3 –15-213, F’08Optimizing CompilersOptimizing CompilersProvide efficient mapping of program to machineProvide efficient mapping of program to machineregister allocationcode selection and ordering (scheduling)dead code eliminationeliminating minor inefficienciesDon’t (usually) improve asymptotic efficiencyDon’t (usually) improve asymptotic efficiencyup to programmer to select best overall algorithmbig-O savings are (often) more important than constant factorsbut constant factors also matterHave difficulty overcoming “optimization blockers”Have difficulty overcoming “optimization blockers”potential memory aliasingpotential procedure side-effects– 4 –15-213, F’08Limitations of Optimizing CompilersLimitations of Optimizing CompilersOperate under fundamental constraintOperate under fundamental constraintMust not cause any change in program behavior under any possible conditionOften prevents it from making optimizations when would only affect behavior under pathological conditions.Behavior that may be obvious to the programmer can be Behavior that may be obvious to the programmer can be obfuscated by languages and coding stylesobfuscated by languages and coding stylese.g., Data ranges may be more limited than variable types suggestMost analysis is performed only within proceduresMost analysis is performed only within proceduresWhole-program analysis is too expensive in most casesMost analysis is based only on Most analysis is based only on staticstatic information informationCompiler has difficulty anticipating run-time inputsWhen in doubt, the compiler must be conservativeWhen in doubt, the compiler must be conservative– 5 –15-213, F’08Machine-Independent OptimizationsMachine-Independent OptimizationsOptimizations that you or the compiler should do Optimizations that you or the compiler should do regardless of processor / compilerregardless of processor / compilerCode MotionCode MotionReduce frequency with which computation performedIf it will always produce same resultEspecially moving code out of loop long j; int ni = n*i; for (j = 0; j < n; j++)a[ni+j] = b[j];void set_row(double *a, double *b, long i, long n){ long j; for (j = 0; j < n; j++)a[n*i+j] = b[j];}– 6 –15-213, F’08Compiler-Generated Code MotionCompiler-Generated Code Motionset_row:xorl %r8d, %r8d # j = 0cmpq %rcx, %r8 # j:njge .L7 # if >= goto donemovq %rcx, %rax # nimulq %rdx, %rax # n*i outside of inner loopleaq (%rdi,%rax,8), %rdx # rowp = A + n*i*8.L5: # loop:movq (%rsi,%r8,8), %rax # t = b[j]incq %r8 # j++movq %rax, (%rdx) # *rowp = taddq $8, %rdx # rowp++cmpq %rcx, %r8 # j:njl .L5 # if < goot loop.L7: # done:rep ; ret # return long j; long ni = n*i; double *rowp = a+ni; for (j = 0; j < n; j++)*rowp++ = b[j];void set_row(double *a, double *b, long i, long n){ long j; for (j = 0; j < n; j++)a[n*i+j] = b[j];}Where are the FP operations?– 7 –15-213, F’08Reduction in StrengthReduction in StrengthReplace costly operation with simpler oneShift, add instead of multiply or divide16*x --> x << 4Utility machine dependentDepends on cost of multiply or divide instruction»On Pentium IV, integer multiply requires 10 CPU cycles»On Core 2, requires 3 cyclesRecognize sequence of productsfor (i = 0; i < n; i++) for (j = 0; j < n; j++) a[n*i + j] = b[j];int ni = 0;for (i = 0; i < n; i++) { for (j = 0; j < n; j++) a[ni + j] = b[j]; ni += n;}– 8 –15-213, F’08Share Common SubexpressionsShare Common SubexpressionsReuse portions of expressionsCompilers often not very sophisticated in exploiting arithmetic properties/* Sum neighbors of i,j */up = val[(i-1)*n + j ];down = val[(i+1)*n + j ];left = val[i*n + j-1];right = val[i*n + j+1];sum = up + down + left + right;int inj = i*n + j;up = val[inj - n];down = val[inj + n];left = val[inj - 1];right = val[inj + 1];sum = up + down + left + right;3 multiplications: i*n, (i–1)*n, (i+1)*n 1 multiplication: i*nleaq 1(%rsi), %rax # i+1leaq -1(%rsi), %r8 # i-1imulq %rcx, %rsi # i*nimulq %rcx, %rax # (i+1)*nimulq %rcx, %r8 # (i-1)*naddq %rdx, %rsi # i*n+jaddq %rdx, %rax # (i+1)*n+jaddq %rdx, %r8 # (i-1)*n+jimulq %rcx, %rsi # i*naddq %rdx, %rsi # i*n+jmovq %rsi, %rax # i*n+jsubq %rcx, %rax # i*n+j-nleaq (%rsi,%rcx), %rcx # i*n+j+n– 9 –15-213, F’08void lower(char *s){ int i; for (i = 0; i < strlen(s); i++) if (s[i] >= 'A' && s[i] <= 'Z') s[i] -= ('A' - 'a');}Optimization Blocker #1: Procedure CallsOptimization Blocker #1: Procedure CallsProcedure to Convert String to Lower CaseProcedure to Convert String to Lower CaseExtracted from 213 lab submissions, Fall, 1998– 10 –15-213, F’08Lower Case Conversion PerformanceLower Case Conversion PerformanceTime quadruples when
View Full Document