Code Optimization I September 22, 2006Harsh RealityOptimizing CompilersLimitations of Optimizing CompilersMachine-Independent OptimizationsCompiler-Generated Code MotionReduction in StrengthShare Common SubexpressionsOptimization Blocker: Procedure CallsMemory MattersMemory AliasingRemoving AliasingUnaliased VersionOptimization Blocker: Memory AliasingMachine-Independent Opt. SummaryCode Optimization ISeptember 22, 2006Code Optimization ISeptember 22, 2006TopicsTopicsMachine-Independent OptimizationsBasic optimizationsOptimization blockersclass08.ppt15-213“The course that gives CMU its Zip!”– 2 –15-213, F’06Harsh RealityHarsh RealityThere’s more to performance than asymptotic complexityThere’s more to performance than asymptotic complexityConstant factors matter too!Constant factors matter too!Easily see 10:1 performance range depending on how code is writtenMust optimize at multiple levels: algorithm, data representations, procedures, and loopsMust understand system to optimize performanceMust understand system to optimize performanceHow programs are compiled and executedHow to measure program performance and identify bottlenecksHow to improve performance without destroying code modularity and generality– 3 –15-213, F’06Optimizing CompilersOptimizing CompilersProvide efficient mapping of program to machineProvide efficient mapping of program to machineregister allocationcode selection and ordering (scheduling)dead code eliminationeliminating minor inefficienciesDon’t (usually) improve asymptotic efficiencyDon’t (usually) improve asymptotic efficiencyup to programmer to select best overall algorithmbig-O savings are (often) more important than constant factorsbut constant factors also matterHave difficulty overcoming “optimization blockers”Have difficulty overcoming “optimization blockers”potential memory aliasingpotential procedure side-effects– 4 –15-213, F’06Limitations of Optimizing CompilersLimitations of Optimizing CompilersOperate under fundamental constraintOperate under fundamental constraintMust not cause any change in program behavior under any possible conditionOften prevents it from making optimizations when would only affect behavior under pathological conditions.Behavior that may be obvious to the programmer can be Behavior that may be obvious to the programmer can be obfuscated by languages and coding stylesobfuscated by languages and coding stylese.g., Data ranges may be more limited than variable types suggestMost analysis is performed only within proceduresMost analysis is performed only within proceduresWhole-program analysis is too expensive in most casesMost analysis is based only on Most analysis is based only on staticstatic information informationCompiler has difficulty anticipating run-time inputsWhen in doubt, the compiler must be conservativeWhen in doubt, the compiler must be conservative– 5 –15-213, F’06Machine-Independent OptimizationsMachine-Independent OptimizationsOptimizations that you or the compiler should do Optimizations that you or the compiler should do regardless of processor / compilerregardless of processor / compilerCode MotionCode MotionReduce frequency with which computation performedIf it will always produce same resultEspecially moving code out of loop long j; int ni = n*i; for (j = 0; j < n; j++)a[ni+j] = b[j];void set_row(double *a, double *b, long i, long n){ long j; for (j = 0; j < n; j++)a[n*i+j] = b[j];}– 6 –15-213, F’06Compiler-Generated Code MotionCompiler-Generated Code Motionset_row:xorl %r8d, %r8d # j = 0cmpq %rcx, %r8 # j:njge .L7 # if >= goto donemovq %rcx, %rax # nimulq %rdx, %rax # n*i outside of inner loopleaq (%rdi,%rax,8), %rdx # rowp = A + n*i*8.L5: # loop:movq (%rsi,%r8,8), %rax # t = b[j]incq %r8 # j++movq %rax, (%rdx) # *rowp = taddq $8, %rdx # rowp++cmpq %rcx, %r8 # j:njl .L5 # if < goot loop.L7: # done:rep ; ret # return long j; long ni = n*i; double *rowp = a+ni; for (j = 0; j < n; j++)*rowp++ = b[j];void set_row(double *a, double *b, long i, long n){ long j; for (j = 0; j < n; j++)a[n*i+j] = b[j];}Where are the FP operations?– 7 –15-213, F’06Reduction in StrengthReduction in StrengthReplace costly operation with simpler oneShift, add instead of multiply or divide16*x --> x << 4Utility machine dependentDepends on cost of multiply or divide instructionOn Pentium IV, integer multiply requires 10 CPU cyclesRecognize sequence of productsfor (i = 0; i < n; i++) for (j = 0; j < n; j++) a[n*i + j] = b[j];int ni = 0;for (i = 0; i < n; i++) { for (j = 0; j < n; j++) a[ni + j] = b[j]; ni += n;}– 8 –15-213, F’06Share Common SubexpressionsShare Common SubexpressionsReuse portions of expressionsCompilers often not very sophisticated in exploiting arithmetic properties/* Sum neighbors of i,j */up = val[(i-1)*n + j ];down = val[(i+1)*n + j ];left = val[i*n + j-1];right = val[i*n + j+1];sum = up + down + left + right;int inj = i*n + j;up = val[inj - n];down = val[inj + n];left = val[inj - 1];right = val[inj + 1];sum = up + down + left + right;3 multiplications: i*n, (i–1)*n, (i+1)*n 1 multiplication: i*nleaq 1(%rsi), %rax # i+1leaq -1(%rsi), %r8 # i-1imulq %rcx, %rsi # i*nimulq %rcx, %rax # (i+1)*nimulq %rcx, %r8 # (i-1)*naddq %rdx, %rsi # i*n+jaddq %rdx, %rax # (i+1)*n+jaddq %rdx, %r8 # (i-1)*n+jimulq %rcx, %rsi # i*naddq %rdx, %rsi # i*n+jmovq %rsi, %rax # i*n+jsubq %rcx, %rax # i*n+j-nleaq (%rsi,%rcx), %rcx # i*n+j+n– 15 –15-213, F’06Optimization Blocker: Procedure CallsOptimization Blocker: Procedure CallsWhy couldn’t compiler move Why couldn’t compiler move strlenstrlen out of inner loop? out of inner loop?Procedure may have side effectsAlters global state each time calledFunction may not return same value for given argumentsDepends on other parts of global stateProcedure lower could interact with strlenWarning:Warning:Compiler treats procedure call as a black boxWeak optimizations near themRemedies:Remedies:Use of inline functionsDo your own code motionint lencnt = 0;size_t strlen(const char
View Full Document