Code OptimizationSeptember 27, 2001Topics• Machine-Independent Optimizations–Code motion–Reduction in strength–Common subexpression sharing• Tuning–Identifying performance bottlenecks15-213class10.pptCS 213 F’01– 2 –class10.pptGreat Reality #4There’s more to performance than asymptoticcomplexityConstant factors matter too!• easily see 10:1 performance range depending on how code is written• must optimize at multiple levels:–algorithm, data representations, procedures, and loopsMust understand system to optimize performance• how programs are compiled and executed• how to measure program performance and identify bottlenecks• how to improve performance without destroying code modularityand generalityCS 213 F’01– 3 –class10.pptOptimizing CompilersProvide efficient mapping of program to machine• register allocation• code selection and ordering• eliminating minor inefficienciesDon’t (usually) improve asymptotic efficiency• up to programmer to select best overall algorithm• big-O savings are (often) more important than constant factors–but constant factors also matterHave difficulty overcoming “optimization blockers”• potential memory aliasing• potential procedure side-effectsCS 213 F’01– 4 –class10.pptLimitations of Optimizing CompilersOperate Under Fundamental Constraint• Must not cause any change in program behavior under any possiblecondition• Often prevents it from making optimizations when would only affectbehavior under pathological conditions.Behavior that may be obvious to the programmer canbe obfuscated by languages and coding styles• e.g., data ranges may be more limited than variable types suggest–e.g., using an “int” in C for what could be an enumerated typeMost analysis is performed only within procedures• whole-program analysis is too expensive in most casesMost analysis is based only on static information• compiler has difficulty anticipating run-time inputsWhen in doubt, the compiler must be conservativeCS 213 F’01– 5 –class10.pptMachine-Independent Optimizations• Optimizations you should do regardless of processor / compilerCode Motion• Reduce frequency with which computation performed–If it will always produce same result–Especially moving code out of loopfor (i = 0; i < n; i++) for (j = 0; j < n; j++) a[n*i + j] = b[j];for (i = 0; i < n; i++) { int ni = n*i; for (j = 0; j < n; j++) a[ni + j] = b[j];}CS 213 F’01– 6 –class10.pptCompiler-Generated Code Motion• Most compilers do a good job with array code + simple loopstructuresCode Generated by GCCfor (i = 0; i < n; i++) for (j = 0; j < n; j++) a[n*i + j] = b[j]; imull %ebx,%eax # i*n movl 8(%ebp),%edi # a leal (%edi,%eax,4),%edx # p = a+i*n (scaled by 4)# Inner Loop.L40: movl 12(%ebp),%edi # b movl (%edi,%ecx,4),%eax # b+j (scaled by 4) movl %eax,(%edx) # *p = b[j] addl $4,%edx # p++ (scaled by 4) incl %ecx # j++ jl .L40 # loop if j<nfor (i = 0; i < n; i++) { int ni = n*i; int *p = a+ni; for (j = 0; j < n; j++) *p++ = b[j];}CS 213 F’01– 7 –class10.pptReduction in Strength• Replace costly operation with simpler one• Shift, add instead of multiply or divide16*x --> x << 4–Utility machine dependent–Depends on cost of multiply or divide instruction–On Pentium II or III, integer multiply only requires 4 CPU cyclesfor (i = 0; i < n; i++) for (j = 0; j < n; j++) a[n*i + j] = b[j];int ni = 0;for (i = 0; i < n; i++) { for (j = 0; j < n; j++) a[ni + j] = b[j]; ni += n;}CS 213 F’01– 8 –class10.pptMake Use of Registers• Reading and writing registers much faster than reading/writingmemoryLimitation• Compiler not always able to determine whether variable can be heldin register• Possibility of Aliasing• See example laterCS 213 F’01– 9 –class10.pptMachine-Independent Opts. (Cont.)Share Common Subexpressions• Reuse portions of expressions• Compilers often not very sophisticated in exploiting arithmeticproperties/* Sum neighbors of i,j */up = val[(i-1)*n + j];down = val[(i+1)*n + j];left = val[i*n + j-1];right = val[i*n + j+1];sum = up + down + left + right;int inj = i*n + j;up = val[inj - n];down = val[inj + n];left = val[inj - 1];right = val[inj + 1];sum = up + down + left + right;3 multiplications: i*n, (i–1)*n, (i+1)*n 1 multiplication: i*n leal -1(%edx),%ecx # i-1 imull %ebx,%ecx # (i-1)*n leal 1(%edx),%eax # i+1 imull %ebx,%eax # (i+1)*n imull %ebx,%edx # i*nCS 213 F’01– 10 –class10.pptVector ADTProceduresvec_ptr new_vec(int len)–Create vector of specified lengthint get_vec_element(vec_ptr v, int index, int *dest)–Retrieve vector element, store at *dest–Return 0 if out of bounds, 1 if successfulint *get_vec_start(vec_ptr v)–Return pointer to start of vector data• Similar to array implementations in Pascal, ML, Java–E.g., always do bounds checkinglengthdata • • •0 1 2 length–1CS 213 F’01– 11 –class10.pptOptimization ExampleProcedure• Compute sum of all elements of vector• Store result at destination locationvoid combine1(vec_ptr v, int *dest){ int i; *dest = 0; for (i = 0; i < vec_length(v); i++) { int val; get_vec_element(v, i, &val); *dest += val; }}CS 213 F’01– 12 –class10.pptTime ScalesAbsolute Time• Typically use nanoseconds–10–9 seconds• Time scale of computer instructionsClock Cycles• Most computers controlled by high frequency clock signal• Typical Range–100 MHz»108 cycles per second»Clock period = 10ns–2 GHz»2 X 109 cycles per second»Clock period = 0.5ns• Fish machines: 550 MHz (1.8 ns clock period)CS 213 F’01– 13 –class10.pptCycles Per Element• Convenient way to express performance of program that operatorson vectors or lists• Length = n• T = CPE*n + Overhead010020030040050060070080090010000 50 100 150 200ElementsCyclesvsum1Slope = 4.0 vsum2Slope = 3.5CS 213 F’01– 14 –class10.pptOptimization ExampleProcedure• Compute sum of all elements of integer vector• Store result at destination location• Vector data structure and operations defined via abstract data typePentium II/III Performance: Clock Cycles / Element• 42.06 (Compiled -g) 31.25 (Compiled -O2)void combine1(vec_ptr v, int *dest){ int i; *dest = 0; for (i = 0; i < vec_length(v); i++) { int val; get_vec_element(v, i, &val); *dest += val; }}CS 213 F’01– 15 –class10.pptUnderstanding LoopInefficiency• Procedure vec_length
View Full Document