Code OptimizationOctober 3, 2000Topics• Machine-Independent Optimizations–Code motion–Reduction in strength–Common subexpression sharing• Machine-Dependent Optimizations–Pointer code–Unrolling–Enabling instruction level parallelism• Advice15-213class11.pptCS 213 F’00– 2 –class11.pptGreat Reality #4There’s more to performance than asymptoticcomplexityConstant factors matter too!• easily see 10:1 performance range depending on how code is written• must optimize at multiple levels:–algorithm, data representations, procedures, and loopsMust understand system to optimize performance• how programs are compiled and executed• how to measure program performance and identify bottlenecks• how to improve performance without destroying code modularityand generalityCS 213 F’00– 3 –class11.pptOptimizing CompilersProvide efficient mapping of program to machine• register allocation• code selection and ordering• eliminating minor inefficienciesDon’t (usually) improve asymptotic efficiency• up to programmer to select best overall algorithm• big-O savings are (often) more important than constant factors–but constant factors also matterHave difficulty overcoming “optimization blockers”• potential memory aliasing• potential procedure side-effectsCS 213 F’00– 4 –class11.pptLimitations of Optimizing CompilersOperate Under Fundamental Constraint• Must not cause any change in program behavior under any possiblecondition• Often prevents it from making optimizations when would only affectbehavior under pathological conditions.Behavior that may be obvious to the programmer canbe obfuscated by languages and coding styles• e.g., data ranges may be more limited than variable types suggest–e.g., using an “int” in C for what could be an enumerated typeMost analysis is performed only within procedures• whole-program analysis is too expensive in most casesMost analysis is based only on static information• compiler has difficulty anticipating run-time inputsWhen in doubt, the compiler must be conservativeCS 213 F’00– 5 –class11.pptMachine-Independent Optimizations• Optimizations you should do regardless of processor / compilerCode Motion• Reduce frequency with which computation performed–If it will always produce same result–Especially moving code out of loopfor (i = 0; i < n; i++) for (j = 0; j < n; j++) a[n*i + j] = b[j];for (i = 0; i < n; i++) { int ni = n*i; for (j = 0; j < n; j++) a[ni + j] = b[j];}CS 213 F’00– 6 –class11.pptMachine-Independent Opts. (Cont.)Reductions in Strength:• Replace costly operation with simpler one• Shift, add instead of multiply or divide16*x --> x << 4–Utility machine dependent–Depends on cost of multiply or divide instruction–On Pentium II or III, integer multiply only requires 4 CPU cycles• Keep data in registers rather than memory–Compilers have trouble making this optimizationfor (i = 0; i < n; i++) for (j = 0; j < n; j++) a[n*i + j] = b[j];int ni = 0;for (i = 0; i < n; i++) { for (j = 0; j < n; j++) a[ni + j] = b[j]; ni += n;}CS 213 F’00– 7 –class11.pptMachine-Independent Opts. (Cont.)Share Common Subexpressions• Reuse portions of expressions• Compilers often not very sophisticated in exploiting arithmeticproperties/* Sum neighbors of i,j */up = val[(i-1)*n + j];down = val[(i+1)*n + j];left = val[i*n + j-1];right = val[i*n + j+1];sum = up + down + left + right;int inj = i*n + j;up = val[inj - n];down = val[inj + n];left = val[inj - 1];right = val[inj + 1];sum = up + down + left + right;3 multiplications: i*n, (i–1)*n, (i+1)*n 1 multiplication: i*nCS 213 F’00– 8 –class11.pptImportant ToolsMeasurement• Accurately compute time taken by code–Most modern machines have built in cycle counters• Profile procedure calling frequencies–Unix tool gprof• Custom-built tools–E.g., L4 cache simulatorObservation• Generating assembly code–Lets you see what optimizations compiler can make–Understand capabilities/limitations of particular compilerCS 213 F’00– 9 –class11.pptOptimization ExampleProcedure• Compute sum of all elements of integer vector• Store result at destination location• Vector data structure and operations defined via abstract data typePentium II/III Performance: Clock Cycles / Element• 40.3 (Compiled -g) 28.6 (Compiled -O2)void combine1(vec_ptr v, int *dest){ int i; *dest = 0; for (i = 0; i < vec_length(v); i++) { int val; get_vec_element(v, i, &val); *dest += val; }}CS 213 F’00– 10 –class11.pptVector ADTProceduresvec_ptr new_vec(int len)–Create vector of specified lengthint get_vec_element(vec_ptr v, int index, int *dest)–Retrieve vector element, store at *dest–Return 0 if out of bounds, 1 if successfulint *get_vec_start(vec_ptr v)–Return pointer to start of vector data• Similar to array implementations in Pascal, ML, Java–E.g., always do bounds checkinglengthdata • • •0 1 2 length–1CS 213 F’00– 11 –class11.pptUnderstanding LoopInefficiency• Procedure vec_length called every iteration• Even though result always the samevoid combine1-goto(vec_ptr v, int *dest){ int i = 0; int val; *dest = 0; if (i >= vec_length(v)) goto done; loop: get_vec_element(v, i, &val); *dest += val; i++; if (i < vec_length(v)) goto loop done:}1 iterationCS 213 F’00– 12 –class11.pptMove vec_length Call Out of LoopOptimization• Move call to vec_length out of inner loop–Value does not change from one iteration to next–Code motion• CPE: 20.2 (Compiled -O2)– vec_length requires only constant time, but significant overheadvoid combine2(vec_ptr v, int *dest){ int i; int len = vec_length(v); *dest = 0; for (i = 0; i < len; i++) { int val; get_vec_element(v, i, &val); *dest += val; }}CS 213 F’00– 13 –class11.pptvoid lower(char *s){ int i; for (i = 0; i < strlen(s); i++) if (s[i] >= 'A' && s[i] <= 'Z') s[i] -= ('A' - 'a');}Code Motion Example #2Procedure to Convert String to Lower CaseOriginal0.00010.01110010000256 512 1024 2048 4096 8192 16384 32768 65536 131072 262144CPU SecondsCPU time quadruples every time double string lengthCS 213 F’00– 14 –class11.pptConvert Loop To Goto Form• strlen executed every iteration• strlen linear in length of string–Must scan string until finds ‘\0’• Overall performance is quadraticvoid lower(char *s){ int i = 0; if (i >= strlen(s)) goto
View Full Document