Page 1Code Optimization I:Machine Independent OptimizationsFeb 11, 2003TopicsMachine-Independent OptimizationszCode motionzStrength Reduction/Induction Var ElimzCommon subexpression sharingTuningzIdentifying performance bottlenecksclass10.ppt15-213“The course that gives CMU its Zip!”–2–15-213, S’03Great Reality #4There’s more to performance than asymptotic complexityConstant factors matter too!Easily see 10:1 performance range depending on how code is writtenMust optimize at multiple levels: zalgorithm, data representations, procedures, and loopsMust understand system to optimize performanceHow programs are compiled and executedHow to measure program performance and identify bottlenecksHow to improve performance without destroying code modularity and generality–3–15-213, S’03Optimizing CompilersProvide efficient mapping of program to machineregister allocationcode selection and orderingeliminating minor inefficienciesDon’t (usually) improve asymptotic efficiencyup to programmer to select best overall algorithmbig-O savings are (often) more important than constant factorszbut constant factors also matterHave difficulty overcoming “optimization blockers”potential memory aliasingpotential procedure side-effects–4–15-213, S’03Limitations of Optimizing CompilersOperate under fundamental constraintMust not cause any change in program behavior under any possible conditionOften prevents it from making optimizations when would only affect behavior under pathological conditions.Behavior that may be obvious to the programmer can be obfuscated by languages and coding stylese.g., data ranges may be narrower than var types suggestMost analysis is performed only within procedureswhole-program analysis is too expensive in most casesMost analysis is based only on staticinformationcompiler has difficulty anticipating run-time inputsThe Bottom Line:When in doubt, do nothingi.e., The compiler must be conservative.Page 2–5–15-213, S’03Machine-Independent OptimizationsOptimizations that should be done regardless of processor / compilerCode MotionReduce frequency with which computation performedzIf it will always produce same resultzEspecially moving code out of loopfor (i = 0; i < n; i++)for (j = 0; j < n; j++)a[n*i + j] = b[j];for (i = 0; i < n; i++) {int ni = n*i;for (j = 0; j < n; j++)a[ni + j] = b[j];}–6–15-213, S’03Compiler-Generated Code MotionMost compilers do a good job with array code + simple loop structuresCode Generated by GCCfor (i = 0; i < n; i++)for (j = 0; j < n; j++)a[n*i + j] = b[j];imull %ebx,%eax # i*nmovl 8(%ebp),%edi # aleal (%edi,%eax,4),%edx # p = a+i*n (scaled by 4)# Inner Loop.L40:movl 12(%ebp),%edi # bmovl (%edi,%ecx,4),%eax # b+j (scaled by 4)movl %eax,(%edx) # *p = b[j]addl $4,%edx # p++ (scaled by 4)incl %ecx # j++jl .L40 # loop if j<nfor (i = 0; i < n; i++) {int ni = n*i;int *p = a+ni;for (j = 0; j < n; j++)*p++ = b[j];}–7–15-213, S’03Strength Reduction†Replace costly operation with simpler oneShift, add instead of multiply or divide16*x →→→→ x << 4zUtility machine dependentzDepends on cost of multiply or divide instructionzOn Pentium II or III, integer multiply only requires 4 CPU cyclesRecognize sequence of products (induction var analysis)for (i = 0; i < n; i++)for (j = 0; j < n; j++)a[n*i + j] = b[j];int ni = 0;for (i = 0; i < n; i++) {for (j = 0; j < n; j++)a[ni + j] = b[j];ni += n;}†As a result of Induction Variable Elimination–8–15-213, S’03Make Use of RegistersReading and writing registers much faster than reading/writing memoryLimitationLimited number of registersCompiler cannot always determine whether variable can be held in registerPossibility of AliasingSee example laterPage 3–9–15-213, S’03Machine-Independent Opts. (Cont.)Share Common Subexpressions†Reuse portions of expressionsCompilers often not very sophisticated in exploiting arithmetic properties/* Sum neighbors of i,j */up = val[(i-1)*n + j];down = val[(i+1)*n + j];left = val[i*n + j-1];right = val[i*n + j+1];sum = up + down + left + right;int inj = i*n + j;up = val[inj - n];down = val[inj + n];left = val[inj - 1];right = val[inj + 1];sum = up + down + left + right;3 multiplies: i*n, (i–1)*n, (i+1)*n1 multiply: i*nleal -1(%edx),%ecx# i-1imull %ebx,%ecx # (i-1)*nleal 1(%edx),%eax # i+1imull %ebx,%eax # (i+1)*nimull %ebx,%edx # i*n†AKA: Common Subexpression Elimination (CSE)–10–15-213, S’03Measuring Performance: Time ScalesAbsolute TimeTypically use nanosecondsz10–9secondsTime scale of computer instructionsClock CyclesMost computers controlled by high frequency clock signalTypical Rangez100 MHz »108cycles per second»Clock period = 10nsFish machines: 550 MHz (1.8 ns clock period)z2 GHz »2 X 109cycles per second» Clock period = 0.5ns–11–15-213, S’03Measuring PerformanceFor many programs, cycles per element (CPE)Especially true of programs that work on lists/vectorsTotal time = fixed overhead + CPE * length-of-listvoid vsum1(int n){int i;for (i = 0; i<n; i++)c[i] = a[i] + b[i];}void vsum2(int n){int i;for (i = 0; i<n; i+=2)c[i] = a[i] + b[i];c[i+1] = a[i+1] + b[i+1];}• vsum2 only works on even n.• vsum2 is an example of loop unrolling.–12–15-213, S’03Cycles Per ElementConvenient way to express performance of a program that operates on vectors or listsLength = nT = CPE*n + Overhead010020030040050060070080090010000 50 100 150 200vsum1Slope = 4.0vsum2Slope = 3.5CyclesNumber of ElementsPage 4–13–15-213, S’03Vector ADTProceduresvec_ptr new_vec(int len)zCreate vector of specified lengthint get_vec_element(vec_ptr v, int index, int *dest)zRetrieve vector element, store at *destzReturn 0 if out of bounds, 1 if successfulint *get_vec_start(vec_ptr v)zReturn pointer to start of vector dataint vec_length(v)(vec_ptr v)zReturn length of vectorSimilar to array implementations in Pascal, ML, JavazE.g., always do bounds checkinglengthdata••••••••••••0 1 2 length–1–14–15-213, S’03Optimization ExampleProcedureCompute sum of all elements of vectorStore result at destination locationvoid combine1(vec_ptr v, int *dest){int i;*dest = 0;for (i = 0; i < vec_length(v); i++) {int val;get_vec_element(v, i, &val);*dest += val;}}–15–15-213, S’03Optimization ExampleProcedureCompute sum of all elements of integer vectorStore
View Full Document