Code Optimization I:Machine Independent OptimizationsSept. 26, 2002Code Optimization I:Machine Independent OptimizationsSept. 26, 2002TopicsTopicsn Machine-Independent Optimizationsl Code motionl Reduction in strengthl Common subexpression sharingn Tuningl Identifying performance bottlenecksclass10.ppt15-213“The course that gives CMU its Zip!”– 2 –15-213, F’02Great Reality #4Great Reality #4There’s more to performance than asymptoticThere’s more to performance than asymptoticcomplexitycomplexityConstant factors matter too!Constant factors matter too!n Easily see 10:1 performance range depending on how codeis writtenn Must optimize at multiple levels:l algorithm, data representations, procedures, and loopsMust understand system to optimize performanceMust understand system to optimize performancen How programs are compiled and executedn How to measure program performance and identifybottlenecksn How to improve performance without destroying codemodularity and generality– 3 –15-213, F’02Optimizing CompilersOptimizing CompilersProvide efficient mapping of program to machineProvide efficient mapping of program to machinen register allocationn code selection and orderingn eliminating minor inefficienciesDon’t (usually) improve asymptotic efficiencyDon’t (usually) improve asymptotic efficiencyn up to programmer to select best overall algorithmn big-O savings are (often) more important than constantfactorsl but constant factors also matterHave difficulty overcoming “optimization blockers”Have difficulty overcoming “optimization blockers”n potential memory aliasingn potential procedure side-effects– 4 –15-213, F’02Limitations of Optimizing CompilersLimitations of Optimizing CompilersOperate Under Fundamental ConstraintOperate Under Fundamental Constraintn Must not cause any change in program behavior under anypossible conditionn Often prevents it from making optimizations when would only affectbehavior under pathological conditions.Behavior that may be obvious to the programmer can beBehavior that may be obvious to the programmer can beobfuscated by languages and coding stylesobfuscated by languages and coding stylesn e.g., data ranges may be more limited than variable types suggestMost analysis is performed only within proceduresMost analysis is performed only within proceduresn whole-program analysis is too expensive in most casesMost analysis is based only on Most analysis is based only on staticstatic information informationn compiler has difficulty anticipating run-time inputsWhen in doubt, the compiler must be conservativeWhen in doubt, the compiler must be conservative– 5 –15-213, F’02Machine-Independent OptimizationsMachine-Independent Optimizationsn Optimizations you should do regardless of processor /compilerCode MotionCode Motionn Reduce frequency with which computation performedl If it will always produce same resultl Especially moving code out of loopfor (i = 0; i < n; i++) for (j = 0; j < n; j++) a[n*i + j] = b[j];for (i = 0; i < n; i++) { int ni = n*i; for (j = 0; j < n; j++) a[ni + j] = b[j];}– 6 –15-213, F’02Compiler-Generated Code MotionCompiler-Generated Code Motionn Most compilers do a good job with array code + simple loopstructuresCode Generated by GCCCode Generated by GCCfor (i = 0; i < n; i++) for (j = 0; j < n; j++) a[n*i + j] = b[j]; imull %ebx,%eax # i*n movl 8(%ebp),%edi # a leal (%edi,%eax,4),%edx # p = a+i*n (scaled by 4)# Inner Loop.L40: movl 12(%ebp),%edi # b movl (%edi,%ecx,4),%eax # b+j (scaled by 4) movl %eax,(%edx) # *p = b[j] addl $4,%edx # p++ (scaled by 4) incl %ecx # j++ jl .L40 # loop if j<nfor (i = 0; i < n; i++) { int ni = n*i; int *p = a+ni; for (j = 0; j < n; j++) *p++ = b[j];}– 7 –15-213, F’02Reduction in StrengthReduction in Strengthn Replace costly operation with simpler onen Shift, add instead of multiply or divide16*x --> x << 4l Utility machine dependentl Depends on cost of multiply or divide instructionl On Pentium II or III, integer multiply only requires 4 CPU cyclesn Recognize sequence of productsfor (i = 0; i < n; i++) for (j = 0; j < n; j++) a[n*i + j] = b[j];int ni = 0;for (i = 0; i < n; i++) { for (j = 0; j < n; j++) a[ni + j] = b[j]; ni += n;}– 8 –15-213, F’02Make Use of RegistersMake Use of Registersn Reading and writing registers much faster thanreading/writing memoryLimitationLimitationn Compiler not always able to determine whether variable canbe held in registern Possibility of Aliasingn See example later– 9 –15-213, F’02Machine-Independent Opts. (Cont.)Machine-Independent Opts. (Cont.)Share Common Share Common SubexpressionsSubexpressionsn Reuse portions of expressionsn Compilers often not very sophisticated in exploitingarithmetic properties/* Sum neighbors of i,j */up = val[(i-1)*n + j];down = val[(i+1)*n + j];left = val[i*n + j-1];right = val[i*n + j+1];sum = up + down + left + right;int inj = i*n + j;up = val[inj - n];down = val[inj + n];left = val[inj - 1];right = val[inj + 1];sum = up + down + left + right;3 multiplications: i*n, (i–1)*n, (i+1)*n 1 multiplication: i*n leal -1(%edx),%ecx # i-1 imull %ebx,%ecx # (i-1)*n leal 1(%edx),%eax # i+1 imull %ebx,%eax # (i+1)*n imull %ebx,%edx # i*n– 10 –15-213, F’02Vector ADTVector ADTProceduresProceduresvec_ptr new_vec(int len)l Create vector of specified lengthint get_vec_element(vec_ptr v, int index, int *dest)l Retrieve vector element, store at *destl Return 0 if out of bounds, 1 if successfulint *get_vec_start(vec_ptr v)l Return pointer to start of vector datan Similar to array implementations in Pascal, ML, Javal E.g., always do bounds checkinglengthdata • • •0 1 2 length–1– 11 –15-213, F’02Optimization ExampleOptimization ExampleProcedureProceduren Compute sum of all elements of vectorn Store result at destination locationvoid combine1(vec_ptr v, int *dest){ int i; *dest = 0; for (i = 0; i < vec_length(v); i++) { int val; get_vec_element(v, i, &val); *dest += val; }}– 12 –15-213, F’02Time ScalesTime ScalesAbsolute TimeAbsolute Timen Typically use nanosecondsl 10–9 secondsn Time scale of computer instructionsClock CyclesClock Cyclesn Most computers controlled by high frequency clock signaln Typical Rangel 100 MHz» 108 cycles per second» Clock period = 10nsl 2 GHz» 2 X 109 cycles per second» Clock period = 0.5nsn Fish machines: 550 MHz (1.8 ns clock period)– 13 –15-213,
View Full Document