Code OptimizationSept. 25, 2003Code OptimizationSept. 25, 2003TopicsTopics Machine-Independent Optimizations Machine Dependent Optimizations Code Profilingclass10.ppt15-213“The course that gives CMU its Zip!”– 2 –15-213, F’03Harsh RealityHarsh RealityThere’s more to performance than asymptotic complexityThere’s more to performance than asymptotic complexityConstant factors matter too!Constant factors matter too! Easily see 10:1 performance range depending on how code is written Must optimize at multiple levels: algorithm, data representations, procedures, and loopsMust understand system to optimize performanceMust understand system to optimize performance How programs are compiled and executed How to measure program performance and identify bottlenecks How to improve performance without destroying code modularity and generality– 3 –15-213, F’03Limitations of Optimizing CompilersLimitations of Optimizing CompilersOperate under fundamental constraintOperate under fundamental constraint Must not cause any change in program behavior under any possible condition Often prevents it from making optimizations when would only affect behavior under pathological conditions.Behavior that may be obvious to the programmer can be Behavior that may be obvious to the programmer can be obfuscated by languages and coding stylesobfuscated by languages and coding styles e.g., Data ranges may be more limited than variable types suggestMost analysis is performed only within proceduresMost analysis is performed only within procedures Whole-program analysis is too expensive in most casesMost analysis is based only on Most analysis is based only on staticstaticinformationinformation Compiler has difficulty anticipating run-time inputsWhen in doubt, the compiler must be conservativeWhen in doubt, the compiler must be conservative– 4 –15-213, F’03Machine-Independent OptimizationsMachine-Independent OptimizationsOptimizations that you or compiler should do Optimizations that you or compiler should do regardless of processor / compilerregardless of processor / compilerCode MotionCode Motion Reduce frequency with which computation performed If it will always produce same result Especially moving code out of loopfor (i = 0; i < n; i++)for (j = 0; j < n; j++)a[n*i + j] = b[j];for (i = 0; i < n; i++) {int ni = n*i;for (j = 0; j < n; j++)a[ni + j] = b[j];}– 5 –15-213, F’03Compiler-Generated Code MotionCompiler-Generated Code Motion Most compilers do a good job with array code + simple loop structuresCode Generated by GCCCode Generated by GCCfor (i = 0; i < n; i++)for (j = 0; j < n; j++)a[n*i + j] = b[j];imull %ebx,%eax # i*nmovl 8(%ebp),%edi # aleal (%edi,%eax,4),%edx # p = a+i*n (scaled by 4)# Inner Loop.L40:movl 12(%ebp),%edi # bmovl (%edi,%ecx,4),%eax # b+j (scaled by 4)movl %eax,(%edx) # *p = b[j]addl $4,%edx # p++ (scaled by 4)incl %ecx # j++jl .L40 # loop if j<nfor (i = 0; i < n; i++) {int ni = n*i;int *p = a+ni;for (j = 0; j < n; j++)*p++ = b[j];}– 6 –15-213, F’03Reduction in StrengthReduction in Strength Replace costly operation with simpler one Shift, add instead of multiply or divide16*x --> x << 4 Utility machine dependent Depends on cost of multiply or divide instruction On Pentium II or III, integer multiply only requires 4 CPU cycles Recognize sequence of productsfor (i = 0; i < n; i++)for (j = 0; j < n; j++)a[n*i + j] = b[j];int ni = 0;for (i = 0; i < n; i++) {for (j = 0; j < n; j++)a[ni + j] = b[j];ni += n;}– 7 –15-213, F’03Share Common SubexpressionsShare Common Subexpressions Reuse portions of expressions Compilers often not very sophisticated in exploiting arithmetic properties/* Sum neighbors of i,j */up = val[(i-1)*n + j];down = val[(i+1)*n + j];left = val[i*n + j-1];right = val[i*n + j+1];sum = up + down + left + right;int inj = i*n + j;up = val[inj - n];down = val[inj + n];left = val[inj - 1];right = val[inj + 1];sum = up + down + left + right;3 multiplications: i*n, (i–1)*n, (i+1)*n 1 multiplication: i*nleal -1(%edx),%ecx # i-1imull %ebx,%ecx # (i-1)*nleal 1(%edx),%eax # i+1imull %ebx,%eax # (i+1)*nimull %ebx,%edx # i*n– 8 –15-213, F’03Time ScalesTime ScalesAbsolute TimeAbsolute Time Typically use nanoseconds 10–9seconds Time scale of computer instructionsClock CyclesClock Cycles Most computers controlled by high frequency clock signal Typical Range 100 MHz » 108cycles per second» Clock period = 10ns 2 GHz » 2 X 109cycles per second» Clock period = 0.5ns Fish machines: 550 MHz (1.8 ns clock period)– 9 –15-213, F’03Cycles Per ElementCycles Per Element Convenient way to express performance of program that operators on vectors or lists Length = n T = CPE*n + Overhead010020030040050060070080090010000 50 100 150 200ElementsCyclesvsum1Slope = 4.0vsum2Slope = 3.5– 10 –15-213, F’03Vector Abstract Data Type (ADT)Vector Abstract Data Type (ADT)ProceduresProceduresvec_ptr new_vec(int len) Create vector of specified lengthint get_vec_element(vec_ptr v, int index, int *dest) Retrieve vector element, store at *dest Return 0 if out of bounds, 1 if successfulint *get_vec_start(vec_ptr v) Return pointer to start of vector data Similar to array implementations in Pascal, ML, Java E.g., always do bounds checkinglengthdata•••• •••• ••••0 1 2 length–1– 11 –15-213, F’03Optimization ExampleOptimization ExampleProcedureProcedure Compute sum of all elements of integer vector Store result at destination location Vector data structure and operations defined via abstract data typePentium II/III Performance: Clock Cycles / ElementPentium II/III Performance: Clock Cycles / Element 42.06 (Compiled -g) 31.25 (Compiled -O2)void combine1(vec_ptr v, int *dest){int i;*dest = 0;for (i = 0; i < vec_length(v); i++) {int val;get_vec_element(v, i, &val);*dest += val;}}– 12 –15-213, F’03Understanding LoopUnderstanding LoopInefficiencyInefficiency Procedure vec_length called every iteration Even though result always the samevoid combine1-goto(vec_ptr v, int *dest){int i = 0;int val;*dest = 0;if (i >= vec_length(v))goto done;loop:get_vec_element(v, i, &val);*dest += val;i++;if (i < vec_length(v))goto loopdone:}1 iteration– 13 –15-213, F’03Move vec_length Call Out of LoopMove vec_length Call Out of LoopOptimizationOptimization Move call to
View Full Document