Code Optimization and PerformanceTopicsSpeed and optimizationGreat Reality #4Optimizing CompilersLimitations of Optimizing CompilersNew Topic: Machine-Independent OptimizationsCompiler-Generated Code MotionReduction in StrengthMake Use of RegistersMachine-Independent Opts. (Cont.)Example: Vector ADTOptimization ExampleTime ScalesCycles Per ElementSlide 16Move vec_length Call Out of LoopCode Motion Example #2Lowercase-Conversion PerformanceSlide 20Improving PerformanceSlide 22Optimization Blocker: Procedure CallsSlide 24Eliminate Unneeded Memory RefsDetecting Unneeded Memory Refs.Optimization Blocker: Memory AliasingMachine-Independent Optimization SummaryPointer CodePointer vs. Array Code Inner LoopsImportant ToolsNew Topic: Code Profiling ExampleCode ProfilingProfiling ResultsCode OptimizationsFurther OptimizationsProfiling ObservationsCode Optimization and Performance Code Optimization and Performance Chapter 5Chapter 5perf01.pptCS 105“Tour of the Black Holes of Computing”– 2 –CS 105TopicsTopicsMachine-independentMachine-independent optimizationsoptimizationsCode motionReduction in strengthCommon subexpression sharingTuning: Tuning: Identifying performance bottlenecksMachine-dependent optimizationsPointer codeLoop unrollingEnabling instruction-level parallelismUnderstanding processor optimizationTranslation of instructions into operationsOut-of-order executionBranchesCaches and BlockingAdvice– 3 –CS 105Speed and optimizationSpeed and optimizationProgrammerProgrammerChoice of algorithmIntelligent codingCompilerCompilerChoice of instructionsMoving codeReordering codeStrength reductionMust be faithful to original programProcessorProcessorPipeliningMultiple execution unitsMemory accessesBranchesCaches Rest of systemRest of systemUncontrollable– 4 –CS 105Great Reality #4Great Reality #4There’s more to performance thanThere’s more to performance thanasymptotic complexityasymptotic complexityConstant factors matter too!Constant factors matter too!Easily see 10:1 performance range depending on how code is writtenMust optimize at multiple levels: Algorithm, data representations, procedures, and loopsMust understand system to optimize performanceMust understand system to optimize performanceHow programs are compiled and executedHow to measure program performance and identify bottlenecksHow to improve performance without destroying code modularity, generality, readability– 5 –CS 105Optimizing CompilersOptimizing CompilersProvide efficient mapping of program to machineProvide efficient mapping of program to machineRegister allocationCode selection and orderingEliminating minor inefficienciesDon’t (usually) improve asymptotic efficiencyDon’t (usually) improve asymptotic efficiencyUp to programmer to select best overall algorithmBig-O savings are (often) more important than constant factorsBut constant factors also matterHave difficulty overcoming “optimization blockers”Have difficulty overcoming “optimization blockers”Potential memory aliasingPotential procedure side-effects– 6 –CS 105Limitationsof Optimizing CompilersLimitationsof Optimizing CompilersOperate Under Fundamental ConstraintOperate Under Fundamental ConstraintMust not cause any change in program behavior under any possible conditionOften prevents making optimizations that would only affect behavior under pathological conditions.Behavior that may be obvious to the programmer can be Behavior that may be obvious to the programmer can be obfuscated by languages and coding stylesobfuscated by languages and coding stylesE.g., data ranges may be more limited than variable types suggestMost analysis is performed only within proceduresMost analysis is performed only within proceduresWhole-program analysis is too expensive in most casesMost analysis is based only on Most analysis is based only on staticstatic information informationCompiler has difficulty anticipating run-time inputsWhen in doubt, the compiler must be conservativeWhen in doubt, the compiler must be conservative– 7 –CS 105New Topic:Machine-Independent OptimizationsNew Topic:Machine-Independent OptimizationsOptimizations you should do regardless of processor / compilerCode MotionCode MotionReduce frequency with which computation performedIf it will always produce same resultEspecially moving code out of loopfor (i = 0; i < n; i++) for (j = 0; j < n; j++) a[n*i + j] = b[j];for (i = 0; i < n; i++) { int ni = n*i; for (j = 0; j < n; j++) a[ni + j] = b[j];}– 8 –CS 105Compiler-Generated Code MotionCompiler-Generated Code MotionMost compilers do a good job with array code + simple loop structuresCode Generated by GCCCode Generated by GCCfor (i = 0; i < n; i++) for (j = 0; j < n; j++) a[n*i + j] = b[j]; imull %ebx,%eax # i*n movl 8(%ebp),%edi # a leal (%edi,%eax,4),%edx # p = a+i*n (scaled by 4)# Inner Loop.L40: movl 12(%ebp),%edi # b movl (%edi,%ecx,4),%eax # b+j (scaled by 4) movl %eax,(%edx) # *p = b[j] addl $4,%edx # p++ (scaled by 4) incl %ecx # j++ cmpl %ebx,%ecx # j : n (reversed) jl .L40 # loop if j<nfor (i = 0; i < n; i++) { int ni = n*i; int *p = a+ni; for (j = 0; j < n; j++) *p++ = b[j];}– 9 –CS 105Reduction in StrengthReduction in StrengthReplace costly operation with simpler oneShift, add instead of multiply or divide16*x --> x << 4Utility is machine-dependentDepends on cost of multiply or divide instructionOn Pentium II or III, integer multiply only requires 4 CPU cyclesRecognize sequence of productsfor (i = 0; i < n; i++) for (j = 0; j < n; j++) a[n*i + j] = b[j];int ni = 0;for (i = 0; i < n; i++) { for (j = 0; j < n; j++) a[ni + j] = b[j]; ni += n;}– 10 –CS 105Make Use of RegistersMake Use of RegistersReading and writing registers much faster than reading/writing memoryLimitationLimitationCompiler not always able to determine whether variable can be held in registerPossibility of aliasingSee example later– 11 –CS 105Machine-Independent Opts. (Cont.)Machine-Independent Opts. (Cont.)Share Common SubexpressionsShare Common SubexpressionsReuse portions of expressionsCompilers often not very sophisticated in exploiting arithmetic properties/* Sum neighbors of i,j */up = val[(i-1)*n + j];down = val[(i+1)*n + j];left = val[i*n + j-1];right = val[i*n +
View Full Document