Slide 1TodayPerformance RealitiesOptimizing CompilersLimitations of Optimizing CompilersGenerally Useful OptimizationsCompiler-Generated Code MotionReduction in StrengthShare Common SubexpressionsOptimization Blocker #1: Procedure CallsLower Case Conversion PerformanceConvert Loop To Goto FormCalling StrlenImproving PerformanceLower Case Conversion PerformanceOptimization Blocker: Procedure CallsMemory MattersMemory AliasingRemoving AliasingOptimization Blocker: Memory AliasingExploiting Instruction-Level ParallelismBenchmark Example: Data Type for VectorsBenchmark ComputationCycles Per Element (CPE)Benchmark PerformanceBasic OptimizationsEffect of Basic OptimizationsModern CPU DesignSuperscalar ProcessorNehalem CPUx86-64 Compilation of Combine4Combine4 = Serial Computation (OP = *)Loop UnrollingEffect of Loop UnrollingLoop Unrolling with ReassociationEffect of ReassociationReassociated ComputationLoop Unrolling with Separate AccumulatorsEffect of Separate AccumulatorsSeparate AccumulatorsUnrolling & AccumulatingUnrolling & Accumulating: Double *Unrolling & Accumulating: Int +Achievable PerformanceUsing Vector InstructionsWhat About Branches?Modern CPU DesignBranch OutcomesBranch PredictionBranch Prediction Through LoopBranch Misprediction InvalidationBranch Misprediction RecoveryEffect of Branch PredictionGetting High Performance1Program Optimization2TodayOverviewGenerally Useful OptimizationsCode motion/precomputationStrength reductionSharing of common subexpressionsRemoving unnecessary procedure callsOptimization BlockersProcedure callsMemory aliasingExploiting Instruction-Level ParallelismDealing with Conditionals3Performance RealitiesThere’s more to performance than asymptotic complexityConstant factors matter too!Easily see 10:1 performance range depending on how code is writtenMust optimize at multiple levels: algorithm, data representations, procedures, and loopsMust understand system to optimize performanceHow programs are compiled and executedHow to measure program performance and identify bottlenecksHow to improve performance without destroying code modularity and generality4Optimizing CompilersProvide efficient mapping of program to machineregister allocationcode selection and ordering (scheduling)dead code eliminationeliminating minor inefficienciesDon’t (usually) improve asymptotic efficiencyup to programmer to select best overall algorithmbig-O savings are (often) more important than constant factorsbut constant factors also matterHave difficulty overcoming “optimization blockers”potential memory aliasingpotential procedure side-effects5Limitations of Optimizing CompilersOperate under fundamental constraintMust not cause any change in program behaviorOften prevents it from making optimizations when would only affect behavior under pathological conditions.Behavior that may be obvious to the programmer can be obfuscated by languages and coding stylese.g., Data ranges may be more limited than variable types suggestMost analysis is performed only within proceduresWhole-program analysis is too expensive in most casesMost analysis is based only on static informationCompiler has difficulty anticipating run-time inputsWhen in doubt, the compiler must be conservative6Generally Useful OptimizationsOptimizations that you or the compiler should do regardless of processor / compilerCode MotionReduce frequency with which computation performedIf it will always produce same resultEspecially moving code out of loop long j; int ni = n*i; for (j = 0; j < n; j++)a[ni+j] = b[j];void set_row(double *a, double *b, long i, long n){ long j; for (j = 0; j < n; j++)a[n*i+j] = b[j];}7Compiler-Generated Code Motionset_row:testq %rcx, %rcx # Test njle .L4 # If 0, goto donemovq %rcx, %rax # rax = nimulq %rdx, %rax # rax *= ileaq (%rdi,%rax,8), %rdx # rowp = A + n*i*8movl $0, %r8d # j = 0.L3: # loop:movq (%rsi,%r8,8), %rax # t = b[j]movq %rax, (%rdx) # *rowp = taddq $1, %r8 # j++addq $8, %rdx # rowp++cmpq %r8, %rcx # Compare n:jjg .L3 # If >, goto loop.L4: # done:rep ; ret long j; long ni = n*i; double *rowp = a+ni; for (j = 0; j < n; j++)*rowp++ = b[j];void set_row(double *a, double *b, long i, long n){ long j; for (j = 0; j < n; j++)a[n*i+j] = b[j];}Where are the FP operations?8Reduction in StrengthReplace costly operation with simpler oneShift, add instead of multiply or divide16*x --> x << 4Utility machine dependentDepends on cost of multiply or divide instruction–On Intel Nehalem, integer multiply requires 3 CPU cyclesRecognize sequence of productsfor (i = 0; i < n; i++) for (j = 0; j < n; j++) a[n*i + j] = b[j];int ni = 0;for (i = 0; i < n; i++) { for (j = 0; j < n; j++) a[ni + j] = b[j]; ni += n;}9Share Common SubexpressionsReuse portions of expressionsCompilers often not very sophisticated in exploiting arithmetic properties/* Sum neighbors of i,j */up = val[(i-1)*n + j ];down = val[(i+1)*n + j ];left = val[i*n + j-1];right = val[i*n + j+1];sum = up + down + left + right;long inj = i*n + j;up = val[inj - n];down = val[inj + n];left = val[inj - 1];right = val[inj + 1];sum = up + down + left + right;3 multiplications: i*n, (i–1)*n, (i+1)*n 1 multiplication: i*nleaq 1(%rsi), %rax # i+1leaq -1(%rsi), %r8 # i-1imulq %rcx, %rsi # i*nimulq %rcx, %rax # (i+1)*nimulq %rcx, %r8 # (i-1)*naddq %rdx, %rsi # i*n+jaddq %rdx, %rax # (i+1)*n+jaddq %rdx, %r8 # (i-1)*n+jimulq %rcx, %rsi # i*naddq %rdx, %rsi # i*n+jmovq %rsi, %rax # i*n+jsubq %rcx, %rax # i*n+j-nleaq (%rsi,%rcx), %rcx # i*n+j+n10void lower(char *s){ int i; for (i = 0; i < strlen(s); i++) if (s[i] >= 'A' && s[i] <= 'Z') s[i] -= ('A' - 'a');}Optimization Blocker #1: Procedure CallsProcedure to Convert String to Lower CaseExtracted from 213 lab submissions, Fall, 199811Lower Case Conversion PerformanceTime quadruples when double string lengthQuadratic performance0 50000 100000 150000 200000 2500 0 0 300000 350000 400000 450000 500000020406080100120140160180200lowerString lengthCPU seconds12Convert Loop To Goto Form strlen executed every iterationvoid lower(char *s){ int i = 0; if (i >= strlen(s)) goto done; loop: if (s[i] >= 'A' && s[i] <= 'Z') s[i] -= ('A' - 'a'); i++; if (i < strlen(s)) goto loop;
View Full Document