UT CS 429H - Program Optimization

This preview shows page 1-2-3-4-25-26-27-51-52-53-54 out of 54 pages.



Slide titles:
  Slide 1
  Today
  Performance Realities
  Optimizing Compilers
  Limitations of Optimizing Compilers
  Generally Useful Optimizations
  Compiler-Generated Code Motion
  Reduction in Strength
  Share Common Subexpressions
  Optimization Blocker #1: Procedure Calls
  Lower Case Conversion Performance
  Convert Loop To Goto Form
  Calling Strlen
  Improving Performance
  Lower Case Conversion Performance
  Optimization Blocker: Procedure Calls
  Memory Matters
  Memory Aliasing
  Removing Aliasing
  Optimization Blocker: Memory Aliasing
  Exploiting Instruction-Level Parallelism
  Benchmark Example: Data Type for Vectors
  Benchmark Computation
  Cycles Per Element (CPE)
  Benchmark Performance
  Basic Optimizations
  Effect of Basic Optimizations
  Modern CPU Design
  Superscalar Processor
  Nehalem CPU
  x86-64 Compilation of Combine4
  Combine4 = Serial Computation (OP = *)
  Loop Unrolling
  Effect of Loop Unrolling
  Loop Unrolling with Reassociation
  Effect of Reassociation
  Reassociated Computation
  Loop Unrolling with Separate Accumulators
  Effect of Separate Accumulators
  Separate Accumulators
  Unrolling & Accumulating
  Unrolling & Accumulating: Double *
  Unrolling & Accumulating: Int +
  Achievable Performance
  Using Vector Instructions
  What About Branches?
  Modern CPU Design
  Branch Outcomes
  Branch Prediction
  Branch Prediction Through Loop
  Branch Misprediction Invalidation
  Branch Misprediction Recovery
  Effect of Branch Prediction
  Getting High Performance

Slide 1: Program Optimization

Slide 2: Today
- Overview
- Generally Useful Optimizations
  - Code motion/precomputation
  - Strength reduction
  - Sharing of common subexpressions
  - Removing unnecessary procedure calls
- Optimization Blockers
  - Procedure calls
  - Memory aliasing
- Exploiting Instruction-Level Parallelism
- Dealing with Conditionals

Slide 3: Performance Realities
- There's more to performance than asymptotic complexity
- Constant factors matter too!
  - Easily see 10:1 performance range depending on how code is written
  - Must optimize at multiple levels: algorithm, data representations, procedures, and loops
- Must understand system to optimize performance
  - How programs are compiled and executed
  - How to measure program performance and identify bottlenecks
  - How to improve performance without destroying code modularity and generality

Slide 4: Optimizing Compilers
- Provide efficient mapping of program to machine
  - register allocation
  - code selection and ordering (scheduling)
  - dead code elimination
  - eliminating minor inefficiencies
- Don't (usually) improve asymptotic efficiency
  - up to programmer to select best overall algorithm
  - big-O savings are (often) more important than constant factors
    - but constant factors also matter
- Have difficulty overcoming "optimization blockers"
  - potential memory aliasing
  - potential procedure side-effects

Slide 5: Limitations of Optimizing Compilers
- Operate under a fundamental constraint
  - Must not cause any change in program behavior
  - This often prevents optimizations that would affect behavior only under pathological conditions
- Behavior that may be obvious to the programmer can be obfuscated by languages and coding styles
  - e.g., data ranges may be more limited than variable types suggest
- Most analysis is performed only within procedures
  - Whole-program analysis is too expensive in most cases
- Most analysis is based only on static information
  - Compiler has difficulty anticipating run-time inputs
- When in doubt, the compiler must be conservative

Slide 6: Generally Useful Optimizations
- Optimizations that you or the compiler should do regardless of processor / compiler
- Code Motion
  - Reduce frequency with which computation is performed
    - If it will always produce the same result
    - Especially moving code out of a loop

Original:

    void set_row(double *a, double *b, long i, long n)
    {
        long j;
        for (j = 0; j < n; j++)
            a[n*i+j] = b[j];
    }

After code motion (n*i computed once, outside the loop):

        long j;
        long ni = n*i;
        for (j = 0; j < n; j++)
            a[ni+j] = b[j];

Slide 7: Compiler-Generated Code Motion

Source:

    void set_row(double *a, double *b, long i, long n)
    {
        long j;
        for (j = 0; j < n; j++)
            a[n*i+j] = b[j];
    }

What the compiler effectively generates:

        long j;
        long ni = n*i;
        double *rowp = a+ni;
        for (j = 0; j < n; j++)
            *rowp++ = b[j];

Assembly:

    set_row:
            testq   %rcx, %rcx              # Test n
            jle     .L4                     # If 0, goto done
            movq    %rcx, %rax              # rax = n
            imulq   %rdx, %rax              # rax *= i
            leaq    (%rdi,%rax,8), %rdx     # rowp = A + n*i*8
            movl    $0, %r8d                # j = 0
    .L3:                                    # loop:
            movq    (%rsi,%r8,8), %rax      # t = b[j]
            movq    %rax, (%rdx)            # *rowp = t
            addq    $1, %r8                 # j++
            addq    $8, %rdx                # rowp++
            cmpq    %r8, %rcx               # Compare n:j
            jg      .L3                     # If >, goto loop
    .L4:                                    # done:
            rep ; ret

Where are the FP operations?

Slide 8: Reduction in Strength
- Replace costly operation with simpler one
- Shift, add instead of multiply or divide
      16*x --> x << 4
  - Utility is machine dependent
  - Depends on cost of multiply or divide instruction
    - On Intel Nehalem, integer multiply requires 3 CPU cycles
- Recognize sequence of products

Before:

    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            a[n*i + j] = b[j];

After:

    int ni = 0;
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++)
            a[ni + j] = b[j];
        ni += n;
    }

Slide 9: Share Common Subexpressions
- Reuse portions of expressions
- Compilers often not very sophisticated in exploiting arithmetic properties

3 multiplications: i*n, (i-1)*n, (i+1)*n

    /* Sum neighbors of i,j */
    up    = val[(i-1)*n + j  ];
    down  = val[(i+1)*n + j  ];
    left  = val[i*n     + j-1];
    right = val[i*n     + j+1];
    sum = up + down + left + right;

    leaq    1(%rsi), %rax   # i+1
    leaq    -1(%rsi), %r8   # i-1
    imulq   %rcx, %rsi      # i*n
    imulq   %rcx, %rax      # (i+1)*n
    imulq   %rcx, %r8       # (i-1)*n
    addq    %rdx, %rsi      # i*n+j
    addq    %rdx, %rax      # (i+1)*n+j
    addq    %rdx, %r8       # (i-1)*n+j

1 multiplication: i*n

    long inj = i*n + j;
    up    = val[inj - n];
    down  = val[inj + n];
    left  = val[inj - 1];
    right = val[inj + 1];
    sum = up + down + left + right;

    imulq   %rcx, %rsi          # i*n
    addq    %rdx, %rsi          # i*n+j
    movq    %rsi, %rax          # i*n+j
    subq    %rcx, %rax          # i*n+j-n
    leaq    (%rsi,%rcx), %rcx   # i*n+j+n

Slide 10: Optimization Blocker #1: Procedure Calls
- Procedure to Convert String to Lower Case
  - Extracted from 213 lab submissions, Fall 1998

    void lower(char *s)
    {
        int i;
        for (i = 0; i < strlen(s); i++)
            if (s[i] >= 'A' && s[i] <= 'Z')
                s[i] -= ('A' - 'a');
    }

Slide 11: Lower Case Conversion Performance
- Time quadruples when string length doubles
- Quadratic performance

[Plot: CPU seconds (0 to 200) versus string length (0 to 500,000) for lower]

Slide 12: Convert Loop To Goto Form
- strlen executed every iteration

    void lower(char *s)
    {
        int i = 0;
        if (i >= strlen(s))
            goto done;
     loop:
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
        i++;
        if (i < strlen(s))
            goto loop;
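
The preview text stops partway through the goto form, before the slides that address the problem. As a hedged illustration only (the name lower2, the use of size_t, and the local variable len are assumptions, not taken from the preview), the usual way to remove the repeated call is to apply code motion by hand: compute strlen(s) once before the loop. A compiler generally cannot do this itself, because the loop writes to s and the compiler must assume the call's result could change between iterations.

```c
#include <stddef.h>
#include <string.h>

/* Form shown in the slides: strlen(s) is evaluated in the loop test on
 * every iteration, so a string of length n costs roughly n * n steps.
 * (size_t replaces the slide's int index to match strlen's return type.) */
void lower1(char *s)
{
    size_t i;
    for (i = 0; i < strlen(s); i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}

/* Hypothetical fix (name assumed): hoist strlen out of the loop.
 * This is safe here because lowering a letter never writes a '\0',
 * so the string length cannot change while the loop runs. */
void lower2(char *s)
{
    size_t i;
    size_t len = strlen(s);   /* computed once */
    for (i = 0; i < len; i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}
```

Both functions produce the same result on any NUL-terminated, writable string; only the number of strlen calls differs, so lower2 should scale linearly with string length where lower1 scales quadratically.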

