Programming for Performance
CS 740 F'02
Oct. 7, 2002

Topics
• How architecture impacts your programs
• How (and how not) to tune your code
• Statically scheduled processors

Performance Matters
Constant factors count!
• can easily see a 10:1 performance range depending on how code is written
• must optimize at multiple levels:
  – algorithm, data representations, procedures, and loops
Must understand the system to optimize performance
• how programs are compiled and executed
• how to measure program performance and identify bottlenecks
• how to improve performance without destroying code modularity and generality

Optimizing Compilers
Provide efficient mapping of program to machine
• register allocation
• code selection and ordering
• eliminating minor inefficiencies
Don't (usually) improve asymptotic efficiency
• up to the programmer to select the best overall algorithm
• big-O savings are (often) more important than constant factors
  – but constant factors also matter
Have difficulty overcoming "optimization blockers"
• potential memory aliasing
• potential procedure side-effects

Limitations of Optimizing Compilers
Behavior that may be obvious to the programmer can be obfuscated by languages and coding styles
• e.g., data ranges may be more limited than variable types suggest
  – e.g., using an "int" in C for what could be an enumerated type
Most analysis is performed only within procedures
• whole-program analysis is too expensive in most cases
Most analysis is based only on static information
• the compiler has difficulty anticipating run-time inputs
When in doubt, the compiler must be conservative
• cannot perform an optimization if it changes program behavior under any realizable circumstance
  – even if the circumstances seem quite bizarre and unlikely

What do compilers try to do?
Reduce the number of instructions
• Dynamic
• Static
Take advantage of parallelism
Optimize memory access patterns
Use special hardware when available

Matrix Multiply – Simple Version
Heavy use of memory operations, addition and multiplication
Contains redundant operations

    for (i = 0; i < SIZE; i++) {
        for (j = 0; j < SIZE; j++) {
            for (k = 0; k < SIZE; k++) {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }

Matrix Multiply – Hand Optimized
Turned array accesses into pointer dereferences
Assign to each element of c just once

    for (i = 0; i < SIZE; i++) {
        int *orig_pa = &a[i][0];          /* start of row i of a */
        for (j = 0; j < SIZE; j++) {
            int *pa = orig_pa;
            int *pb = &b[0][j];           /* top of column j of b */
            int sum = 0;
            for (k = 0; k < SIZE; k++) {
                sum += *pa * *pb;
                pa++;                     /* next element in row of a */
                pb += SIZE;               /* next element in column of b */
            }
            c[i][j] = sum;
        }
    }

Results
Is the "optimized" code optimal?

    R10000        Simple    Optimized
    cc -O0        34.7s     27.4s
    cc -O3         5.3s      8.0s
    egcc -O9      10.1s      8.3s

    21164         Simple    Optimized
    cc -O0        40.5s     12.2s
    cc -O5        16.7s     18.6s
    egcc -O0      27.2s     19.5s
    egcc -O9      12.3s     14.7s

    Pentium II    Simple    Optimized
    egcc -O9      28.4s     25.3s

    RS/6000       Simple    Optimized
    xlC -O3       63.9s     65.3s

Why is Simple Better?
Easier for humans and the compiler to understand
• The more the compiler knows, the more it can do
Pointers are hard to analyze, arrays are easier
You never know how fast code will run until you time it
The transformations we did by hand, good optimizers will do for us
• And they will often do a better job than we can
Pointers may cause aliases and data dependences where the array code had none

Optimization blocker: pointers
Aliasing: if a compiler can't tell what a pointer points at, it must be conservative and assume it can point at almost anything
E.g.:

    void strcpy(char *dst, char *src)
    {
        while (*src != '\0')
            *(dst++) = *(src++);
        *dst = '\0';
    }

Could optimize to a much better loop if only we knew that our strings do not alias each other

SGI's Superior Compiler
Loop unrolling
• Central loop is unrolled 2X
Code scheduling
• Loads are moved up in the schedule to hide their latency
Loop interchange
• Inner two loops are interchanged, giving us ikj rather than ijk
  – Better cache performance – gives us a huge benefit
Software pipelining
• Do loads for the next iteration while doing the multiply for the current iteration
Strength reduction
• Add 4 to the current array location to get the next one rather than multiplying by the index
Loop invariant code motion
• Values that do not change inside the loop are not recomputed on each iteration

Loop Interchange

    for (i = 0; i < SIZE; i++)
        for (j = 0; j < SIZE; j++)
            for (k = 0; k < SIZE; k++)
                c[i][j] += a[i][k] * b[k][j];

Does any loop iteration read a value produced by any other iteration?
What do the memory access patterns look like in the inner loop?
• ijk: constant += sequential * striding
• ikj: sequential += constant * sequential
• jik: constant += sequential * striding
• jki: striding += striding * constant
• kij: sequential += constant * sequential
• kji: striding += striding * constant

Software Pipelining

    for (j = 0; j < SIZE; j++)
        c_r[j] += a_r_c * b_r[j];

Dataflow graph: load b_r[j] and a_r_c feed the multiply (*); the product and load c_r[j] feed the add (+); the sum feeds store c_r[j]
• Now must optimize the inner loop
• Want to do as much work as possible in each iteration
• Keep all of the functional units in the processor busy

Software Pipelining cont.

    for (j = 0; j < SIZE; j++)
        c_r[j] += a_r_c * b_r[j];

[Figure: the dataflow graph above (load b_r[j], load c_r[j], *, +, store c_r[j]) repeated for successive iterations. Not pipelined: each iteration finishes before the next begins. Pipelined: iterations overlap, with Fill, Steady State, and Drain phases.]

Code Motion Examples
• Sum integers from 1 to n!

Bad:

    sum = 0;
    for (i = 0; i <= fact(n); i++)
        sum += i;

Better:

    sum = 0;
    fn = fact(n);
    for (i = 0; i <= fn; i++)
        sum += i;

    sum = 0;
    for (i = fact(n); i > 0; i--)
        sum += i;

Best:

    fn = fact(n);
    sum = fn * (fn + 1) / 2;

Optimization Blocker: Procedure Calls
Why couldn't the compiler move fact(n) out of the inner loop?
Procedure may have side effects
• i.e., alters global state each time it is called
Function may not return the same value for given arguments
• Depends on other parts of global state
Why doesn't the compiler look at the code for fact(n)?
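The side-effect argument can be made concrete with a minimal C sketch. The call counter `fact_calls`, the iterative body of `fact`, and the wrapper names `sum_bad`, `sum_better`, and `sum_best` are assumptions made for illustration; the slides show only the loops themselves.

```c
#include <assert.h>

/* Hypothetical global state: counts how many times fact() has run.
   A compiler that sees only a call to fact() must assume state like
   this could exist, so it cannot safely remove or reorder the calls. */
static long long fact_calls = 0;

/* Iterative factorial with an observable side effect (the counter).
   n is limited to 12 so that fn * (fn + 1) fits in a long long. */
static long long fact(int n)
{
    assert(n >= 0 && n <= 12);
    fact_calls++;
    long long result = 1;
    for (int i = 2; i <= n; i++)
        result *= i;
    return result;
}

/* "Bad": fact(n) is re-evaluated on every loop test. */
static long long sum_bad(int n)
{
    long long sum = 0;
    for (long long i = 0; i <= fact(n); i++)
        sum += i;
    return sum;
}

/* "Better": the programmer hoists the call out of the loop by hand. */
static long long sum_better(int n)
{
    long long sum = 0;
    long long fn = fact(n);
    for (long long i = 0; i <= fn; i++)
        sum += i;
    return sum;
}

/* "Best": closed form, no loop at all. */
static long long sum_best(int n)
{
    long long fn = fact(n);
    return fn * (fn + 1) / 2;
}
```

All three functions return the same sum, but they change `fact_calls` by different amounts: for n = 5, `sum_bad` calls `fact` 122 times (once per loop test), while `sum_better` and `sum_best` call it once. Hoisting the call therefore changes observable program state, which is why a conservative compiler leaves it to the programmer.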
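As a companion to the loop interchange discussion, the following is a rough C sketch of the ijk and ikj orderings side by side. The function names `matmul_ijk` and `matmul_ikj` and the value `SIZE = 64` are assumptions chosen for illustration; the temporary `a_r_c` borrows its name from the software pipelining slides. This is a hand-written version of the transformation, not the code any particular compiler emits.

```c
#define SIZE 64   /* hypothetical matrix dimension */

/* ijk order: in the inner loop c[i][j] is constant, a[i][k] is read
   sequentially, and b[k][j] is read with a stride of SIZE elements.
   Accumulates into c, so the caller must zero c first. */
void matmul_ijk(int c[SIZE][SIZE], int a[SIZE][SIZE], int b[SIZE][SIZE])
{
    for (int i = 0; i < SIZE; i++)
        for (int j = 0; j < SIZE; j++)
            for (int k = 0; k < SIZE; k++)
                c[i][j] += a[i][k] * b[k][j];
}

/* ikj order: the inner two loops are interchanged. In the inner loop
   a[i][k] is constant (held in a_r_c), while c[i][j] and b[k][j] are
   both accessed sequentially along a row. */
void matmul_ikj(int c[SIZE][SIZE], int a[SIZE][SIZE], int b[SIZE][SIZE])
{
    for (int i = 0; i < SIZE; i++)
        for (int k = 0; k < SIZE; k++) {
            int a_r_c = a[i][k];           /* loop-invariant in j */
            for (int j = 0; j < SIZE; j++)
                c[i][j] += a_r_c * b[k][j];
        }
}
```

Both functions add the same products into each c[i][j], only in a different order, so for integer data they produce identical results. Any speed difference comes from the memory access pattern and should be confirmed by timing on the target machine.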