CMU CS 15740 - Lecture

Programming for Performance
CS 740
Oct. 4, 2000

Contents: Performance Matters · Optimizing Compilers · Limitations of Optimizing Compilers · What do compilers try to do? · Matrix Multiply – Simple Version · Matrix Multiply – Hand Optimized · Results · Why is Simple Better? · Optimization blocker: pointers · SGI's Superior Compiler · Loop Interchange · Software Pipelining · Software Pipelining cont. · Code Motion Examples · Optimization Blocker: Procedure Calls · Role of Programmer

Topics
•How architecture impacts your programs
•How (and how not) to tune your code

Performance Matters
Constant factors count!
•easily see a 10:1 performance range depending on how code is written
•must optimize at multiple levels:
–algorithm, data representations, procedures, and loops
Must understand the system to optimize performance
•how programs are compiled and executed
•how to measure program performance and identify bottlenecks
•how to improve performance without destroying code modularity and generality

Optimizing Compilers
Provide an efficient mapping of program to machine
•register allocation
•code selection and ordering
•eliminating minor inefficiencies
Don't (usually) improve asymptotic efficiency
•up to the programmer to select the best overall algorithm
•big-O savings are (often) more important than constant factors
–but constant factors also matter
Have difficulty overcoming "optimization blockers"
•potential memory aliasing
•potential procedure side effects

Limitations of Optimizing Compilers
Behavior that may be obvious to the programmer can be obscured by languages and coding styles
•e.g., data ranges may be more limited than variable types suggest
–e.g., using an "int" in C for what could be an enumerated type
Most analysis is performed only within procedures
•whole-program analysis is too expensive in most cases
Most analysis is based only on static information
•the compiler has difficulty anticipating run-time inputs
When in doubt, the compiler must be conservative
•it cannot perform an optimization if that changes program behavior under any realizable circumstance
–even if the circumstances seem quite bizarre and unlikely

What do compilers try to do?
Reduce the number of instructions
•dynamic
•static
Take advantage of parallelism
Eliminate useless work
Optimize memory access patterns
Use special hardware when available

Matrix Multiply – Simple Version
Heavy use of memory operations, addition, and multiplication; contains redundant operations.

    for (i = 0; i < SIZE; i++) {
      for (j = 0; j < SIZE; j++) {
        for (k = 0; k < SIZE; k++) {
          c[i][j] += a[i][k] * b[k][j];
        }
      }
    }

Matrix Multiply – Hand Optimized
Turned array accesses into pointer dereferences; assign to each element of c just once. (The second pointer must walk down a column of b; the preview's "&a[0][j]" is corrected to "&b[0][j]" below.)

    for (i = 0; i < SIZE; i++) {
      int *orig_pa = &a[i][0];
      for (j = 0; j < SIZE; j++) {
        int *pa = orig_pa;
        int *pb = &b[0][j];
        int sum = 0;
        for (k = 0; k < SIZE; k++) {
          sum += *pa * *pb;
          pa++;
          pb += SIZE;
        }
        c[i][j] = sum;
      }
    }

Results
Is the "optimized" code optimal?

    R10000       Simple   Optimized
    cc -O0       34.7s    27.4s
    cc -O3        5.3s     8.0s
    egcc -O9     10.1s     8.3s

    21164        Simple   Optimized
    cc -O0       40.5s    12.2s
    cc -O5       16.7s    18.6s
    egcc -O0     27.2s    19.5s
    egcc -O9     12.3s    14.7s

    Pentium II   Simple   Optimized
    egcc -O9     28.4s    25.3s

    RS/6000      Simple   Optimized
    xlC -O3      63.9s    65.3s

Why is Simple Better?
•Easier for humans and the compiler to understand
–the more the compiler knows, the more it can do
•Pointers are hard to analyze; arrays are easier
•You never know how fast code will run until you time it
•The transformations we did by hand, good optimizers will do for us
–and they will often do a better job than we can
•Pointers may cause aliases and data dependences where the array code had none

Optimization blocker: pointers
Aliasing: if a compiler can't tell what a pointer points at, it must be conservative and assume it can point at almost anything. E.g., we could optimize the loop below much more aggressively if only we knew that our strings do not alias each other. (The preview's copy loop incremented src before copying, skipping the first character; it is corrected here.)

    void strcpy(char *dst, char *src)
    {
      while (*src != '\0')
        *(dst++) = *(src++);
      *dst = '\0';
    }

SGI's Superior Compiler
Loop unrolling
•the central loop is unrolled 2x
Code scheduling
•loads are moved up in the schedule to hide their latency
Loop interchange
•the inner two loops are interchanged, giving us ikj rather than ijk
–better cache performance gives us a huge benefit
Software pipelining
•do the loads for the next iteration while doing the multiply for the current iteration
Strength reduction
•add 4 to the current array location to get the next one, rather than multiplying by the index
Loop-invariant code motion
•values that are constant across iterations are not re-computed for each loop iteration

Loop Interchange
Does any loop iteration read a value produced by any other iteration? What do the memory access patterns look like in the inner loop?
•ijk: constant += sequential * striding
•ikj: sequential += constant * sequential
•jik: constant += sequential * striding
•jki: striding += striding * constant
•kij: sequential += constant * sequential
•kji: striding += striding * constant

    for (i = 0; i < SIZE; i++)
      for (j = 0; j < SIZE; j++)
        for (k = 0; k < SIZE; k++)
          c[i][j] += a[i][k] * b[k][j];

Software Pipelining

    for (j = 0; j < SIZE; j++)
      c_r[j] += a_r_c * b_r[j];

[Figure: dataflow graph of one iteration. "load b_r[j]" feeds a multiply by a_r_c; the product and "load c_r[j]" feed an add; the sum is stored back to c_r[j].]
•Now we must optimize the inner loop
•We want to do as much work as possible in each iteration
•Keep all of the functional units in the processor busy

Software Pipelining cont.
Not pipelined: each iteration performs load b_r[j], load c_r[j], multiply, add, and store c_r[j] strictly in sequence before the next iteration begins.
Pipelined: the loads for the next iteration overlap the multiply, add, and store of the current one, so the schedule has three phases: Fill (the first loads are issued), Steady State (loads, arithmetic, and stores from adjacent iterations are in flight at once), and Drain (the last stores complete).
[Figure: the original slide repeats the per-iteration dataflow graph across iterations, staggered to show the Fill, Steady State, and Drain phases.]

Code Motion Examples
•Sum integers from 1 to n!
Bad:

    sum = 0;
    for (i = 0; i <= fact(n); i++)
      sum +=

