DOC PREVIEW
CMU CS 15740 - Lecture

This preview shows page 1-2 out of 7 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 7 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 7 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 7 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

Programming for Performance CS 740 Performance Matters Constant factors count easily see 10 1 performance range depending on how code is written must optimize at multiple levels algorithm data representations procedures and loops Oct 24 2003 Must understand system to optimize performance Topics how programs are compiled and executed how to measure program performance and identify bottlenecks how to improve performance without destroying code modularity and generality How architecture impacts your programs How and how not to tune your code Statically scheduled processors 2 Optimizing Compilers Provide efficient mapping of program to machine register allocation code selection and ordering eliminating minor inefficiencies Limitations of Optimizing Compilers Behavior that may be obvious to the programmer can be obfuscated by languages and coding styles e g data ranges may be more limited than variable types suggest e g using an int in C for what could be an enumerated type Don t usually improve asymptotic efficiency up to programmer to select best overall algorithm big O savings are often more important than constant factors but constant factors also matter Have difficulty overcoming optimization blockers potential memory aliasing potential procedure side effects 3 CS 740 F 03 Most analysis is performed only within procedures whole program analysis is too expensive in most cases Most analysis is based only on static information compiler has difficulty anticipating run time inputs When in doubt the compiler must be conservative cannot perform optimization if it changes program behavior under any realizable circumstance even if circumstances seem quite bizarre and unlikely CS 740 F 03 4 CS 740 F 03 What do compilers try to do for i 0 i SIZE i for j 0 j SIZE j for k 0 k SIZE k c i j a i k b k j Reduce the number of instructions Dynamic Static Take advantage of parallelism Optimize memory access patterns Use special hardware when available 5 Matrix Multiply Simple Version Heavy use of memory operations addition and multiplication Contains redundant operations CS 740 F 03 6 Matrix Multiply Hand Optimized for i 0 i SIZE i for j 0 j SIZE j for k 0 k SIZE k c i j a i k b k j Turned array accesses into pointer dereferences Assign to each element of c just once 7 for i 0 i SIZE i int orig pa a i 0 for j 0 j SIZE j int pa orig pa int pb b 0 j int sum 0 for k 0 k SIZE k sum pa pb pa pb SIZE c i j sum CS 740 F 03 CS 740 F 03 Results Is the optimized code optimal 8 R10000 cc O0 cc O3 egcc O9 Simple 34 7s 5 3s 10 1s Optimized 27 4s 8 0s 8 3s 21164 cc O0 cc O5 egcc O0 egcc O9 Simple 40 5s 16 7s 27 2s 12 3s Optimized 12 2s 18 6s 19 5s 14 7s Pentium II Simple egcc O9 28 4s Optimized 25 3s RS 6000 xlC O3 Optimized 65 3s Simple 63 9s CS 740 F 03 Why is Simple Better Easier for humans and the compiler to understand The more the compiler knows the more it can do Pointers are hard to analyze arrays are easier You never know how fast code will run until you time it The transformations we did by hand good optimizers will do for us Optimization blocker pointers Aliasing if a compiler can t tell what a pointer points at it must be conservative and assume it can point at almost anything Eg void strcpy char dst char src while src 0 dst src dst 0 And they will often do a better job than we can do Pointers may cause aliases and data dependences where the array code had none 9 Could optimize to a much better loop if only we knew that our strings do not alias each other CS 740 F 03 10 SGI s Superior Compiler Loop Interchange for i 0 i SIZE i for j 0 j SIZE j for k 0 k SIZE k c i j a i k b k j Loop unrolling Central loop is unrolled 2X Code scheduling Loads are moved up in the schedule to hide their latency Loop interchange Inner two loops are interchanged giving us ikj rather than ijk Better cache performance gives us a huge benefit Software pipelining Do loads for next iteration while doing multiply for current iteration Strength reduction Add 4 to current array location to get next one rather than multiplying by index Loop invariant code motion Values which are constants are not re computed for each loop iteration 11 CS 740 F 03 CS 740 F 03 Does any loop iteration read a value produced by any other iteration What do the memory access patterns look like in the inner loop ijk ikj jik jki kij kji constant sequential striding sequential constant sequential constant sequential striding striding striding constant sequential constant sequential striding striding constant 12 CS 740 F 03 Software Pipelining for j 0 j SIZE j for j 0 j SIZE j c r j a r c b r j Pipelined c r j a r c b r j load b r j load b r j load c r j a r c store c r j load c r j a r c store c r j load b r j store c r j a r c load c r j load b r j a r c load c r j load b r j a r c store c r j load c r j load b r j a r c store c r j load c r j load b r j a r c store c r j load c r j load b r j a r c store c r j load c r j load b r j a r c store c r j load c r j store c r j load b r j load c r j load b r j Steady State Dataflow graph a r c Fill Not pipelined load c r j a r c Drain Now must optimize inner loop Want to do as much work as possible in each iteration Keep all of the functional units busy in the processor Software Pipelining cont store c r j store c r j 13 CS 740 F 03 Code Motion Examples 14 CS 740 F 03 Optimization Blocker Procedure Calls Why couldn t the compiler move fact n out of the inner loop Procedure May Have Side Effects Sum Integers from 1 to n Bad i e alters global state each time called sum 0 for i 0 i fact n i sum i Function May Not Return Same Value for Given Arguments Depends on other parts of global state Better Why doesn t compiler look at code for fact n sum 0 fn fact n for i 0 i fn i sum i sum 0 for i fact n i 0 i sum i Linker may overload with different version Unless declared static Interprocedural optimization …


View Full Document

CMU CS 15740 - Lecture

Documents in this Course
leecture

leecture

17 pages

Lecture

Lecture

9 pages

Lecture

Lecture

36 pages

Lecture

Lecture

9 pages

Lecture

Lecture

13 pages

lecture

lecture

25 pages

lect17

lect17

7 pages

Lecture

Lecture

65 pages

Lecture

Lecture

28 pages

lect07

lect07

24 pages

lect07

lect07

12 pages

lect03

lect03

3 pages

lecture

lecture

11 pages

lecture

lecture

20 pages

lecture

lecture

11 pages

Lecture

Lecture

9 pages

Lecture

Lecture

10 pages

Lecture

Lecture

22 pages

Lecture

Lecture

28 pages

Lecture

Lecture

18 pages

lecture

lecture

63 pages

lecture

lecture

13 pages

Lecture

Lecture

36 pages

Lecture

Lecture

18 pages

Lecture

Lecture

17 pages

Lecture

Lecture

12 pages

lecture

lecture

34 pages

lecture

lecture

47 pages

lecture

lecture

7 pages

Lecture

Lecture

18 pages

Lecture

Lecture

21 pages

Lecture

Lecture

10 pages

Lecture

Lecture

39 pages

Lecture

Lecture

11 pages

lect04

lect04

40 pages

Load more
Download Lecture
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Lecture and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Lecture 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?