Unformatted text preview:

Programming for Performance CS 740 Performance Matters Constant factors count easily see 10 1 performance range depending on how code is written must optimize at multiple levels algorithm data representations procedures and loops Oct 7 2002 Must understand system to optimize performance Topics how programs are compiled and executed how to measure program performance and identify bottlenecks how to improve performance without destroying code modularity and generality How architecture impacts your programs How and how not to tune your code Statically scheduled processors 2 Page 1 CS 740 F 02 Optimizing Compilers Limitations of Optimizing Compilers Provide efficient mapping of program to machine Behavior that may be obvious to the programmer can be obfuscated by languages and coding styles register allocation code selection and ordering eliminating minor inefficiencies e g data ranges may be more limited than variable types suggest e g using an int in C for what could be an enumerated type Most analysis is performed only within procedures Don t usually improve asymptotic efficiency up to programmer to select best overall algorithm big O savings are often more important than constant factors but constant factors also matter whole program analysis is too expensive in most cases Most analysis is based only on static information compiler has difficulty anticipating run time inputs Have difficulty overcoming optimization blockers When in doubt the compiler must be conservative potential memory aliasing potential procedure side effects 3 cannot perform optimization if it changes program behavior under any realizable circumstance even if circumstances seem quite bizarre and unlikely CS 740 F 02 4 Page 2 CS 740 F 02 What do compilers try to do Matrix Multiply Simple Version for i 0 i SIZE i for j 0 j SIZE j for k 0 k SIZE k c i j a i k b k j Reduce the number of instructions Dynamic Static Take advantage of parallelism Optimize memory access patterns Use special hardware when available 5 Heavy use of memory operations addition and multiplication Contains redundant operations CS 740 F 02 6 Page 3 CS 740 F 02 Matrix Multiply Hand Optimized for i 0 i SIZE i for j 0 j SIZE j for k 0 k SIZE k c i j a i k b k j Turned array accesses into pointer dereferences Assign to each element of c just once 7 Results for i 0 i SIZE i int orig pa a i 0 for j 0 j SIZE j int pa orig pa int pb a 0 j int sum 0 for k 0 k SIZE k sum pa pb pa pb SIZE c i j sum Is the optimized code optimal CS 740 F 02 8 Page 4 R10000 cc O0 cc O3 egcc O9 Simple 34 7s 5 3s 10 1s Optimized 27 4s 8 0s 8 3s 21164 cc O0 cc O5 egcc O0 egcc O9 Simple 40 5s 16 7s 27 2s 12 3s Optimized 12 2s 18 6s 19 5s 14 7s Pentium II Simple egcc O9 28 4s Optimized 25 3s RS 6000 xlC O3 Optimized 65 3s Simple 63 9s CS 740 F 02 Why is Simple Better Optimization blocker pointers Easier for humans and the compiler to understand Aliasing if a compiler can t tell what a pointer points at it must be conservative and assume it can point at almost anything Eg void strcpy char dst char src The more the compiler knows the more it can do Pointers are hard to analyze arrays are easier You never know how fast code will run until you time it The transformations we did by hand good optimizers will do for us while src 0 dst src dst 0 And they will often do a better job than we can do Pointers may cause aliases and data dependences where the array code had none 9 Could optimize to a much better loop if only we knew that our strings do not alias each other CS 740 F 02 10 Page 5 CS 740 F 02 SGI s Superior Compiler Loop Interchange for i 0 i SIZE i for j 0 j SIZE j for k 0 k SIZE k c i j a i k b k j Loop unrolling Central loop is unrolled 2X Code scheduling Loads are moved up in the schedule to hide their latency Does any loop iteration read a value produced by any other iteration What do the memory access patterns look like in the inner loop Loop interchange Inner two loops are interchanged giving us ikj rather than ijk Better cache performance gives us a huge benefit Software pipelining Do loads for next iteration while doing multiply for current iteration Strength reduction Add 4 to current array location to get next one rather than multiplying by index Loop invariant code motion Values which are constants are not re computed for each loop iteration 11 CS 740 F 02 ijk ikj jik jki kij kji constant sequential striding sequential constant sequential constant sequential striding striding striding constant sequential constant sequential striding striding constant 12 Page 6 CS 740 F 02 Software Pipelining Software Pipelining cont for j 0 j SIZE j Now must optimize inner for j 0 j SIZE j c r j a r c b r j loop Want to do as much work as possible in each Dataflow graph iteration load b r j a r c Keep all of the functional units busy in load c r j the processor c r j a r c b r j Pipelined load b r j a r c load c r j load b r j a r c load c r j load b r j a r c store c r j load c r j load b r j a r c store c r j load c r j load b r j a r c store c r j load c r j load b r j a r c store c r j load c r j load b r j a r c store c r j load c r j store c r j load b r j load c r j Fill Not pipelined a r c load c r j Steady State store c r j load b r j a r c store c r j load b r j load c r j store c r j a r c Drain store c r j store c r j 13 CS 740 F 02 14 Page 7 CS 740 F 02 Code Motion Examples Optimization Blocker Procedure Calls Why couldn t the compiler move fact n out of the inner loop Sum Integers from 1 to n Bad Procedure May Have Side Effects i e alters global state each time called sum 0 for i 0 i fact n i sum i Function May Not Return Same Value for Given Arguments Depends on other parts of global state Better Why doesn t compiler look at code for fact n sum 0 fn fact n for i 0 i fn i sum i Linker may overload with different version Unless declared …


View Full Document

CMU CS 15740 - lecture 09

Documents in this Course
leecture

leecture

17 pages

Lecture

Lecture

9 pages

Lecture

Lecture

36 pages

Lecture

Lecture

9 pages

Lecture

Lecture

13 pages

lecture

lecture

25 pages

lect17

lect17

7 pages

Lecture

Lecture

65 pages

Lecture

Lecture

28 pages

lect07

lect07

24 pages

lect07

lect07

12 pages

lect03

lect03

3 pages

lecture

lecture

11 pages

lecture

lecture

20 pages

lecture

lecture

11 pages

Lecture

Lecture

9 pages

Lecture

Lecture

10 pages

Lecture

Lecture

22 pages

Lecture

Lecture

28 pages

Lecture

Lecture

18 pages

lecture

lecture

63 pages

lecture

lecture

13 pages

Lecture

Lecture

36 pages

Lecture

Lecture

18 pages

Lecture

Lecture

17 pages

Lecture

Lecture

12 pages

lecture

lecture

34 pages

lecture

lecture

47 pages

lecture

lecture

7 pages

Lecture

Lecture

18 pages

Lecture

Lecture

7 pages

Lecture

Lecture

21 pages

Lecture

Lecture

10 pages

Lecture

Lecture

39 pages

Lecture

Lecture

11 pages

lect04

lect04

40 pages

Load more
Loading Unlocking...
Login

Join to view lecture 09 and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view lecture 09 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?