Programming for Performance CS 740 Performance Matters Constant factors count easily see 10 1 performance range depending on how code is written must optimize at multiple levels algorithm data representations procedures and loops Oct 24 2003 Must understand system to optimize performance Topics how programs are compiled and executed how to measure program performance and identify bottlenecks how to improve performance without destroying code modularity and generality How architecture impacts your programs How and how not to tune your code Statically scheduled processors 2 Optimizing Compilers Provide efficient mapping of program to machine register allocation code selection and ordering eliminating minor inefficiencies Limitations of Optimizing Compilers Behavior that may be obvious to the programmer can be obfuscated by languages and coding styles e g data ranges may be more limited than variable types suggest e g using an int in C for what could be an enumerated type Don t usually improve asymptotic efficiency up to programmer to select best overall algorithm big O savings are often more important than constant factors but constant factors also matter Have difficulty overcoming optimization blockers potential memory aliasing potential procedure side effects 3 CS 740 F 03 Most analysis is performed only within procedures whole program analysis is too expensive in most cases Most analysis is based only on static information compiler has difficulty anticipating run time inputs When in doubt the compiler must be conservative cannot perform optimization if it changes program behavior under any realizable circumstance even if circumstances seem quite bizarre and unlikely CS 740 F 03 4 CS 740 F 03 What do compilers try to do for i 0 i SIZE i for j 0 j SIZE j for k 0 k SIZE k c i j a i k b k j Reduce the number of instructions Dynamic Static Take advantage of parallelism Optimize memory access patterns Use special hardware when available 5 Matrix Multiply Simple Version Heavy use of memory operations addition and multiplication Contains redundant operations CS 740 F 03 6 Matrix Multiply Hand Optimized for i 0 i SIZE i for j 0 j SIZE j for k 0 k SIZE k c i j a i k b k j Turned array accesses into pointer dereferences Assign to each element of c just once 7 for i 0 i SIZE i int orig pa a i 0 for j 0 j SIZE j int pa orig pa int pb b 0 j int sum 0 for k 0 k SIZE k sum pa pb pa pb SIZE c i j sum CS 740 F 03 CS 740 F 03 Results Is the optimized code optimal 8 R10000 cc O0 cc O3 egcc O9 Simple 34 7s 5 3s 10 1s Optimized 27 4s 8 0s 8 3s 21164 cc O0 cc O5 egcc O0 egcc O9 Simple 40 5s 16 7s 27 2s 12 3s Optimized 12 2s 18 6s 19 5s 14 7s Pentium II Simple egcc O9 28 4s Optimized 25 3s RS 6000 xlC O3 Optimized 65 3s Simple 63 9s CS 740 F 03 Why is Simple Better Easier for humans and the compiler to understand The more the compiler knows the more it can do Pointers are hard to analyze arrays are easier You never know how fast code will run until you time it The transformations we did by hand good optimizers will do for us Optimization blocker pointers Aliasing if a compiler can t tell what a pointer points at it must be conservative and assume it can point at almost anything Eg void strcpy char dst char src while src 0 dst src dst 0 And they will often do a better job than we can do Pointers may cause aliases and data dependences where the array code had none 9 Could optimize to a much better loop if only we knew that our strings do not alias each other CS 740 F 03 10 SGI s Superior Compiler Loop Interchange for i 0 i SIZE i for j 0 j SIZE j for k 0 k SIZE k c i j a i k b k j Loop unrolling Central loop is unrolled 2X Code scheduling Loads are moved up in the schedule to hide their latency Loop interchange Inner two loops are interchanged giving us ikj rather than ijk Better cache performance gives us a huge benefit Software pipelining Do loads for next iteration while doing multiply for current iteration Strength reduction Add 4 to current array location to get next one rather than multiplying by index Loop invariant code motion Values which are constants are not re computed for each loop iteration 11 CS 740 F 03 CS 740 F 03 Does any loop iteration read a value produced by any other iteration What do the memory access patterns look like in the inner loop ijk ikj jik jki kij kji constant sequential striding sequential constant sequential constant sequential striding striding striding constant sequential constant sequential striding striding constant 12 CS 740 F 03 Software Pipelining for j 0 j SIZE j for j 0 j SIZE j c r j a r c b r j Pipelined c r j a r c b r j load b r j load b r j load c r j a r c store c r j load c r j a r c store c r j load b r j store c r j a r c load c r j load b r j a r c load c r j load b r j a r c store c r j load c r j load b r j a r c store c r j load c r j load b r j a r c store c r j load c r j load b r j a r c store c r j load c r j load b r j a r c store c r j load c r j store c r j load b r j load c r j load b r j Steady State Dataflow graph a r c Fill Not pipelined load c r j a r c Drain Now must optimize inner loop Want to do as much work as possible in each iteration Keep all of the functional units busy in the processor Software Pipelining cont store c r j store c r j 13 CS 740 F 03 Code Motion Examples 14 CS 740 F 03 Optimization Blocker Procedure Calls Why couldn t the compiler move fact n out of the inner loop Procedure May Have Side Effects Sum Integers from 1 to n Bad i e alters global state each time called sum 0 for i 0 i fact n i sum i Function May Not Return Same Value for Given Arguments Depends on other parts of global state Better Why doesn t compiler look at code for fact n sum 0 fn fact n for i 0 i fn i sum i sum 0 for i fact n i 0 i sum i Linker may overload with different version Unless declared static Interprocedural optimization …
View Full Document