CMU CS 15740 - Lecture

Programming for Performance
CS 740
Oct. 4, 2000

Contents: Performance Matters · Optimizing Compilers · Limitations of Optimizing Compilers · What do compilers try to do? · Matrix Multiply – Simple Version · Matrix Multiply – Hand Optimized · Results · Why is Simple Better? · Optimization blocker: pointers · SGI's Superior Compiler · Loop Interchange · Software Pipelining · Software Pipelining cont. · Code Motion Examples · Optimization Blocker: Procedure Calls · Role of Programmer

Topics
•How architecture impacts your programs
•How (and how not) to tune your code

Performance Matters
Constant factors count!
•easily see a 10:1 performance range depending on how code is written
•must optimize at multiple levels:
–algorithm, data representations, procedures, and loops
Must understand the system to optimize performance
•how programs are compiled and executed
•how to measure program performance and identify bottlenecks
•how to improve performance without destroying code modularity and generality

Optimizing Compilers
Provide an efficient mapping of program to machine
•register allocation
•code selection and ordering
•eliminating minor inefficiencies
Don't (usually) improve asymptotic efficiency
•up to the programmer to select the best overall algorithm
•big-O savings are (often) more important than constant factors
–but constant factors also matter
Have difficulty overcoming "optimization blockers"
•potential memory aliasing
•potential procedure side effects

Limitations of Optimizing Compilers
Behavior that may be obvious to the programmer can be obscured by languages and coding styles
•e.g., data ranges may be more limited than variable types suggest
–e.g., using an "int" in C for what could be an enumerated type
Most analysis is performed only within procedures
•whole-program analysis is too expensive in most cases
Most analysis is based only on static information
•the compiler has difficulty anticipating run-time inputs
When in doubt, the compiler must be conservative
•it cannot perform an optimization if that changes program behavior under any realizable circumstance
–even if the circumstances seem quite bizarre and unlikely

What do compilers try to do?
Reduce the number of instructions
•dynamic
•static
Take advantage of parallelism
Eliminate useless work
Optimize memory access patterns
Use special hardware when available

Matrix Multiply – Simple Version
Heavy use of memory operations, addition, and multiplication; contains redundant operations.

    for (i = 0; i < SIZE; i++) {
      for (j = 0; j < SIZE; j++) {
        for (k = 0; k < SIZE; k++) {
          c[i][j] += a[i][k] * b[k][j];
        }
      }
    }

Matrix Multiply – Hand Optimized
Turned array accesses into pointer dereferences; assign to each element of c just once. (The second pointer must walk down a column of b; the preview's "&a[0][j]" is corrected to "&b[0][j]" below.)

    for (i = 0; i < SIZE; i++) {
      int *orig_pa = &a[i][0];
      for (j = 0; j < SIZE; j++) {
        int *pa = orig_pa;
        int *pb = &b[0][j];
        int sum = 0;
        for (k = 0; k < SIZE; k++) {
          sum += *pa * *pb;
          pa++;
          pb += SIZE;
        }
        c[i][j] = sum;
      }
    }

Results
Is the "optimized" code optimal?

    R10000       Simple   Optimized
    cc -O0       34.7s    27.4s
    cc -O3        5.3s     8.0s
    egcc -O9     10.1s     8.3s

    21164        Simple   Optimized
    cc -O0       40.5s    12.2s
    cc -O5       16.7s    18.6s
    egcc -O0     27.2s    19.5s
    egcc -O9     12.3s    14.7s

    Pentium II   Simple   Optimized
    egcc -O9     28.4s    25.3s

    RS/6000      Simple   Optimized
    xlC -O3      63.9s    65.3s

Why is Simple Better?
•Easier for humans and the compiler to understand
–the more the compiler knows, the more it can do
•Pointers are hard to analyze; arrays are easier
•You never know how fast code will run until you time it
•The transformations we did by hand, good optimizers will do for us
–and they will often do a better job than we can
•Pointers may cause aliases and data dependences where the array code had none

Optimization blocker: pointers
Aliasing: if a compiler can't tell what a pointer points at, it must be conservative and assume it can point at almost anything. E.g., we could optimize the loop below much more aggressively if only we knew that our strings do not alias each other. (The preview's copy loop incremented src before copying, skipping the first character; it is corrected here.)

    void strcpy(char *dst, char *src)
    {
      while (*src != '\0')
        *(dst++) = *(src++);
      *dst = '\0';
    }

SGI's Superior Compiler
Loop unrolling
•the central loop is unrolled 2x
Code scheduling
•loads are moved up in the schedule to hide their latency
Loop interchange
•the inner two loops are interchanged, giving us ikj rather than ijk
–better cache performance gives us a huge benefit
Software pipelining
•do the loads for the next iteration while doing the multiply for the current iteration
Strength reduction
•add 4 to the current array location to get the next one, rather than multiplying by the index
Loop-invariant code motion
•values that are constant across iterations are not re-computed for each loop iteration

Loop Interchange
Does any loop iteration read a value produced by any other iteration? What do the memory access patterns look like in the inner loop?
•ijk: constant += sequential * striding
•ikj: sequential += constant * sequential
•jik: constant += sequential * striding
•jki: striding += striding * constant
•kij: sequential += constant * sequential
•kji: striding += striding * constant

    for (i = 0; i < SIZE; i++)
      for (j = 0; j < SIZE; j++)
        for (k = 0; k < SIZE; k++)
          c[i][j] += a[i][k] * b[k][j];

Software Pipelining

    for (j = 0; j < SIZE; j++)
      c_r[j] += a_r_c * b_r[j];

[Figure: dataflow graph of one iteration. "load b_r[j]" feeds a multiply by a_r_c; the product and "load c_r[j]" feed an add; the sum is stored back to c_r[j].]
•Now we must optimize the inner loop
•We want to do as much work as possible in each iteration
•Keep all of the functional units in the processor busy

Software Pipelining cont.
Not pipelined: each iteration performs load b_r[j], load c_r[j], multiply, add, and store c_r[j] strictly in sequence before the next iteration begins.
Pipelined: the loads for the next iteration overlap the multiply, add, and store of the current one, so the schedule has three phases: Fill (the first loads are issued), Steady State (loads, arithmetic, and stores from adjacent iterations are in flight at once), and Drain (the last stores complete).
[Figure: the original slide repeats the per-iteration dataflow graph across iterations, staggered to show the Fill, Steady State, and Drain phases.]

Code Motion Examples
•Sum integers from 1 to n!
Bad:

    sum = 0;
    for (i = 0; i <= fact(n); i++)
      sum +=

