15 213 The course that gives CMU its Zip Code Optimization I Machine Independent Optimizations Sept 26 2002 Topics n Machine Independent Optimizations l Code motion l Reduction in strength l Common subexpression sharing n Tuning l Identifying performance bottlenecks class10 ppt Great Reality 4 There s more to performance than asymptotic complexity Constant factors matter too n Easily see 10 1 performance range depending on how code is written n Must optimize at multiple levels l algorithm data representations procedures and loops Must understand system to optimize performance 2 n How programs are compiled and executed n How to measure program performance and identify bottlenecks n How to improve performance without destroying code modularity and generality 15 213 F 02 Optimizing Compilers Provide efficient mapping of program to machine n register allocation n code selection and ordering n eliminating minor inefficiencies Don t usually improve asymptotic efficiency n up to programmer to select best overall algorithm n big O savings are often more important than constant factors l but constant factors also matter Have difficulty overcoming optimization blockers 3 n potential memory aliasing n potential procedure side effects 15 213 F 02 Limitations of Optimizing Optimizing Compilers Compilers Operate Under Fundamental Constraint n Must not cause any change in program behavior under any possible condition n Often prevents it from making optimizations when would only affect behavior under pathological conditions Behavior that may be obvious to the programmer can be obfuscated by languages and coding styles n e g data ranges may be more limited than variable types suggest Most analysis is performed only within procedures n whole program analysis is too expensive in most cases Most analysis is based only on static information n compiler has difficulty anticipating run time inputs When in doubt the compiler must be conservative 4 15 213 F 02 Machine Independent Machine Independent Optimizations Optimizations n Optimizations you should do regardless of processor compiler Code Motion n Reduce frequency with which computation performed l If it will always produce same result l Especially moving code out of loop for i 0 i n i for j 0 j n j a n i j b j 5 for i int ni for j a ni 0 i n i n i 0 j n j j b j 15 213 F 02 Compiler Generated Code Code Motion Motion n Most compilers do a good job with array code simple loop structures Code Generated by GCC for i int ni int p for j p for i 0 i n i for j 0 j n j a n i j b j imull ebx eax movl 8 ebp edi leal edi eax 4 edx Inner Loop L40 movl 12 ebp edi movl edi ecx 4 eax movl eax edx addl 4 edx incl ecx jl L40 6 0 i n i n i a ni 0 j n j b j i n a p a i n scaled by 4 b b j p p j loop scaled by 4 b j scaled by 4 if j n 15 213 F 02 Reduction in Strength Strength n Replace costly operation with simpler one n Shift add instead of multiply or divide 16 x x 4 l Utility machine dependent l Depends on cost of multiply or divide instruction l On Pentium II or III integer multiply only requires 4 CPU cycles n Recognize sequence of products for i 0 i n i for j 0 j n j a n i j b j 7 int ni 0 for i 0 i n i for j 0 j n j a ni j b j ni n 15 213 F 02 Make Use of Registers Registers n Reading and writing registers much faster than reading writing memory Limitation 8 n Compiler not always able to determine whether variable can be held in register n Possibility of Aliasing n See example later 15 213 F 02 Machine Independent Machine Independent Opts Opts Cont Cont Share Common Subexpressions n Reuse portions of expressions n Compilers often not very sophisticated in exploiting arithmetic properties Sum neighbors of i j up val i 1 n j down val i 1 n j left val i n j 1 right val i n j 1 sum up down left right 3 multiplications i n i 1 n i 1 n leal 1 edx ecx imull ebx ecx leal 1 edx eax imull ebx eax imull ebx edx 9 int inj i n up val inj down val inj left val inj right val inj sum up down j n n 1 1 left right 1 multiplication i n i 1 i 1 n i 1 i 1 n i n 15 213 F 02 Vector ADT length data 0 1 2 length 1 Procedures vec ptr new vec int len l Create vector of specified length int get vec element vec ptr v int index int dest l Retrieve vector element store at dest l Return 0 if out of bounds 1 if successful int get vec start vec ptr v l Return pointer to start of vector data n Similar to array implementations in Pascal ML Java l E g always do bounds checking 10 15 213 F 02 Optimization Example void combine1 vec ptr v int dest int i dest 0 for i 0 i vec length v i int val get vec element v i val dest val Procedure 11 n Compute sum of all elements of vector n Store result at destination location 15 213 F 02 Time Scales Absolute Time n Typically use nanoseconds l 10 9 seconds n Time scale of computer instructions Clock Cycles n Most computers controlled by high frequency clock signal n Typical Range l 100 MHz 108 cycles per second Clock period 10ns l 2 GHz 2 X 109 cycles per second Clock period 0 5ns n 12 Fish machines 550 MHz 1 8 ns clock period 15 213 F 02 Cycles Per Element n Convenient way to express performance of program that operators on vectors or lists n Length n n T CPE n Overhead 1000 900 800 vsum1 Slope 4 0 Cycles 700 600 500 vsum2 Slope 3 5 400 300 200 100 0 0 50 100 150 200 Elements 13 15 213 F 02 Optimization Example void combine1 vec ptr v int dest int i dest 0 for i 0 i vec length v i int val get vec element v i val dest val Procedure n Compute sum of all elements of integer vector n Store result at destination location n Vector data structure and operations defined via abstract data type Pentium II III Performance Clock Cycles Element 14 n 42 06 Compiled g 31 25 Compiled O2 15 213 F 02 Understanding Loop Loop void combine1 goto vec ptr v int dest int i 0 int val dest 0 if i vec length v goto done 1 iteration loop get vec element v i val dest val i if i vec length v goto loop done Inefficiency 15 n Procedure vec length called every iteration n Even though result always the same …
View Full Document