Unformatted text preview:

Systems I Code Optimization II Machine Independent Optimizations Topics Machine Independent Optimizations Code motion Reduction in strength Common subexpression sharing Tuning Identifying performance bottlenecks Vector ADT length data 0 1 2 length 1 Procedures vec ptr new vec int len Create vector of specified length int get vec element vec ptr v int index int dest Retrieve vector element store at dest Return 0 if out of bounds 1 if successful int get vec start vec ptr v Return pointer to start of vector data Similar to array implementations in Pascal ML Java E g always do bounds checking 2 Optimization Example void combine1 vec ptr v int dest int i dest 0 for i 0 i vec length v i int val get vec element v i val dest val Procedure Compute sum of all elements of integer vector Store result at destination location Vector data structure and operations defined via abstract data type Pentium II III Performance Clock Cycles Element 42 06 Compiled g 31 25 Compiled O2 3 Reduction in Strength void combine2 vec ptr v int dest int i int length vec length v int data get vec start v dest 0 for i 0 i length i dest data i Optimization Avoid procedure call to retrieve each vector element Get pointer to start of array before loop Within loop just do pointer reference Not as clean in terms of data abstraction CPE 6 00 Compiled O2 Procedure calls are expensive Bounds checking is expensive 4 Eliminate Unneeded Memory Refs void combine3 vec ptr v int dest int i int length vec length v int data get vec start v int sum 0 for i 0 i length i sum data i dest sum Optimization Don t need to store in destination until end Local variable sum held in register Avoids 1 memory read 1 memory write per cycle CPE 2 00 Compiled O2 Memory references are expensive 5 Detecting Unneeded Memory Refs Combine2 L18 Combine3 L24 movl ecx edx 4 eax addl eax edi incl edx cmpl esi edx jl L18 addl eax edx 4 ecx incl edx cmpl esi edx jl L24 Performance Combine2 5 instructions in 6 clock cycles addl must read and write memory Combine3 4 instructions in 2 clock cycles 6 Optimization Blocker Memory Aliasing Aliasing Two different memory references specify single location Example v 3 2 17 combine2 v get vec start v 2 combine3 v get vec start v 2 Observations Easy to have happen in C Since allowed to do address arithmetic Direct access to storage structures Get in habit of introducing local variables Accumulating within loops Your way of telling compiler not to check for aliasing 7 Previous Best Combining Code void combine4 vec ptr v int dest int i int length vec length v int data get vec start v int sum 0 for i 0 i length i sum data i dest sum Task Compute sum of all elements in vector Vector represented by C style abstract data type Achieved CPE of 2 00 Cycles per element 8 General Forms of Combining void abstract combine4 vec ptr v data t dest int i int length vec length v data t data get vec start v data t t IDENT for i 0 i length i t t OP data i dest t Data Types Use different declarations for data t int float double Operations Use different definitions of OP and IDENT 0 1 9 Machine Independent Opt Results Optimizations Reduce function calls and memory references within loop Method Integer Abstract g Abstract O2 Move vec length data access Accum in temp Floating Point 42 06 31 25 20 66 6 00 2 00 41 86 33 25 21 25 9 00 4 00 41 44 31 25 21 15 8 00 3 00 160 00 143 00 135 00 117 00 5 00 Performance Anomaly Computing FP product of all elements exceptionally slow Very large speedup when accumulate in temporary Caused by quirk of IA32 floating point Memory uses 64 bit format register use 80 Benchmark data caused overflow of 64 bits but not 80 10 Pointer Code void combine4p vec ptr v int dest int length vec length v int data get vec start v int dend data length int sum 0 while data dend sum data data dest sum Optimization Use pointers rather than array references CPE 3 00 Compiled O2 Oops We re not making progress here Warning Some compilers do better job optimizing array code 11 Pointer vs Array Code Inner Loops Array Code L24 addl eax edx 4 ecx incl edx cmpl esi edx jl L24 Loop sum data i i i length if goto Loop Pointer Code L30 addl eax ecx addl 4 eax cmpl edx eax jb L30 Loop sum data data data dend if goto Loop Performance Array Code 4 instructions in 2 clock cycles Pointer Code Almost same 4 instructions in 3 clock cycles 12 Machine Independent Opt Summary Code Motion Compilers are good at this for simple loop array structures Don t do well in presence of procedure calls and memory aliasing Reduction in Strength Shift add instead of multiply or divide compilers are generally good at this Exact trade offs machine dependent Keep data in registers rather than memory compilers are not good at this since concerned with aliasing Share Common Subexpressions compilers have limited algebraic reasoning capabilities 13 Important Tools Measurement Accurately compute time taken by code Most modern machines have built in cycle counters Using them to get reliable measurements is tricky Profile procedure calling frequencies Unix tool gprof Observation Generating assembly code Lets you see what optimizations compiler can make Understand capabilities limitations of particular compiler 14 Code Profiling Example Task Count word frequencies in text document Produce sorted list of words from most frequent to least Steps Convert strings to lowercase Shakespeare s most frequent words Apply hash function 29 801 the Read words and insert into hash table 27 529 and Mostly list operations 21 029 I Maintain counter for each unique word 20 957 to 18 514 of 15 370 a 14010 you Sort results Data Set Collected works of Shakespeare 12 936 my 946 596 total words 26 596 unique 11 722 in Initial implementation 9 2 seconds 11 519 that 15 Code Profiling Augment Executable Program with Timing Functions Computes approximate amount of time spent in each function Time computation method Periodically every 10ms interrupt program Determine what function is currently executing Increment its timer by interval e g 10ms Also maintains counter for each function indicating number of times called Using gcc O2 pg prog c o prog prog Executes in normal fashion but also generates file gmon out gprof prog Generates profile information based on gmon out 16 Profiling Results cumulative time seconds 86 60 8 21 5 80 8 76 4 75 9 21 1 27 9 33 self seconds 8 21 0 55 0 45 0 12 calls 1 946596 946596 946596 self ms call 8210 00 0 00 0 00 0 00 total ms call 8210 00 0 00 0 00 0 00 name sort words


View Full Document

UT CS 429H - Machine Independent Optimizations

Loading Unlocking...
Login

Join to view Machine Independent Optimizations and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Machine Independent Optimizations and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?