DOC PREVIEW
UT CS 429H - Machine Independent Optimizations

This preview shows page 1-2-21-22 out of 22 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 22 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 22 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 22 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 22 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 22 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

Code Optimization II:Machine Independent OptimizationsTopicsTopics Machine-Independent Optimizations Code motion Reduction in strength Common subexpression sharing Tuning Identifying performance bottlenecksSystems I2Vector ADTProceduresProceduresvec_ptr new_vec(int len) Create vector of specified lengthint get_vec_element(vec_ptr v, int index, int *dest) Retrieve vector element, store at *dest Return 0 if out of bounds, 1 if successfulint *get_vec_start(vec_ptr v) Return pointer to start of vector data Similar to array implementations in Pascal, ML, Java E.g., always do bounds checkinglengthdata • • •0 1 2 length–13Optimization ExampleProcedureProcedure Compute sum of all elements of integer vector Store result at destination location Vector data structure and operations defined via abstract data typePentium II/III Performance: Clock Cycles / ElementPentium II/III Performance: Clock Cycles / Element 42.06 (Compiled -g) 31.25 (Compiled -O2)void combine1(vec_ptr v, int *dest){ int i; *dest = 0; for (i = 0; i < vec_length(v); i++) { int val; get_vec_element(v, i, &val); *dest += val; }}4Reduction in StrengthOptimizationOptimization Avoid procedure call to retrieve each vector elementGet pointer to start of array before loopWithin loop just do pointer referenceNot as clean in terms of data abstraction CPE: 6.00 (Compiled -O2)Procedure calls are expensive!Bounds checking is expensivevoid combine2(vec_ptr v, int *dest){ int i; int length = vec_length(v); int *data = get_vec_start(v); *dest = 0; for (i = 0; i < length; i++) { *dest += data[i];}5Eliminate Unneeded Memory RefsOptimizationOptimization Donʼt need to store in destination until end Local variable sum held in register Avoids 1 memory read, 1 memory write per cycle CPE: 2.00 (Compiled -O2)Memory references are expensive!void combine3(vec_ptr v, int *dest){ int i; int length = vec_length(v); int *data = get_vec_start(v); int sum = 0; for (i = 0; i < length; i++) sum += data[i]; *dest = sum;}6Detecting Unneeded Memory Refs.PerformancePerformance Combine25 instructions in 6 clock cycles addl must read and write memory Combine34 instructions in 2 clock cycles.L18:movl (%ecx,%edx,4),%eaxaddl %eax,(%edi)incl %edxcmpl %esi,%edxjl .L18Combine2.L24:addl (%eax,%edx,4),%ecxincl %edxcmpl %esi,%edxjl .L24Combine37Optimization Blocker: Memory AliasingAliasingAliasing Two different memory references specify single locationExampleExample v: [3, 2, 17] combine2(v, get_vec_start(v)+2) --> ? combine3(v, get_vec_start(v)+2) --> ?ObservationsObservations Easy to have happen in CSince allowed to do address arithmeticDirect access to storage structures Get in habit of introducing local variablesAccumulating within loopsYour way of telling compiler not to check for aliasing8Previous Best Combining CodeTaskTask Compute sum of all elements in vector Vector represented by C-style abstract data type Achieved CPE of 2.00 Cycles per elementvoid combine4(vec_ptr v, int *dest){ int i; int length = vec_length(v); int *data = get_vec_start(v); int sum = 0; for (i = 0; i < length; i++) sum += data[i]; *dest = sum;}9General Forms of CombiningData TypesData Types Use different declarationsfor data_t int float doublevoid abstract_combine4(vec_ptr v, data_t *dest){ int i; int length = vec_length(v); data_t *data = get_vec_start(v); data_t t = IDENT; for (i = 0; i < length; i++) t = t OP data[i]; *dest = t;}OperationsOperations Use different definitionsof OP and IDENT + / 0 * / 110Machine Independent Opt. ResultsOptimizationsOptimizations Reduce function calls and memory references within loopPerformance AnomalyPerformance Anomaly Computing FP product of all elements exceptionally slow. Very large speedup when accumulate in temporary Caused by quirk of IA32 floating point Memory uses 64-bit format, register use 80 Benchmark data caused overflow of 64 bits, but not 80Integer Floating Point Method + * + * Abstract -g 42.06 41.86 41.44 160.00 Abstract -O2 31.25 33.25 31.25 143.00 Move vec_length 20.66 21.25 21.15 135.00 data access 6.00 9.00 8.00 117.00 Accum. in temp 2.00 4.00 3.00 5.0011Pointer CodeOptimizationOptimization Use pointers rather than array references CPE: 3.00 (Compiled -O2) Oops! Weʼre not making progress here!Warning: Some compilers do better job optimizing array codevoid combine4p(vec_ptr v, int *dest){ int length = vec_length(v); int *data = get_vec_start(v); int *dend = data+length; int sum = 0; while (data < dend) { sum += *data; data++; } *dest = sum;}12Pointer vs. Array Code Inner LoopsArray CodeArray CodePointer CodePointer CodePerformancePerformance Array Code: 4 instructions in 2 clock cycles Pointer Code: Almost same 4 instructions in 3 clock cycles.L24: # Loop:addl (%eax,%edx,4),%ecx # sum += data[i]incl %edx # i++cmpl %esi,%edx # i:lengthjl .L24 # if < goto Loop.L30: # Loop:addl (%eax),%ecx # sum += *dataaddl $4,%eax # data ++cmpl %edx,%eax # data:dendjb .L30 # if < goto Loop13Machine-Independent Opt. SummaryCode MotionCode Motion Compilers are good at this for simple loop/array structures Donʼt do well in presence of procedure calls and memory aliasingReduction in StrengthReduction in Strength Shift, add instead of multiply or divide compilers are (generally) good at this Exact trade-offs machine-dependent Keep data in registers rather than memory compilers are not good at this, since concerned with aliasingShare Common Share Common SubexpressionsSubexpressions compilers have limited algebraic reasoning capabilities14Important ToolsMeasurementMeasurement Accurately compute time taken by codeMost modern machines have built in cycle countersUsing them to get reliable measurements is tricky Profile procedure calling frequenciesUnix tool gprofObservationObservation Generating assembly codeLets you see what optimizations compiler can makeUnderstand capabilities/limitations of particular compiler15Code Profiling ExampleTaskTask Count word frequencies in text document Produce sorted list of words from most frequent to leastStepsSteps Convert strings to lowercase Apply hash function Read words and insert into hash table Mostly list operations Maintain counter for each unique word Sort resultsData


View Full Document

UT CS 429H - Machine Independent Optimizations

Download Machine Independent Optimizations
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Machine Independent Optimizations and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Machine Independent Optimizations 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?