Code Optimization II:Machine Independent OptimizationsTopicsTopics Machine-Independent Optimizations Code motion Reduction in strength Common subexpression sharing Tuning Identifying performance bottlenecksSystems I2Vector ADTProceduresProceduresvec_ptr new_vec(int len) Create vector of specified lengthint get_vec_element(vec_ptr v, int index, int *dest) Retrieve vector element, store at *dest Return 0 if out of bounds, 1 if successfulint *get_vec_start(vec_ptr v) Return pointer to start of vector data Similar to array implementations in Pascal, ML, Java E.g., always do bounds checkinglengthdata • • •0 1 2 length–13Optimization ExampleProcedureProcedure Compute sum of all elements of integer vector Store result at destination location Vector data structure and operations defined via abstract data typePentium II/III Performance: Clock Cycles / ElementPentium II/III Performance: Clock Cycles / Element 42.06 (Compiled -g) 31.25 (Compiled -O2)void combine1(vec_ptr v, int *dest){ int i; *dest = 0; for (i = 0; i < vec_length(v); i++) { int val; get_vec_element(v, i, &val); *dest += val; }}4Reduction in StrengthOptimizationOptimization Avoid procedure call to retrieve each vector elementGet pointer to start of array before loopWithin loop just do pointer referenceNot as clean in terms of data abstraction CPE: 6.00 (Compiled -O2)Procedure calls are expensive!Bounds checking is expensivevoid combine2(vec_ptr v, int *dest){ int i; int length = vec_length(v); int *data = get_vec_start(v); *dest = 0; for (i = 0; i < length; i++) { *dest += data[i];}5Eliminate Unneeded Memory RefsOptimizationOptimization Donʼt need to store in destination until end Local variable sum held in register Avoids 1 memory read, 1 memory write per cycle CPE: 2.00 (Compiled -O2)Memory references are expensive!void combine3(vec_ptr v, int *dest){ int i; int length = vec_length(v); int *data = get_vec_start(v); int sum = 0; for (i = 0; i < length; i++) sum += data[i]; *dest = sum;}6Detecting Unneeded Memory Refs.PerformancePerformance Combine25 instructions in 6 clock cycles addl must read and write memory Combine34 instructions in 2 clock cycles.L18:movl (%ecx,%edx,4),%eaxaddl %eax,(%edi)incl %edxcmpl %esi,%edxjl .L18Combine2.L24:addl (%eax,%edx,4),%ecxincl %edxcmpl %esi,%edxjl .L24Combine37Optimization Blocker: Memory AliasingAliasingAliasing Two different memory references specify single locationExampleExample v: [3, 2, 17] combine2(v, get_vec_start(v)+2) --> ? combine3(v, get_vec_start(v)+2) --> ?ObservationsObservations Easy to have happen in CSince allowed to do address arithmeticDirect access to storage structures Get in habit of introducing local variablesAccumulating within loopsYour way of telling compiler not to check for aliasing8Previous Best Combining CodeTaskTask Compute sum of all elements in vector Vector represented by C-style abstract data type Achieved CPE of 2.00 Cycles per elementvoid combine4(vec_ptr v, int *dest){ int i; int length = vec_length(v); int *data = get_vec_start(v); int sum = 0; for (i = 0; i < length; i++) sum += data[i]; *dest = sum;}9General Forms of CombiningData TypesData Types Use different declarationsfor data_t int float doublevoid abstract_combine4(vec_ptr v, data_t *dest){ int i; int length = vec_length(v); data_t *data = get_vec_start(v); data_t t = IDENT; for (i = 0; i < length; i++) t = t OP data[i]; *dest = t;}OperationsOperations Use different definitionsof OP and IDENT + / 0 * / 110Machine Independent Opt. ResultsOptimizationsOptimizations Reduce function calls and memory references within loopPerformance AnomalyPerformance Anomaly Computing FP product of all elements exceptionally slow. Very large speedup when accumulate in temporary Caused by quirk of IA32 floating point Memory uses 64-bit format, register use 80 Benchmark data caused overflow of 64 bits, but not 80Integer Floating Point Method + * + * Abstract -g 42.06 41.86 41.44 160.00 Abstract -O2 31.25 33.25 31.25 143.00 Move vec_length 20.66 21.25 21.15 135.00 data access 6.00 9.00 8.00 117.00 Accum. in temp 2.00 4.00 3.00 5.0011Pointer CodeOptimizationOptimization Use pointers rather than array references CPE: 3.00 (Compiled -O2) Oops! Weʼre not making progress here!Warning: Some compilers do better job optimizing array codevoid combine4p(vec_ptr v, int *dest){ int length = vec_length(v); int *data = get_vec_start(v); int *dend = data+length; int sum = 0; while (data < dend) { sum += *data; data++; } *dest = sum;}12Pointer vs. Array Code Inner LoopsArray CodeArray CodePointer CodePointer CodePerformancePerformance Array Code: 4 instructions in 2 clock cycles Pointer Code: Almost same 4 instructions in 3 clock cycles.L24: # Loop:addl (%eax,%edx,4),%ecx # sum += data[i]incl %edx # i++cmpl %esi,%edx # i:lengthjl .L24 # if < goto Loop.L30: # Loop:addl (%eax),%ecx # sum += *dataaddl $4,%eax # data ++cmpl %edx,%eax # data:dendjb .L30 # if < goto Loop13Machine-Independent Opt. SummaryCode MotionCode Motion Compilers are good at this for simple loop/array structures Donʼt do well in presence of procedure calls and memory aliasingReduction in StrengthReduction in Strength Shift, add instead of multiply or divide compilers are (generally) good at this Exact trade-offs machine-dependent Keep data in registers rather than memory compilers are not good at this, since concerned with aliasingShare Common Share Common SubexpressionsSubexpressions compilers have limited algebraic reasoning capabilities14Important ToolsMeasurementMeasurement Accurately compute time taken by codeMost modern machines have built in cycle countersUsing them to get reliable measurements is tricky Profile procedure calling frequenciesUnix tool gprofObservationObservation Generating assembly codeLets you see what optimizations compiler can makeUnderstand capabilities/limitations of particular compiler15Code Profiling ExampleTaskTask Count word frequencies in text document Produce sorted list of words from most frequent to leastStepsSteps Convert strings to lowercase Apply hash function Read words and insert into hash table Mostly list operations Maintain counter for each unique word Sort resultsData
View Full Document