Code Optimization IISeptember 27, 2006Code Optimization IISeptember 27, 2006TopicsTopics Machine Dependent Optimizationsz Understanding Processor Operationsz Branches and Branch Predictionclass09.ppt15-213“The course that gives CMU its Zip!”–2–15-213, F’06Getting High PerformanceGetting High PerformanceDonDon’’t Do Anything Stupidt Do Anything Stupid Watch out for hidden algorithmic inefficiencies Write compiler-friendly codez Help compiler past optimization blockers: function calls & memory refs.Tune Code For MachineTune Code For Machine Exploit instruction-level parallelism Avoid unpredictable branches Make code cache friendlyz Covered later in course–3–15-213, F’06Modern CPU DesignModern CPU DesignExecutionExecutionFunctionalUnitsInstruction ControlInstruction ControlInteger/BranchFPAddFPMult/DivLoad StoreInstructionCacheDataCacheFetchControlInstructionDecodeAddressInstrs.OperationsPrediction OK?DataDataAddr.Addr.GeneralIntegerOperation ResultsRetirementUnitRegisterFileRegister Updates–4–15-213, F’06CPU Capabilities of Pentium IVCPU Capabilities of Pentium IVMultiple Instructions Can Execute in ParallelMultiple Instructions Can Execute in Parallel 1 load, with address computation 1 store, with address computation 2 simple integer (one may be branch) 1 complex integer (multiply/divide) 1 FP/SSE3 unit 1 FP move (does all conversions)Some Instructions Take > 1 Cycle, but Can be PipelinedSome Instructions Take > 1 Cycle, but Can be Pipelined Instruction Latency Cycles/Issue Load / Store 5 1 Integer Multiply 10 1 Integer/Long Divide 36/106 36/106 Single/Double FP Multiply 7 2 Single/Double FP Add 5 2 Single/Double FP Divide 32/46 32/46–5–15-213, F’06Instruction ControlInstruction ControlGrabs Instruction Bytes From MemoryGrabs Instruction Bytes From Memory Based on current PC + predicted targets for predicted branches Hardware dynamically guesses whether branches taken/not taken and (possibly) branch targetTranslates Instructions Into Translates Instructions Into Operations Operations (for CISC style CPUs)(for CISC style CPUs) Primitive steps required to perform instruction Typical instruction requires 1–3 operationsConverts Register References Into Converts Register References Into TagsTags Abstract identifier linking destination of one operation with sources of later operationsInstruction ControlInstruction ControlInstructionCacheFetchControlInstructionDecodeAddressInstrs.OperationsRetirementUnitRegisterFile–6–15-213, F’06Translating into OperationsTranslating into OperationsGoal: Each Operation Utilizes Single Functional UnitGoal: Each Operation Utilizes Single Functional Unit Requires: Load, Integer arithmetic, Store Exact form and format of operations is trade secret Operations: split up instruction into simpler pieces Devise temporary names to describe how result of one operation gets used by other operationsaddq %rax, 8(%rbx,%rdx,4)load 8(%rbx,%rdx,4) Î temp1imull %rax, temp1 Î temp2store temp2, 8(%rbx,%rdx,4)–7–15-213, F’06Traditional View of Instruction ExecutionTraditional View of Instruction ExecutionImperative ViewImperative View Registers are fixed storage locationsz Individual instructions read & write them Instructions must be executed in specified sequence to guarantee proper program behavioraddq %rax, %rbx # I1andq %rbx, %rdx # I2mulq %rcx, %rbx # I3xorq %rbx, %rdi # I4rax+rbx rdx rcx rdiI1I2I3I4&*^–8–15-213, F’06Dataflow View of Instruction ExecutionDataflow View of Instruction ExecutionFunctional ViewFunctional View View each write as creating new instance of value Operations can be performed as soon as operands available No need to execute in original sequenceaddq %rax, %rbx # I1andq %rbx, %rdx # I2mulq %rcx, %rbx # I3xorq %rbx, %rdi # I4rax.0+rbx.0 rdx.0 rcx.0 rdi.0I1I2/I3I4&*^rbx.1rdx.1rbx.2rdi.0–9–15-213, F’06Example ComputationExample ComputationData TypesData Types Use different declarations for data_t int float doublevoid combine4(vec_ptr v, data_t *dest){int i;int length = vec_length(v);data_t *d = get_vec_start(v);data_t t = IDENT;for (i = 0; i < length; i++)t = t OP d[i];*dest = t;}OperationsOperations Use different definitions of OP and IDENT + / 0 * / 1–10–15-213, F’06Cycles Per ElementCycles Per Element Convenient way to express performance of program that operators on vectors or lists Length = n T = CPE*n + Overhead010020030040050060070080090010000 50 100 150 200Ele mentsCyclesvsum1Slope = 4.0vsum2Slope = 3.5–11–15-213, F’06x86-64 Compilation of Combine4x86-64 Compilation of Combine4Inner Loop (Integer Multiply)Inner Loop (Integer Multiply)PerformancePerformance 5 instructions in 2 clock cyclesL33: # Loop:movl (%eax,%edx,4), %ebx # temp = d[i]incl %edx # i++imull %ebx, %ecx # x *= tempcmpl %esi, %edx # i:lengthjl L33 # if < goto Loop7.005.0010.002.20Combine4Floating PointIntegerMethod–12–15-213, F’06Serial ComputationSerial ComputationComputation (length=12)Computation (length=12)((((((((((((1 * d[0]) * d[1]) * d[2]) * d[3]) * d[4]) * d[5]) * d[6]) * d[7]) * d[8]) * d[9]) * d[10]) * d[11])PerformancePerformance N elements, D cycles/operation N*D cycles **11dd00dd11*dd22*dd33*dd44*dd55*dd66*dd77*dd88*dd99*dd1010*dd11117.005.0010.002.20Combine4Floating PointIntegerMethod–13–15-213, F’06Loop UnrollingLoop Unrolling Perform 2x more useful work per iterationvoid unroll2a_combine(vec_ptr v, data_t *dest){int length = vec_length(v);int limit = length-1;data_t *d = get_vec_start(v);data_t x = IDENT;int i;/* Combine 2 elements at a time */for (i = 0; i < limit; i+=2) {x = (x OPER d[i]) OPER d[i+1];}/* Finish any remaining elements */for (; i < length; i++) {x = x OPER d[i];}*dest = x;}–14–15-213, F’067.005.0010.001.50Unroll 27.005.0010.002.20Combine4Floating PointIntegerMethodEffect of Loop UnrollingEffect of Loop UnrollingHelps Integer SumHelps Integer Sum Before: 5 operations per element After: 6 operations per 2 elements z = 3 operations per elementOthers DonOthers Don’’t Improvet Improve Sequential dependencyz Each operation must wait until previous one completesx = (x OPER d[i]) OPER d[i+1];–15–15-213, F’06Loop Unrolling with ReassociationLoop Unrolling with Reassociation Could change numerical results for FPvoid unroll2aa_combine(vec_ptr v, data_t *dest){int
View Full Document