Code Optimization II September 26, 2007Getting High PerformanceModern CPU DesignCPU Capabilities of Pentium IVInstruction ControlTranslating into OperationsTraditional View of Instruction ExecutionDataflow View of Instruction ExecutionExample ComputationCycles Per Elementx86-64 Compilation of Combine4Serial ComputationLoop UnrollingEffect of Loop UnrollingLoop Unrolling with ReassociationEffect of ReassociationReassociated ComputationLoop Unrolling with Separate Accum.Slide 19Separate Accum. ComputationUnrolling & AccumulatingUnrolling & Accumulating: Intel FP *Unrolling & Accumulating: Intel FP +Unrolling & Accumulating: Intel Int *Unrolling & Accumulating: Intel Int +Intel vs. AMD FP *Intel vs. AMD Int *Intel vs. AMD Int +Can We Go Faster?Programming with SSE3Scalar & SIMD OperationsWhat About Branches?Branch OutcomesBranch PredictionBranch Prediction Through LoopBranch Misprediction InvalidationBranch Misprediction RecoveryAvoiding Branches: cmovDetermining Misprediction PenaltyForcing ConditionalTesting MethodologyTesting OutcomesRole of ProgrammerCode Optimization IISeptember 26, 2007Code Optimization IISeptember 26, 2007TopicsTopicsMachine Dependent OptimizationsUnderstanding Processor OperationsBranches and Branch Predictionclass09.ppt15-21315-213 F’07– 2 –15-213: Intro to Computer SystemsFall 2007 ©Getting High PerformanceGetting High PerformanceDon’t Do Anything StupidDon’t Do Anything StupidWatch out for hidden algorithmic inefficienciesWrite compiler-friendly codeHelp compiler past optimization blockers: function calls & memory refs.Tune Code For MachineTune Code For MachineExploit instruction-level parallelismAvoid unpredictable branchesMake code cache friendlyCovered later in course– 3 –15-213: Intro to Computer SystemsFall 2007 ©Modern CPU DesignModern CPU DesignExecutionExecutionFunctionalUnitsInstruction ControlInstruction ControlInteger/BranchFPAddFPMult/DivLoad StoreInstructionCacheDataCacheFetchControlInstructionDecodeAddressInstrs.OperationsPrediction OK?DataDataAddr.Addr.GeneralIntegerOperation ResultsRetirementUnitRegisterFileRegister Updates– 4 –15-213: Intro to Computer SystemsFall 2007 ©CPU Capabilities of Pentium IVCPU Capabilities of Pentium IVMultiple Instructions Can Execute in ParallelMultiple Instructions Can Execute in Parallel1 load, with address computation1 store, with address computation2 simple integer (one may be branch)1 complex integer (multiply/divide)1 FP/SSE3 unit1 FP move (does all conversions)Some Instructions Take > 1 Cycle, but Can be PipelinedSome Instructions Take > 1 Cycle, but Can be PipelinedInstruction LatencyCycles/IssueLoad / Store 51Integer Multiply 101Integer/Long Divide 36/10636/106Single/Double FP Multiply 72Single/Double FP Add 52Single/Double FP Divide 32/4632/46– 5 –15-213: Intro to Computer SystemsFall 2007 ©Instruction ControlInstruction ControlGrabs Instruction Bytes From MemoryGrabs Instruction Bytes From MemoryBased on current PC + predicted targets for predicted branchesHardware dynamically guesses whether branches taken/not taken and (possibly) branch targetTranslates Instructions Into Translates Instructions Into Operations Operations (for CISC style CPUs)(for CISC style CPUs)Primitive steps required to perform instructionTypical instruction requires 1–3 operationsConverts Register References Into Converts Register References Into TagsTagsAbstract identifier linking destination of one operation with sources of later operationsInstruction ControlInstruction ControlInstructionCacheFetchControlInstructionDecodeAddressInstrs.OperationsRetirementUnitRegisterFile– 6 –15-213: Intro to Computer SystemsFall 2007 ©Translating into OperationsTranslating into OperationsGoal: Each Operation Utilizes Single Functional UnitGoal: Each Operation Utilizes Single Functional UnitRequires: Load, Integer arithmetic, StoreExact form and format of o perations is trade secretOperations: split up instruction into simpler piecesDevise temporary names to describe how result of one operation gets used by other operations addq %rax, 8(%rbx,%rdx,4)load 8(%rbx,%rdx,4) temp1imull %rax, temp1 temp2store temp2, 8(%rbx,%rdx,4)– 7 –15-213: Intro to Computer SystemsFall 2007 ©Traditional View of Instruction ExecutionTraditional View of Instruction ExecutionImperative ViewImperative ViewRegisters are fixed storage locationsIndividual instructions read & write themInstructions must be executed in specified sequence to guarantee proper program behavior addq %rax, %rbx # I1 andq %rbx, %rdx # I2 mulq %rcx, %rbx # I3 xorq %rbx, %rdi # I4rax+rbx rdx rcx rdiI1I2I3I4&*^– 8 –15-213: Intro to Computer SystemsFall 2007 ©Dataflow View of Instruction ExecutionDataflow View of Instruction ExecutionFunctional ViewFunctional ViewView each write as creating new instance of valueOperations can be performed as soon as operands availableNo need to execute in original sequence addq %rax, %rbx # I1 andq %rbx, %rdx # I2 mulq %rcx, %rbx # I3 xorq %rbx, %rdi # I4rax.0+rbx.0 rdx.0 rcx.0 rdi.0I1I2/I3I4&*^rbx.1rdx.1rbx.2rdi.0– 9 –15-213: Intro to Computer SystemsFall 2007 ©Example ComputationExample ComputationData TypesData TypesUse different declarations for data_tintfloatdoublevoid combine4(vec_ptr v, data_t *dest){ int i; int length = vec_length(v); data_t *d = get_vec_start(v); data_t t = IDENT; for (i = 0; i < length; i++) t = t OP d[i]; *dest = t;}OperationsOperationsUse different definitions of OP and IDENT + / 0 * / 1– 10 –15-213: Intro to Computer SystemsFall 2007 ©Cycles Per ElementCycles Per ElementConvenient way to express performance of program that operators on vectors or listsLength = nT = CPE*n + Overhead010020030040050060070080090010000 50 100 150 200ElementsCyclesvsum1Slope = 4.0 vsum2Slope = 3.5– 11 –15-213: Intro to Computer SystemsFall 2007 ©x86-64 Compilation of Combine4x86-64 Compilation of Combine4Inner Loop (Integer Multiply)Inner Loop (Integer Multiply)PerformancePerformance5 instructions in 2 clock cyclesL33: # Loop:movl (%eax,%edx,4), %ebx # temp = d[i]incl %edx # i++imull %ebx, %ecx # x *= tempcmpl %esi, %edx # i:lengthjl L33 # if < goto LoopMethod Integer Floating PointCombine4 2.20 10.00 5.00 7.00– 12 –15-213: Intro to
View Full Document