Code Optimization II Dec. 2, 2008Getting High PerformanceMotivate: Compute FactorialsFaster Versions (1)Faster Versions (2)Faster Versions (3)Modern CPU DesignPentium IV Nocona CPUCPUs: Nocona vs. Core 2Instruction ControlTranslating into OperationsTraditional View of Instruction ExecutionDataflow View of Instruction ExecutionExample ComputationCycles Per Elementx86-64 Compilation of Combine4Serial ComputationLoop UnrollingEffect of Loop UnrollingLoop Unrolling with ReassociationEffect of ReassociationReassociated ComputationLoop Unrolling with Separate Accum.Slide 24Separate Accum. ComputationUnrolling & AccumulatingUnrolling & Accumulating: Intel FP *Unrolling & Accumulating: Intel FP +Unrolling & Accumulating: Intel Int *Unrolling & Accumulating: Intel Int +Nocona vs. Core 2 FP *Nocona vs. Core 2 Int *Intel vs. AMD FP *Intel vs. AMD Int *Intel vs. AMD Int +Can We Go Faster?Programming with SSE3Scalar & SIMD OperationsGetting GCC to Use SIMD OperationsImplementing CombineGetting StartedSIMD LoopCompletionSIMD ResultsWhat About Branches?Branch OutcomesBranch PredictionBranch Prediction Through LoopBranch Misprediction InvalidationBranch Misprediction RecoveryDetermining Misprediction PenaltyForcing ConditionalTesting MethodologyTesting OutcomesRole of ProgrammerFact RevisitedPreventing ReassociationCode Optimization IIDec. 2, 2008Code Optimization IIDec. 2, 2008TopicsTopicsMachine Dependent OptimizationsUnderstanding Processor OperationsBranches and Branch Predictionclass26.ppt15-213“The course that gives CMU its Zip!”– 2 –15-213, F’08Getting High PerformanceGetting High PerformanceDon’t Do Anything StupidDon’t Do Anything StupidWatch out for hidden algorithmic inefficienciesWrite compiler-friendly codeHelp compiler past optimization blockers: function calls & memory refs.Tune Code For MachineTune Code For MachineExploit instruction-level parallelismAvoid unpredictable branchesMake code cache friendlyCovered later in course– 3 –15-213, F’08Motivate: Compute FactorialsMotivate: Compute FactorialsMachinesMachinesIntel Pentium 4 Nocona, 3.2 GHzFish MachinesIntel Core 2, 2.7 GHzCompiler VersionsCompiler VersionsGCC 3.4.2 (current on Fish machines)GCC 4.1.2 (most recent available)int rfact(int n){ if (n <= 1)return 1; return n * rfact(n-1);}int fact(int n){ int i; int result = 1; for (i = n; i > 0; i--)result = result * i; return result;}Machine Nocona Core 2Compiler 3.4 3.4 4.1rfact15.5 6.0 3.0fact10.0 3.0 3.0Cycles Per ElementCycles Per Element– 4 –15-213, F’08Faster Versions (1)Faster Versions (1)Loop UnrollingLoop UnrollingCompute more values per iterationDoes not help hereint fact_u3a(int n){ int i; int result = 1; for (i = n; i >= 3; i-=3) {result = result * i * (i-1) * (i-2); } for (; i > 0; i--)result *= i; return result;}Machine Nocona Core 2Compiler 3.4 3.4 4.1rfact15.5 6.0 3.0fact10.0 3.0 3.0fact_u3a10.0 3.0 3.0Cycles Per ElementCycles Per Element– 5 –15-213, F’08Faster Versions (2)Faster Versions (2)Loop Unrolling + ReassociationLoop Unrolling + Reassociation~3X drop for GCC 3.4No improvement for GCC 4.1Very strange!int fact_u3b(int n){ int i; int result = 1; for (i = n; i >= 3; i-=3) {result = result * (i * (i-1) * (i-2)); } for (; i > 0; i--)result *= i; return result;}Machine Nocona Core 2Compiler 3.4 3.4 4.1rfact15.5 6.0 3.0fact10.0 3.0 3.0fact_u3a10.0 3.0 3.0fact_u3b3.3 1.0 3.0Cycles Per ElementCycles Per Element– 6 –15-213, F’08Faster Versions (3)Faster Versions (3)Loop Unrolling + Multiple AccumulatorsLoop Unrolling + Multiple Accumulators~3X drop for all machinesint fact_u3c(int n){ int i; int result0 = 1; int result1 = 1; int result2 = 1; for (i = n; i >= 3; i-=3) {result0 *= i;result1 *= (i-1);result2 *= (i-2); } for (; i > 0; i--)result0 *= i; return result0 * result1 * result2;}Machine Nocona Core 2Compiler 3.4 3.4 4.1rfact15.5 6.0 3.0fact10.0 3.0 3.0fact_u3a10.0 3.0 3.0fact_u3b3.3 1.0 3.0fact_u3c3.3 1.0 1.0Cycles Per ElementCycles Per Element– 7 –15-213, F’08Modern CPU DesignModern CPU DesignExecutionExecutionFunctionalUnitsInstruction ControlInstruction ControlInteger/BranchFPAddFPMult/DivLoad StoreInstructionCacheDataCacheFetchControlInstructionDecodeAddressInstrs.OperationsPrediction OK?DataDataAddr.Addr.GeneralIntegerOperation ResultsRetirementUnitRegisterFileRegister Updates– 8 –15-213, F’08Pentium IV Nocona CPUPentium IV Nocona CPUMultiple Instructions Can Execute in ParallelMultiple Instructions Can Execute in Parallel1 load, with address computation1 store, with address computation2 simple integer (one may be branch)1 complex integer (multiply/divide)1 FP/SSE3 unit1 FP move (does all conversions)Some Instructions Take > 1 Cycle, but Can be PipelinedSome Instructions Take > 1 Cycle, but Can be PipelinedInstruction LatencyCycles/IssueLoad / Store 51Integer Multiply 101Integer/Long Divide 36/10636/106Single/Double FP Multiply 72Single/Double FP Add 52Single/Double FP Divide 32/4632/46– 9 –15-213, F’08CPUs: Nocona vs. Core 2CPUs: Nocona vs. Core 2Nocona (3.2 GHz) (Saltwater fish machines)Nocona (3.2 GHz) (Saltwater fish machines)Instruction LatencyCycles/IssueLoad / Store 101Integer Multiply 101Integer/Long Divide 36/10636/106Single/Double FP Multiply 72Single/Double FP Add 52Single/Double FP Divide 32/4632/46Core 2 (2.7 GHz) (Recent Intel microprocessors)Core 2 (2.7 GHz) (Recent Intel microprocessors)Load / Store 51Integer Multiply 31Integer/Long Divide 18/5018/50Single/Double FP Multiply 4/51Single/Double FP Add 31Single/Double FP Divide 18/3218/32– 10 –15-213, F’08Instruction ControlInstruction ControlGrabs Instruction Bytes From MemoryGrabs Instruction Bytes From MemoryBased on current PC + predicted targets for predicted branchesHardware dynamically guesses whether branches taken/not taken and (possibly) branch targetTranslates Instructions Into Translates Instructions Into Operations Operations (for CISC style CPUs)(for CISC style CPUs)Primitive steps required to perform instructionTypical instruction requires 1–3 operationsConverts Register References Into Converts Register References Into TagsTagsAbstract identifier linking destination of one operation with sources of later operationsInstruction ControlInstruction
View Full Document