Machine-Dependent OptimizationMachine-Dependent OptimizationModern CPU DesignCPU Capabilities of Pentium IIIWhat Is a Pipeline?Instruction ControlTranslation ExampleTranslation Example #1Translation Example #2Translation Example #3Translation Example #4Visualizing OperationsVisualizing Operations (cont.)3 Iterations of Combining Product4 Iterations of Combining SumCombining Sum: Resource ConstraintsLoop UnrollingVisualizing Unrolled LoopExecuting with Loop UnrollingEffect of UnrollingDuff’s DeviceSerial ComputationParallel Loop UnrollingDual Product ComputationRequirements for Parallel ComputationVisualizing Parallel LoopExecuting with Parallel LoopParallel Unrolling: Method #2Method #2 ComputationUnderstanding ParallelismLimitations of Parallel ExecutionRegister Spilling ExampleSummary: Results for Pentium IIIResults for Pentium 4New Topic: What About Branches?Branch OutcomesBranch PredictionBranch Prediction Through LoopBranch Misprediction InvalidationBranch Misprediction RecoveryAvoiding BranchesAvoiding Branches with Bit TricksSlide 43Conditional MoveMachine-Dependent Opt. SummaryRole of ProgrammerMachine-Dependent Optimization Machine-Dependent Optimization CS 105“Tour of the Black Holes of Computing”– 2 –CS 105Machine-Dependent OptimizationMachine-Dependent OptimizationNeed to understand the architectureNeed to understand the architectureNot portableNot portableNot often neededNot often neededbut critically important when it isbut critically important when it isAlso helps in understanding modern machinesAlso helps in understanding modern machines– 3 –CS 105Modern CPU DesignModern CPU DesignExecutionExecutionFunctionalUnitsInstruction ControlInstruction ControlInteger/BranchFPAddFPMult/DivLoad StoreInstructionCacheDataCacheFetchControlInstructionDecodeAddressInstrs.OperationsPrediction OK?DataDataAddr.Addr.GeneralIntegerOperation ResultsRetirementUnitRegisterFileRegister Updates– 4 –CS 105CPU Capabilities of Pentium IIICPU Capabilities of Pentium IIIMultiple Instructions Can Execute in ParallelMultiple Instructions Can Execute in Parallel1 load1 store2 integer (one may be branch)1 FP Addition1 FP Multiplication or DivisionSome Instructions Take > 1 Cycle, But Can Be PipelinedSome Instructions Take > 1 Cycle, But Can Be PipelinedInstruction Latency Cycles/IssueLoad / Store 3 1Integer Multiply 4 1Integer Divide 36 36Double/Single FP Multiply 5 2Double/Single FP Add 3 1Double/Single FP Divide 38 38– 5 –CS 105What Is a Pipeline?What Is a Pipeline?AddAddAddAddAddAddAddAddAddAddAddResultBucket– 6 –CS 105Instruction ControlInstruction ControlGrabs instruction bytes from memoryGrabs instruction bytes from memoryBased on current PC + targets for predicted branchesHardware dynamically guesses whether branches taken/not taken and (possibly) branch targetTranslates instructions into Translates instructions into operationsoperations (unique x86 feature) (unique x86 feature)Primitive steps required to perform instructionTypical instruction requires 1–3 operationsConverts register references into Converts register references into tagstagsAbstract identifier linking destination of one operation with sources of later operationsInstruction ControlInstruction ControlInstructionCacheFetchControlInstructionDecodeAddressInstrs.OperationsRetirementUnitRegisterFile– 7 –CS 105Translation ExampleTranslation ExampleVersion of Combine4Version of Combine4Integer data, multiply operationTranslation of First IterationTranslation of First Iteration.L24: # Loop:imull (%eax,%edx,4),%ecx # t *= data[i]incl %edx # i++cmpl %esi,%edx # i:lengthjl .L24 # if < goto Loop.L24:imull (%eax,%edx,4),%ecxincl %edxcmpl %esi,%edxjl .L24load (%eax,%edx.0,4) t.1imull t.1, %ecx.0 %ecx.1incl %edx.0 %edx.1cmpl %esi, %edx.1 cc.1jl-taken cc.1– 8 –CS 105Translation Example #1Translation Example #1Split into two operations load reads from m emory to generate temporar y result t.1Multiply operation just operates on registersOperandsRegister %eax does not change in loop. Values will be retrieved from reg ister file during decodingRegister %ecx change s on every iter ation. Uniquely iden tify different versions as %ec x.0, %ecx.1, %ecx.2, …» R egister renami ng»Values passe d directly from producer to consume rs imull (%eax,%edx,4),%ecx load (%eax,%edx.0,4) t.1imull t.1, %ecx.0 %ecx.1– 9 –CS 105Translation Example #2Translation Example #2Register %edx changes on each iteration. Rename as %edx.0, %edx.1, %edx.2, …incl %edx incl %edx.0 %edx.1– 10 –CS 105Translation Example #3Translation Example #3Condition codes are treated similarlyAssign tag to define connection between producer and consumercmpl %esi,%edx cmpl %esi, %edx.1 cc.1– 11 –CS 105Translation Example #4Translation Example #4Instruction control unit determines destination of jumpPredicts whether will be taken and targetStarts fetching instruction at predicted destinationExecution unit simply checks whether or not prediction was OKIf not, it signals instruction controlInstruction control then “invalidates” any operations generated fr om misfetched instructionsBegins fetching and decoding instructi ons at corr ect targetjl .L24 jl-taken cc.1– 12 –CS 105Visualizing OperationsVisualizing OperationsOperationsOperationsVertical position denotes time at which executedCannot begin operation until operands availableHeight denotes latencyOperandsOperandsArcs shown only for operands that are passed within execution unitcc.1t.1load%ecx.1inclcmpljl%edx.0%edx.1%ecx.0imullload (%eax,%edx,4) t.1imull t.1, %ecx.0 %ecx.1incl %edx.0 %edx.1cmpl %esi, %edx.1 cc.1jl-taken cc.1Time– 13 –CS 105Visualizing Operations (cont.)Visualizing Operations (cont.)OperationsOperationsSame as before, except that add has latency of 1Timecc.1t.1%ecx.i +1inclcmpljlload%edx.0%edx.1%ecx.0addl%ecx.1loadload (%eax,%edx,4) t.1addl t.1, %ecx.0 %ecx.1incl %edx.0 %edx.1cmpl %esi, %edx.1 cc.1jl-taken cc.1– 14 –CS 105cc.1cc.2%ecx.0%edx.3t.1imull%ecx.1inclcmpljl%edx.0i=0loadt.2imull%ecx.2inclcmpljl%edx.1i=1loadcc.3t.3imull%ecx.3inclcmpljl%edx.2i=2loadCycle123456789101112131415cc.1cc.2Iteration 3Iteration 2Iteration
View Full Document