Harvey Mudd CS 105 - Machine-Dependent Optimization - D2430319

Home> Schools> Harvey Mudd College> (CS) > CS 105> Machine-Dependent Optimization

DOC PREVIEW

Harvey Mudd CS 105 - Machine-Dependent Optimization

School name Harvey Mudd College

Course Cs 105- Computer Systems

Pages 46

This preview shows page 1-2-3-22-23-24-44-45-46 out of 46 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 46 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 46 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 46 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 46 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 46 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 46 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 46 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 46 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 46 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 46 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

Machine-Dependent OptimizationMachine-Dependent OptimizationModern CPU DesignCPU Capabilities of Pentium IIIWhat Is a Pipeline?Instruction ControlTranslation ExampleTranslation Example #1Translation Example #2Translation Example #3Translation Example #4Visualizing OperationsVisualizing Operations (cont.)3 Iterations of Combining Product4 Iterations of Combining SumCombining Sum: Resource ConstraintsLoop UnrollingVisualizing Unrolled LoopExecuting with Loop UnrollingEffect of UnrollingDuff’s DeviceSerial ComputationParallel Loop UnrollingDual Product ComputationRequirements for Parallel ComputationVisualizing Parallel LoopExecuting with Parallel LoopParallel Unrolling: Method #2Method #2 ComputationUnderstanding ParallelismLimitations of Parallel ExecutionRegister Spilling ExampleSummary: Results for Pentium IIIResults for Pentium 4New Topic: What About Branches?Branch OutcomesBranch PredictionBranch Prediction Through LoopBranch Misprediction InvalidationBranch Misprediction RecoveryAvoiding BranchesAvoiding Branches with Bit TricksSlide 43Conditional MoveMachine-Dependent Opt. SummaryRole of ProgrammerMachine-Dependent Optimization Machine-Dependent Optimization CS 105“Tour of the Black Holes of Computing”– 2 –CS 105Machine-Dependent OptimizationMachine-Dependent OptimizationNeed to understand the architectureNeed to understand the architectureNot portableNot portableNot often neededNot often neededbut critically important when it isbut critically important when it isAlso helps in understanding modern machinesAlso helps in understanding modern machines– 3 –CS 105Modern CPU DesignModern CPU DesignExecutionExecutionFunctionalUnitsInstruction ControlInstruction ControlInteger/BranchFPAddFPMult/DivLoad StoreInstructionCacheDataCacheFetchControlInstructionDecodeAddressInstrs.OperationsPrediction OK?DataDataAddr.Addr.GeneralIntegerOperation ResultsRetirementUnitRegisterFileRegister Updates– 4 –CS 105CPU Capabilities of Pentium IIICPU Capabilities of Pentium IIIMultiple Instructions Can Execute in ParallelMultiple Instructions Can Execute in Parallel1 load1 store2 integer (one may be branch)1 FP Addition1 FP Multiplication or DivisionSome Instructions Take > 1 Cycle, But Can Be PipelinedSome Instructions Take > 1 Cycle, But Can Be PipelinedInstruction Latency Cycles/IssueLoad / Store 3 1Integer Multiply 4 1Integer Divide 36 36Double/Single FP Multiply 5 2Double/Single FP Add 3 1Double/Single FP Divide 38 38– 5 –CS 105What Is a Pipeline?What Is a Pipeline?AddAddAddAddAddAddAddAddAddAddAddResultBucket– 6 –CS 105Instruction ControlInstruction ControlGrabs instruction bytes from memoryGrabs instruction bytes from memoryBased on current PC + targets for predicted branchesHardware dynamically guesses whether branches taken/not taken and (possibly) branch targetTranslates instructions into Translates instructions into operationsoperations (unique x86 feature) (unique x86 feature)Primitive steps required to perform instructionTypical instruction requires 1–3 operationsConverts register references into Converts register references into tagstagsAbstract identifier linking destination of one operation with sources of later operationsInstruction ControlInstruction ControlInstructionCacheFetchControlInstructionDecodeAddressInstrs.OperationsRetirementUnitRegisterFile– 7 –CS 105Translation ExampleTranslation ExampleVersion of Combine4Version of Combine4Integer data, multiply operationTranslation of First IterationTranslation of First Iteration.L24: # Loop:imull (%eax,%edx,4),%ecx # t *= data[i]incl %edx # i++cmpl %esi,%edx # i:lengthjl .L24 # if < goto Loop.L24:imull (%eax,%edx,4),%ecxincl %edxcmpl %esi,%edxjl .L24load (%eax,%edx.0,4)  t.1imull t.1, %ecx.0  %ecx.1incl %edx.0  %edx.1cmpl %esi, %edx.1  cc.1jl-taken cc.1– 8 –CS 105Translation Example #1Translation Example #1Split into two operations load reads from m emory to generate temporar y result t.1Multiply operation just operates on registersOperandsRegister %eax does not change in loop. Values will be retrieved from reg ister file during decodingRegister %ecx change s on every iter ation. Uniquely iden tify different versions as %ec x.0, %ecx.1, %ecx.2, …» R egister renami ng»Values passe d directly from producer to consume rs imull (%eax,%edx,4),%ecx load (%eax,%edx.0,4)  t.1imull t.1, %ecx.0  %ecx.1– 9 –CS 105Translation Example #2Translation Example #2Register %edx changes on each iteration. Rename as %edx.0, %edx.1, %edx.2, …incl %edx incl %edx.0  %edx.1– 10 –CS 105Translation Example #3Translation Example #3Condition codes are treated similarlyAssign tag to define connection between producer and consumercmpl %esi,%edx cmpl %esi, %edx.1  cc.1– 11 –CS 105Translation Example #4Translation Example #4Instruction control unit determines destination of jumpPredicts whether will be taken and targetStarts fetching instruction at predicted destinationExecution unit simply checks whether or not prediction was OKIf not, it signals instruction controlInstruction control then “invalidates” any operations generated fr om misfetched instructionsBegins fetching and decoding instructi ons at corr ect targetjl .L24 jl-taken cc.1– 12 –CS 105Visualizing OperationsVisualizing OperationsOperationsOperationsVertical position denotes time at which executedCannot begin operation until operands availableHeight denotes latencyOperandsOperandsArcs shown only for operands that are passed within execution unitcc.1t.1load%ecx.1inclcmpljl%edx.0%edx.1%ecx.0imullload (%eax,%edx,4)  t.1imull t.1, %ecx.0  %ecx.1incl %edx.0  %edx.1cmpl %esi, %edx.1  cc.1jl-taken cc.1Time– 13 –CS 105Visualizing Operations (cont.)Visualizing Operations (cont.)OperationsOperationsSame as before, except that add has latency of 1Timecc.1t.1%ecx.i +1inclcmpljlload%edx.0%edx.1%ecx.0addl%ecx.1loadload (%eax,%edx,4)  t.1addl t.1, %ecx.0  %ecx.1incl %edx.0  %edx.1cmpl %esi, %edx.1  cc.1jl-taken cc.1– 14 –CS 105cc.1cc.2%ecx.0%edx.3t.1imull%ecx.1inclcmpljl%edx.0i=0loadt.2imull%ecx.2inclcmpljl%edx.1i=1loadcc.3t.3imull%ecx.3inclcmpljl%edx.2i=2loadCycle123456789101112131415cc.1cc.2Iteration 3Iteration 2Iteration

View Full Document