CMU CS 15213 - Code Optimization II - D368848

Home> Schools> Carnegie Mellon University> Computer Science (CS) > CS 15213> Code Optimization II

DOC PREVIEW

CMU CS 15213 - Code Optimization II

School name Carnegie Mellon University

Course Cs 15213- Introduction to Computer Systems

Pages 57

This preview shows page 1-2-3-4-26-27-28-54-55-56-57 out of 57 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 57 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 57 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 57 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 57 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 57 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 57 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 57 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 57 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 57 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 57 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 57 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 57 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

Code Optimization II Dec. 2, 2008Motivate: Compute FactorialsFaster Versions (1)Faster Versions (2)Faster Versions (3)Getting High PerformanceModern CPU DesignPentium IV Nocona CPUCPUs: Nocona vs. Core 2Instruction ControlTranslating into OperationsTraditional View of Instruction ExecutionDataflow View of Instruction ExecutionExample ComputationCycles Per Elementx86-64 Compilation of Combine4Serial ComputationLoop UnrollingEffect of Loop UnrollingLoop Unrolling with ReassociationEffect of ReassociationReassociated ComputationLoop Unrolling with Separate Accum.Effect of ReassociationSeparate Accum. ComputationUnrolling & AccumulatingUnrolling & Accumulating: Intel FP *Unrolling & Accumulating: Intel FP +Unrolling & Accumulating: Intel Int *Unrolling & Accumulating: Intel Int +Nocona vs. Core 2 FP *Nocona vs. Core 2 Int *Intel vs. AMD FP *Intel vs. AMD Int *Intel vs. AMD Int +Can We Go Faster?Programming with SSE3Scalar & SIMD OperationsGetting GCC to Use SIMD OperationsImplementing CombineGetting StartedSIMD LoopCompletionSIMD ResultsWhat About Branches?Branch OutcomesBranch PredictionBranch Prediction Through LoopBranch Misprediction InvalidationBranch Misprediction RecoveryDetermining Misprediction PenaltyForcing ConditionalTesting MethodologyTesting OutcomesRole of ProgrammerFact RevisitedPreventing ReassociationCode Optimization IIDec. 2, 2008Code Optimization IIDec. 2, 2008TopicsTopics Machine Dependent Optimizationsz Understanding Processor Operationsz Branches and Branch Predictionclass26.ppt15-213“The course that gives CMU its Zip!”–2–15-213, F’08Motivate: Compute FactorialsMotivate: Compute FactorialsMachinesMachines Intel Pentium 4 Nocona, 3.2 GHzz Fish Machines Intel Core 2, 2.7 GHzCompiler VersionsCompiler Versions GCC 3.4.2 (current on Fish machines) GCC 4.1.2 (most recent available)int rfact(int n){if (n <= 1)return 1;return n * rfact(n-1);}int fact(int n){int i;int result = 1;for (i = n; i > 0; i--)result = result * i;return result;}Machine Nocona Core 2Compiler 3.4 3.4 4.1rfact15.5 6.0 3.0fact10.0 3.0 3.0Cycles Per ElementCycles Per Element–3–15-213, F’08Faster Versions (1)Faster Versions (1)Loop UnrollingLoop Unrolling Compute more values per iteration Does not help hereint fact_u3a(int n){int i;int result = 1;for (i = n; i >= 3; i-=3) {result =result * i * (i-1) * (i-2);}for (; i > 0; i--)result *= i;return result;}Machine Nocona Core 2Compiler 3.4 3.4 4.1rfact15.5 6.0 3.0fact10.0 3.0 3.0fact_u3a10.0 3.0 3.0Cycles Per ElementCycles Per Element–4–15-213, F’08Faster Versions (2)Faster Versions (2)Loop Unrolling + ReassociationLoop Unrolling + Reassociation ~3X drop for GCC 3.4 No improvement for GCC 4.1z Very strange!int fact_u3b(int n){int i;int result = 1;for (i = n; i >= 3; i-=3) {result =result * (i * (i-1) * (i-2));}for (; i > 0; i--)result *= i;return result;}Machine Nocona Core 2Compiler 3.4 3.4 4.1rfact15.5 6.0 3.0fact10.0 3.0 3.0fact_u3a10.0 3.0 3.0fact_u3b3.3 1.03.0Cycles Per ElementCycles Per Element–5–15-213, F’08Faster Versions (3)Faster Versions (3)Loop Unrolling + Multiple AccumulatorsLoop Unrolling + Multiple Accumulators ~3X drop for all machinesint fact_u3c(int n){int i;int result0 = 1;int result1 = 1;int result2 = 1;for (i = n; i >= 3; i-=3) {result0 *= i;result1 *= (i-1);result2 *= (i-2);}for (; i > 0; i--)result0 *= i;return result0 * result1 * result2;}Machine Nocona Core 2Compiler 3.4 3.4 4.1rfact15.5 6.0 3.0fact10.0 3.0 3.0fact_u3a10.0 3.0 3.0fact_u3b3.3 1.0 3.0fact_u3c3.3 1.0 1.0Cycles Per ElementCycles Per Element–6–15-213, F’08Getting High PerformanceGetting High PerformanceDonDon’’t Do Anything Stupidt Do Anything Stupid Watch out for hidden algorithmic inefficiencies Write compiler-friendly codez Help compiler past optimization blockers: function calls & memory refs.Tune Code For MachineTune Code For Machine Exploit instruction-level parallelism Avoid unpredictable branches Make code cache friendlyz Covered later in course–7–15-213, F’08Modern CPU DesignModern CPU DesignExecutionExecutionFunctionalUnitsInstruction ControlInstruction ControlInteger/BranchFPAddFPMult/DivLoad StoreInstructionCacheDataCacheFetchControlInstructionDecodeAddressInstrs.OperationsPrediction OK?DataDataAddr.Addr.GeneralIntegerOperation ResultsRetirementUnitRegisterFileRegister Updates–8–15-213, F’08Pentium IV Nocona CPUPentium IV Nocona CPUMultiple Instructions Can Execute in ParallelMultiple Instructions Can Execute in Parallel 1 load, with address computation 1 store, with address computation 2 simple integer (one may be branch) 1 complex integer (multiply/divide) 1 FP/SSE3 unit 1 FP move (does all conversions)Some Instructions Take > 1 Cycle, but Can be PipelinedSome Instructions Take > 1 Cycle, but Can be PipelinedInstruction Latency Cycles/Issue Load / Store 5 1 Integer Multiply 10 1 Integer/Long Divide 36/106 36/106 Single/Double FP Multiply 7 2 Single/Double FP Add 5 2 Single/Double FP Divide 32/46 32/46–9–15-213, F’08CPUs: Nocona vs. Core 2CPUs: Nocona vs. Core 2Nocona (3.2 GHz) (Saltwater fish machines)Nocona (3.2 GHz) (Saltwater fish machines)Instruction Latency Cycles/Issue Load / Store 10 1 Integer Multiply 10 1 Integer/Long Divide 36/106 36/106 Single/Double FP Multiply 7 2 Single/Double FP Add 5 2 Single/Double FP Divide 32/46 32/46Core 2 (2.7 GHz) (Recent Intel microprocessors)Core 2 (2.7 GHz) (Recent Intel microprocessors) Load / Store 5 1 Integer Multiply 3 1 Integer/Long Divide 18/50 18/50 Single/Double FP Multiply 4 1 Single/Double FP Add 3 1 Single/Double FP Divide 18/32 18/32–10–15-213, F’08Instruction ControlInstruction ControlInstruction ControlGrabs Instruction Bytes From MemoryGrabs Instruction Bytes From Memory Based on current PC + predicted targets for predicted branches Hardware dynamically guesses whether branches taken/not taken and (possibly) branch targetTranslates Instructions Into Translates Instructions Into Operations Operations (for CISC style CPUs)(for CISC style CPUs) Primitive steps required to perform instruction Typical instruction requires 1–3 operationsConverts Register References Into Converts Register References Into TagsTags Abstract identifier linking destination of one operation with sources of later operationsInstruction

View Full Document

CMU CS 15213 - Code Optimization II

Sign up for free to view:

This document and 3 million+ documents and flashcards
High quality study guides, lecture notes, practice exams
Course Packets handpicked by editors offering a comprehensive review of your courses
Better Grades Guaranteed


School:
Email:
New Password:
Confirm Password:

This preview shows page 1-2-3-4-26-27-28-54-55-56-57 out of 57 pages.

CMU CS 15213 - Code Optimization II

Sign up for free to view:

Please select your school