Berkeley COMPSCI 252 - Vector Processing, Branch Prediction, Dependence Speculation - D838534

School name University of California, Berkeley

Course Compsci 252- Graduate Computer Architecture

Pages 62

Slide titles:
Lecture 8: Vector Processing, Branch Prediction, Dependence Speculation
Review
Limits to ILP
Slide 4
Upper Limit to ILP: Ideal Machine (Figure 4.38, page 319)
More Realistic HW: Branch Impact (Figure 4.40, Page 323)
More Realistic HW: Register Impact (Figure 4.44, Page 328)
More Realistic HW: Alias Impact (Figure 4.46, Page 330)
Realistic HW for '9X: Window Impact (Figure 4.48, Page 332)
Brainiac vs. Speed Demon (1993)
Problems with scalar approach to ILP extraction
Cost-performance of simple vs. OOO
Alternative Model: Vector Processing
Properties of Vector Processors
Operation & Instruction Count: RISC v. Vector Processor (from F. Quintana, U. Barcelona.)
Styles of Vector Architectures
Components of Vector Processor
“DLXV” Vector Instructions
Memory operations
DAXPY (Y = a * X + Y)
Example Vector Machines
Vector Linpack Performance (MFLOPS)
CS 252 Administrivia
Vector Surprise
Virtual Processor Vector Model
Vector Architectural State
Vector Implementation
Vector Terminology: 4 lanes, 2 vector functional units
Vector Execution Time
DLXV Start-up Time
Why startup time for each vector instruction?
Vector Load/Store Units & Memories
Vector Length
Strip Mining
Common Vector Metrics
Vector Stride
Vector Opt #1: Chaining
Example Execution of Vector Code
Vector Opt #2: Conditional Execution
Vector Opt #3: Sparse Matrices
Sparse Matrix Example
Applications
Vector for Multimedia?
MMX Instructions
Vectors and Variable Data Width
Mediaprocessing: Vectorizable? Vector Lengths?
Compiler Vectorization on Cray XMP
Vector Pitfalls
Vector Advantages
Vector Summary
Prediction: Branches, Dependencies, Data
New era in computing?
Dynamic Branch Prediction
Slide 53
Dynamic Branch Prediction (Jim Smith, 1981)
BHT Accuracy
Correlating Branches
Slide 57
Accuracy of Different Schemes (Figure 4.21, p. 272)
Re-evaluating Correlation
Need Address at Same Time as Prediction
Predicated Execution
Dynamic Branch Prediction Summary

JDK.F98 Slide 1
Lecture 8: Vector Processing, Branch Prediction, Dependence Speculation
Prof. John Kubiatowicz
Computer Science 252
Fall 1998

JDK.F98 Slide 2
Review
• Precise exceptions/Speculation: Out-of-order execution, In-order commit (reorder buffer)
• Explicit Renaming: more physical registers than needed by ISA. Uses a translation table
• Memory Disambiguation: Detecting RAW hazards that occur through the memory interface.
  – Simplistic approach: wait until addresses for all previous stores are ready before starting load.
• Superscalar and VLIW: CPI < 1 (IPC > 1)
  – Dynamic issue vs. Static issue
  – More instructions issue at same time => larger hazard penalty
  – Limitation is often the number of instructions that you can successfully fetch and decode per cycle (the “Flynn barrier”)
• SW Pipelining
  – Symbolic Loop Unrolling to get most from pipeline with little code expansion, little overhead

JDK.F98 Slide 3
Limits to ILP
• Conflicting studies of amount
  – Benchmarks (vectorized Fortran FP vs. integer C programs)
  – Hardware sophistication
  – Compiler sophistication
• How much ILP is available using existing mechanisms with increasing HW budgets?
• Do we need to invent new HW/SW mechanisms to keep on processor performance curve?
  – Intel MMX
  – Motorola AltiVec
  – Supersparc Multimedia ops, etc.

JDK.F98 Slide 4
Limits to ILP
Initial HW Model here; MIPS compilers. Assumptions for ideal/perfect machine to start:
1. Register renaming: infinite virtual registers and all WAW & WAR hazards are avoided
2. Branch prediction: perfect; no mispredictions
3. Jump prediction: all jumps perfectly predicted => machine with perfect speculation & an unbounded buffer of instructions available
4. Memory-address alias analysis: addresses are known & a store can be moved before a load provided addresses not equal
1 cycle latency for all instructions; unlimited number of instructions issued per clock cycle

JDK.F98 Slide 5
Upper Limit to ILP: Ideal Machine (Figure 4.38, page 319)
[Figure: bar chart of instruction issues per cycle (IPC) by program]
Program     Instruction issues per cycle
gcc         54.8
espresso    62.6
li          17.9
fpppp       75.2
doduc       118.7
tomcatv     150.1
Integer: 18 - 60 IPC; FP: 75 - 150 IPC

JDK.F98 Slide 6
More Realistic HW: Branch Impact (Figure 4.40, Page 323)
Change from infinite window to a window of 2000 instructions and a maximum issue of 64 instructions per clock cycle
[Figure: bar chart of instruction issues per cycle for gcc, espresso, li, fpppp, doduc, tomcatv. Series: Perfect, Selective predictor, Standard 2-bit, Static, None. Annotations: Profile, BHT (512), Pick Cor. or BHT, Perfect, No prediction]
FP: 15 - 45 IPC; Integer: 6 - 12 IPC

JDK.F98 Slide 7
More Realistic HW: Register Impact (Figure 4.44, Page 328)
Change: 2000 instr window, 64 instr issue, 8K 2 level Prediction
[Figure: bar chart of instruction issues per cycle for gcc, espresso, li, fpppp, doduc, tomcatv. Series: Infinite, 256, 128, 64, 32, None]
Integer: 5 - 15 IPC; FP: 11 - 45 IPC

JDK.F98 Slide 8
More Realistic HW: Alias Impact (Figure 4.46, Page 330)
Change: 2000 instr window, 64 instr issue, 8K 2 level Prediction, 256 renaming registers
[Figure: bar chart of instruction issues per cycle for gcc, espresso, li, fpppp, doduc, tomcatv. Series: Perfect, Global/stack Perfect, Inspection, None. Annotations: None; Global/Stack perf, heap conflicts; Perfect; Inspec. Assem.]
FP: 4 - 45 IPC (Fortran, no heap); Integer: 4 - 9 IPC

JDK.F98 Slide 9
Realistic HW for '9X: Window Impact (Figure 4.48, Page 332)
[Figure: bar chart of instruction issues per cycle for gcc, espresso, li, fpppp, doduc, tomcatv. Series: Infinite, 256, 128, 64, 32, 16, 8, 4]
Perfect disambiguation (HW), 1K Selective Prediction, 16 entry return, 64 registers, issue as many as window
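
The idealized machine of Slide 4 and the window limit of Slide 9 can be illustrated with a small dataflow scheduler. This is a minimal sketch under stated assumptions, not the methodology behind the cited figures: the function name `ideal_ipc`, the trace format, and `EXAMPLE_TRACE` are hypothetical, only register dependences are modeled (memory is ignored), and the window rule (instruction i may issue only after every instruction at position i - window or earlier has issued) is a simplification chosen for illustration.

```python
from typing import Dict, List, Optional, Tuple

# One instruction = (destination register or None, list of source registers).
Instr = Tuple[Optional[str], List[str]]

def ideal_ipc(trace: List[Instr], window: Optional[int] = None) -> float:
    """Instructions per cycle for a trace on an idealized machine.

    Assumptions (mirroring Slide 4): perfect renaming, so only true (RAW)
    dependences delay an instruction; perfect branch prediction, so the trace
    is the executed path; 1-cycle latency; unlimited issue width.
    `window=None` models an unbounded instruction window.
    """
    if not trace:
        return 0.0
    ready: Dict[str, int] = {}   # register -> cycle its newest value is available
    prefix_max: List[int] = []   # prefix_max[i] = latest issue cycle among trace[0..i]
    for i, (dest, srcs) in enumerate(trace):
        # Earliest cycle allowed by RAW dependences on the most recent producers.
        cycle = max((ready.get(r, 0) for r in srcs), default=0)
        # Hypothetical window rule: wait until older instructions have left the window.
        if window is not None and i >= window:
            cycle = max(cycle, prefix_max[i - window] + 1)
        prefix_max.append(cycle if i == 0 else max(prefix_max[-1], cycle))
        if dest is not None:
            ready[dest] = cycle + 1  # 1-cycle latency
    return len(trace) / (prefix_max[-1] + 1)

# Hypothetical 8-instruction trace; r1 is written twice (a WAW hazard that
# renaming removes, so the second write does not wait for the first).
EXAMPLE_TRACE: List[Instr] = [
    ("r1", []), ("r2", ["r1"]),
    ("r3", []), ("r4", ["r3"]),
    ("r1", []), ("r5", ["r1"]),
    ("r6", ["r2", "r4"]), ("r7", ["r5", "r6"]),
]

if __name__ == "__main__":
    for w in (None, 2, 1):
        print(f"window={w}: IPC={ideal_ipc(EXAMPLE_TRACE, w):.2f}")
```

On this toy trace the unbounded window reaches 2.0 IPC, a window of 2 reaches 1.6, and a window of 1 degenerates to in-order single issue at 1.0, which is the qualitative trend the window-size series in the figure describes.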
Window Impact (Slide 9) ranges: Integer: 6 - 12 IPC; FP: 8 - 45 IPC

JDK.F98 Slide 10
Brainiac vs. Speed Demon (1993)
• 8-scalar IBM Power-2 @ 71.5 MHz (5 stage pipe) vs. 2-scalar Alpha @ 200 MHz (7 stage pipe)
[Figure: bar chart of SPECMarks by benchmark: espresso, li, eqntott, compress, sc, gcc, spice, doduc, mdljdp2, wave5, tomcatv, ora, alvinn, ear, mdljsp2, swm256, su2cor, hydro2d, nasa, fpppp]

JDK.F98 Slide 11
Problems with scalar approach to ILP extraction
• Limits to conventional exploitation of ILP:
1) pipelined clock rate: at some point, each increase in clock rate has a corresponding CPI increase (branches, other hazards)
2) instruction fetch and decode: hard to fetch and decode more instructions per clock cycle
3) cache hit rate: some long-running (scientific) programs have very large data sets accessed with poor locality; others have continuous data streams (multimedia) and hence poor locality
4) power: out-of-order, speculative execution has serious costs in terms of power
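
The clock-rate versus CPI trade-off in item 1 can be made concrete with the standard relation execution time = instruction count × CPI / clock rate. The sketch below is only an illustration: the function names are hypothetical, and it assumes both machines execute the same number of instructions, which does not hold exactly for the two different ISAs on Slide 10. The clock rates are the ones quoted on that slide.

```python
def exec_time_seconds(instr_count: float, cpi: float, clock_hz: float) -> float:
    """Execution time = instruction count * cycles per instruction / clock rate."""
    return instr_count * cpi / clock_hz

def break_even_cpi_ratio(fast_clock_hz: float, slow_clock_hz: float) -> float:
    """How many times higher the faster-clocked machine's CPI may be before it
    loses, assuming equal instruction counts (an illustrative assumption)."""
    return fast_clock_hz / slow_clock_hz

if __name__ == "__main__":
    alpha_hz = 200e6     # "Speed Demon" clock from Slide 10
    power2_hz = 71.5e6   # "Brainiac" clock from Slide 10
    ratio = break_even_cpi_ratio(alpha_hz, power2_hz)
    print(f"Break-even CPI ratio: {ratio:.2f}")
```

Under that assumption, the 200 MHz design keeps its advantage only while its CPI stays below roughly 2.8 times that of the 71.5 MHz design, which is why a deeper pipeline that adds branch and hazard stalls can cancel out its clock-rate gain.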
