Berkeley COMPSCI 252 - Lecture 9

School: University of California, Berkeley

Course: COMPSCI 252 - Graduate Computer Architecture

Pages: 52

CS252 Graduate Computer Architecture
Lecture 9
Instruction Level Parallelism: Potential? Vector Processing
September 29, 2000
Prof. John Kubiatowicz
CS252/Kubiatowicz Lec 9.1 9/29/00

Slide titles:
  Review: Instruction Level Parallelism
  Review: Trace Scheduling
  Limits to Multi-Issue Machines
  Limits to ILP: Limit Studies
  Limits to ILP: Specifications for a “perfect” machine
  Upper Limit to ILP: Ideal Machine (Figure 4.38, page 319)
  More Realistic HW: Branch Impact (Figure 4.40, Page 323)
  More Realistic HW: Register Impact (Figure 4.44, Page 328)
  More Realistic HW: Alias Impact (Figure 4.46, Page 330)
  Realistic HW for ‘9X: Window Impact (Figure 4.48, Page 332)
  Brainiac vs. Speed Demon (1993)
  Problems with scalar approach to ILP extraction
  Cost-performance of simple vs. OOO
  CS 252 Administrivia
  Architecture in practice
  Complexity of Superscalar Processors
  Alternative Model: Vector Processing
  “DLXV” Vector Instructions
  Properties of Vector Processors
  Operation & Instruction Count: RISC v. Vector Processor (from F. Quintana, U. Barcelona.)
  Styles of Vector Architectures
  Components of Vector Processor
  Common Vector Metrics
  DAXPY (Y = a * X + Y)
  Example Vector Machines
  Vector Implementation
  Vector Terminology: 4 lanes, 2 vector functional units
  Vector Execution Time
  DLXV Start-up Time
  Vector Opt #1: Chaining
  Example Execution of Vector Code
  Memory operations
  Minimum resources for Unit Stride
  Vector Stride
  Vector Opt #2: Sparse Matrices
  Sparse Matrix Example
  Vector Length
  Strip Mining
  Vector Opt #3: Conditional Execution
  Virtual Processor Vector Model: Treat like SIMD multiprocessor
  Vector Architectural State
  Applications
  Vector Processing and Power
  Vector for Multimedia?
  MMX Instructions
  Mediaprocessing: Vectorizable? Vector Lengths?
  Compiler Vectorization on Cray XMP
  Vector Pitfalls
  Vector Advantages
  Summary #1: Vector Processing
  Summary #2: ILP? Wherefore art thou?

Review: Instruction Level Parallelism
• Instruction level parallelism (ILP)
  – Potential of short instruction sequences to execute in parallel
  – Often measured by IPC (instructions per cycle) instead of CPI (cycles per instruction)
• Superscalar and VLIW: CPI < 1 (IPC > 1)
  Dynamic vs Static Issue:
  – All about increasing issue and commit bandwidth: IPC limited by the rate of inflow and exit from pipeline
  – More instructions issue at same time => larger hazard penalty
  – Limitation is often number of instructions that you can successfully fetch and decode per cycle => “Flynn barrier”
• SW Pipelining
  – Symbolic Loop Unrolling to get most from pipeline with little code expansion, little overhead
• Branches, branches, branches: How to keep feeding useful instructions to the pipeline???
  – Since 1 in 5 instructions is a branch, must predict either in software or hardware

Review: Trace Scheduling
• Parallelism across IF branches vs. LOOP branches
• Two steps:
  – Trace Selection
    » Find likely sequence of basic blocks (trace) of (statically predicted or profile predicted) long sequence of straight-line code
  – Trace Compaction
    » Squeeze trace into few VLIW instructions
    » Need bookkeeping code in case prediction is wrong
• This is a form of compiler-generated branch prediction!
  – Make “common-case” fast at expense of less common case
  – Compiler must generate “fixup” code to handle cases in which trace is not the taken branch
  – Needs extra registers: undoes bad guess by discarding

Limits to Multi-Issue Machines
• Inherent limitations of ILP
  – 1 branch in 5: How to keep a 5-way VLIW busy?
  – Latencies of units: many operations must be scheduled
  – Need approx. (Pipeline Depth x No. Functional Units) independent operations to keep all pipelines busy
  – Difficulties in building HW
• Complexity:
  – Easy: More instruction bandwidth from L1 cache
  – Easy: More execution bandwidth
    » Duplicate FUs to get parallel execution
  – Hard: Increase ports to Register File (bandwidth)
    » VLIW example needs 7 read and 3 write for Int. Reg. & 5 read and 3 write for FP reg
  – Harder: Getting useful instructions to pipeline (branch prediction)
  – Harder: Increase ports to memory (bandwidth)
  – Harder: Latency to memory
  – Decoding Superscalar and impact on clock rate, pipeline depth?

Limits to ILP: Limit Studies
• Conflicting studies of amount: 2? 1000?
  – Benchmarks (vectorized Fortran FP vs. integer C programs)
  – Hardware sophistication
  – Compiler sophistication
• How much ILP is available using existing mechanisms with increasing HW budgets?
• Do we need to invent new HW/SW mechanisms to keep on processor performance curve?
  – Intel MMX
  – Motorola AltiVec
  – Supersparc Multimedia ops, etc.
  – Reinvent vector processing (IRAM)
  – Something else? Neural nets? Reconfigurable logic?

Limits to ILP: Specifications for a “perfect” machine
Assumptions for ideal/perfect machine to start:
1. Branch prediction – perfect; no mispredictions
2. Register renaming – infinite virtual registers and all WAW & WAR hazards are avoided
3. Memory-address alias analysis – addresses are known in advance & a store can be moved before a load provided addresses not equal
4. Window size – machine with perfect speculation & an unbounded buffer of instructions available
1 cycle latency for all instructions; MIPS compilers; unlimited number of instructions issued per cycle

Upper Limit to ILP: Ideal Machine (Figure 4.38, page 319)
[Figure: bar chart of instruction issues per cycle (IPC) by program]
  gcc       54.8
  espresso  62.6
  li        17.9
  fpppp     75.2
  doduc     118.7
  tomcatv   150.1
Integer: 18 - 60
FP: 75 - 150

More Realistic HW: Branch Impact (Figure 4.40, Page 323)
Change from infinite window to 2000 and maximum issue of 64 instructions per clock cycle
[Figure: bar chart of instruction issues per cycle (axis 0 to 60) for gcc, espresso, li, fpppp, doduc, tomcatv. Series: Perfect; Selective predictor; Standard 2-bit; Static; None. Callouts: Profile; BHT (512); Pick Cor. or BHT; Perfect; No prediction. Bar values not recoverable from the extracted text.]
FP: 15 - 45
Integer: 6 - 12

More Realistic HW: Register Impact (Figure 4.44, Page 328)
Change: 2000 instr window, 64 instr issue, 8K 2 level prediction
[Figure: bar chart of instruction issues per cycle (axis 0 to 60) for gcc, espresso, li, fpppp, doduc, tomcatv. Series (number of registers): Infinite; 256; 128; 64; 32; None. Bar values not recoverable from the extracted text.]
Integer: 5 - 15
FP: 11 - 45
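
The "perfect machine" assumptions above can be made concrete with a small sketch. The code below is an illustration only, not the methodology behind the textbook figures: it assumes a trace of register-to-register instructions given as (destination, sources) pairs, ignores memory dependences entirely, and uses a hypothetical function name `ideal_ipc`. With perfect prediction, unlimited renaming, an unbounded window, unlimited issue and 1-cycle latency, the only remaining constraint is true read-after-write dependence, so the IPC bound is the instruction count divided by the length of the longest dependence chain.

```python
from typing import List, Optional, Tuple

# Hypothetical trace format: (destination register or None, source registers).
Instr = Tuple[Optional[str], List[str]]


def ideal_ipc(trace: List[Instr]) -> float:
    """Upper-bound IPC of a trace on an idealized machine.

    Only true (read-after-write) register dependences delay an instruction.
    WAW and WAR hazards vanish because each write is treated as a fresh name.
    Memory dependences are ignored, which is a simplifying assumption.
    """
    ready = {}      # register -> first cycle in which its latest value can be read
    last_cycle = 0  # total cycles needed so far
    for dest, srcs in trace:
        # Issue in the first cycle where every source value is available.
        issue = max((ready.get(r, 0) for r in srcs), default=0)
        done = issue + 1  # 1-cycle latency for every instruction
        if dest is not None:
            ready[dest] = done
        last_cycle = max(last_cycle, done)
    return len(trace) / last_cycle if last_cycle else 0.0


# Six instructions whose longest dependence chain is r1 -> r3 -> r4 -> r6.
trace = [
    ("r1", []),
    ("r2", []),
    ("r3", ["r1", "r2"]),
    ("r4", ["r3"]),
    ("r5", ["r1"]),
    ("r6", ["r5", "r4"]),
]
print(ideal_ipc(trace))  # 1.5 (6 instructions in 4 cycles)
```

Restricting the window, the issue width or the number of rename registers, as the "More Realistic HW" slides do, would add further constraints on the issue cycle and can only lower this bound.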

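The "Standard 2-bit" series in the branch-impact figure refers to a table of 2-bit saturating counters. A minimal sketch of such a predictor follows, under stated assumptions: the table size of 512 mirrors the "BHT (512)" callout, but the indexing by `pc % table_size`, the initial counter state, and the function name `two_bit_accuracy` are choices made here for illustration and are not taken from the slides.

```python
def two_bit_accuracy(outcomes, table_size=512):
    """Fraction of branches predicted correctly by 2-bit saturating counters.

    outcomes: iterable of (branch_pc, taken) pairs in execution order,
              where taken is a bool.
    table_size: number of counters (512 matches the slide's BHT callout;
                the modulo indexing below is an assumption).
    """
    counters = [1] * table_size  # assumed start state: weakly not taken
    correct = 0
    total = 0
    for pc, taken in outcomes:
        idx = pc % table_size
        predict_taken = counters[idx] >= 2  # states 2 and 3 predict taken
        correct += predict_taken == taken
        total += 1
        # Saturating update: toward 3 when taken, toward 0 when not taken.
        if taken:
            counters[idx] = min(3, counters[idx] + 1)
        else:
            counters[idx] = max(0, counters[idx] - 1)
    return correct / total if total else 0.0


# A loop branch at one address: taken 9 times, then not taken, repeated 10 times.
loop = [(0x40, i % 10 != 9) for i in range(100)]
print(two_bit_accuracy(loop))  # 0.89
```

After warm-up, the counter mispredicts only the loop exit in each round, which is why a 2-bit scheme does well on loop branches but still pays one misprediction per loop execution.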