Berkeley COMPSCI 252 - Lecture 7: Vector Processing


School: University of California, Berkeley

Course: COMPSCI 252 - Graduate Computer Architecture



Slide titles:
Lecture 7: Vector Processing
Computers in the News
Slide 3
Computer News
Review: Instruction Level Parallelism
Review
Review: Theoretical Limits to ILP? (Figure 4.48, Page 332)
Problems with conventional approach
Alternative Model: Vector Processing
Properties of Vector Processors
Operation & Instruction Count: RISC v. Vector Processor (from F. Quintana, U. Barcelona.)
Styles of Vector Architectures
Components of Vector Processor
“DLXV” Vector Instructions
Memory operations
DAXPY (Y = a * X + Y)
Example Vector Machines
Vector Linpack Performance (MFLOPS)
Vector Surprise
Virtual Processor Vector Model
Vector Architectural State
Vector Implementation
Vector Terminology: 4 lanes, 2 vector functional units
Tentative VIRAM-1 Floorplan
Vector Execution Time
DLXV Start-up Time
Why startup time for each vector instruction?
Vector Load/Store Units & Memories
Vector Length
Strip Mining
Common Vector Metrics
Vector Stride
Compiler Vectorization on Cray XMP
Vector Opt #1: Chaining
Example Execution of Vector Code
Vector Opt #2: Conditional Execution
Vector Opt #3: Sparse Matrices
Sparse Matrix Example
Challenges: Vector Example with dependency
Optimized Vector Example
Applications
Vector for Multimedia?
MMX Instructions
Vectors and Variable Data Width
Mediaprocessing: Vectorizable? Vector Lengths?
Vector Pitfalls
Vector Advantages
Vectors Are Inexpensive
MIPS R10000 vs. T0
Vectors Lower Power
Superscalar Energy Efficiency Even Worse
VLIW/Out-of-Order versus Modest Scalar+Vector
Vector Summary

JR.S00 1
Lecture 7: Vector Processing
Prepared by: Professor David A. Patterson
Edited and presented by: Prof. Jan Rabaey
Computer Science 252, Spring 2000

JR.S00 2
Computers in the News
• At ISSCC (San Francisco)
  – 1 GHz Alpha Processor (Compaq)
    » 1.5 V 0.18 micron CMOS, 7-layer Al, 65 W
  – 1 GHz Single Issue 64b PowerPC Processor (IBM)
    » 0.22 micron CMOS, 6-layer Copper interconnect
  – 1 GHz IA-32 Microprocessor
    » 0.18 micron CMOS, 6-layer Al, low-k dielectric
  – Other IBM processors
    » 760 MHz processor using multiple Vt and Copper interconnects
    » 660 MHz SOI processor with Cu interconnect
  – Memory trends: non-volatile; embedded DRAM

JR.S00 3
Computers in the News
• The “Crusoe” VLIW processor from Transmeta: TM3120 (333-400 MHz) and TM5400 (500-700 MHz)
• Targeted for mobile applications
• Supports Linux and Windows
• Emulates Intel x86 hardware in software
  – uses “code morphing”, which translates x86 instructions into VLIW instructions
• 1 W power dissipation!
  – Adjusts operating speed and voltage to match the needs of the application!

JR.S00 4
Computer News
[Figure: thermal gradients, traditional mobile processor versus Crusoe running DVD application]

JR.S00 5
Review: Instruction Level Parallelism
• High speed execution based on instruction level parallelism (ILP): potential of short instruction sequences to execute in parallel
• High-speed microprocessors exploit ILP by:
  1) pipelined execution: overlap instructions
  2) superscalar execution: issue and execute multiple instructions per clock cycle
  3) out-of-order execution (commit in-order)
• Memory accesses for high-speed microprocessor?
  – Data cache, possibly multiported, multiple levels

JR.S00 6
Review
• Speculation: out-of-order execution, in-order commit (reorder buffer + rename registers) => precise exceptions
• Software Pipelining
  – Symbolic loop unrolling (instructions from different iterations) to optimize pipeline with little code expansion, little overhead
• Superscalar and VLIW: CPI < 1 (IPC > 1)
  – Dynamic issue vs. static issue
  – More instructions issue at same time => larger hazard penalty
  – # independent instructions = # functional units × latency
• Branch Prediction
  – Branch History Table: 2 bits for loop accuracy
  – Recently executed branches correlated with next branch?
  – Branch Target Buffer: include branch address & prediction
  – Predicated execution can reduce number of branches, number of mispredicted branches

JR.S00 7
Review: Theoretical Limits to ILP? (Figure 4.48, Page 332)
[Figure: instruction issues per cycle (IPC) for the programs gcc, espresso, li, fpppp, doduc, and tomcatv at window sizes Infinite, 256, 128, 64, 32, 16, 8, and 4]
Perfect disambiguation (HW), 1K Selective Prediction, 16 entry return, 64 registers, issue as many as window
Integer: 6 - 12
FP: 8 - 45

JR.S00 8
Problems with conventional approach
• Limits to conventional exploitation of ILP:
  1) pipelined clock rate: at some point, each increase in clock rate has corresponding CPI increase (branches, other hazards)
  2) instruction fetch and decode: at some point, it's hard to fetch and decode more instructions per clock cycle
  3) cache hit rate: some long-running (scientific) programs have very large data sets accessed with poor locality; others have continuous data streams (multimedia) and hence poor locality

JR.S00 9
Alternative Model: Vector Processing
[Figure: SCALAR (1 operation): add r3, r1, r2 computes r3 = r1 + r2. VECTOR (N operations): add.vv v3, v1, v2 computes v3 = v1 + v2 across the vector length]
• Vector processors have high-level operations that work on linear arrays of numbers: "vectors"

JR.S00 10
Properties of Vector Processors
• Each result independent of previous result
  => long pipeline, compiler ensures no dependencies
  => high clock rate
• Vector instructions access memory with known pattern
  => highly interleaved memory
  => amortize memory latency over 64 elements
  => no (data) caches required! (Do use instruction cache)
• Reduces branches and branch problems in pipelines
• Single vector instruction implies lots of work (≈ loop)
  => fewer instruction fetches

JR.S00 11
Operation & Instruction Count: RISC v. Vector Processor
(from F. Quintana, U. Barcelona.)

Spec92fp     Operations (Millions)      Instructions (M)
Program      RISC   Vector   R / V      RISC   Vector   R / V
swim256       115     95     1.1x        115     0.8    142x
hydro2d        58     40     1.4x         58     0.8     71x
nasa7          69     41     1.7x         69     2.2     31x
su2cor         51     35     1.4x         51     1.8     29x
tomcatv        15     10     1.4x         15     1.3     11x
wave5          27     25     1.1x         27     7.2      4x
mdljdp2        32     52     0.6x         32    15.8      2x

Vector reduces ops by 1.2X, instructions by 20X

JR.S00 12
Styles of Vector Architectures
• memory-memory vector processors: all vector operations are memory to memory
• vector-register processors: all vector operations between vector registers (except load and store)
  – Vector equivalent of load-store architectures
  – Includes all vector machines since late 1980s: Cray, Convex, Fujitsu, Hitachi, NEC
  – We assume vector-register for rest of lectures

JR.S00 13
Components of Vector Processor
• Vector Register: fixed length bank holding a single vector
  – has at least 2 read and 1 write
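
The scalar versus vector contrast from the Alternative Model slide (add r3, r1, r2 versus add.vv v3, v1, v2) and the "fewer instruction fetches" property can be sketched in a few lines of Python. This is only an illustrative model, not the lecture's own code: the function names, the 64-element maximum vector length, and the choice to count only add instructions (ignoring loads, stores, and loop overhead) are all assumptions made for the example.

```python
# Illustrative model only: names and the 64-element limit are assumptions,
# and only add instructions are counted (no loads, stores, or branches).
MAX_VECTOR_LENGTH = 64


def scalar_add(a, b):
    """Add element by element: one scalar add instruction per element."""
    if len(a) != len(b):
        raise ValueError("operands must have the same length")
    result = []
    instructions = 0
    for x, y in zip(a, b):
        result.append(x + y)  # models: add r3, r1, r2
        instructions += 1
    return result, instructions


def vector_add(a, b, max_vl=MAX_VECTOR_LENGTH):
    """Add in chunks of up to max_vl elements: one vector add per chunk."""
    if len(a) != len(b):
        raise ValueError("operands must have the same length")
    result = []
    instructions = 0
    for start in range(0, len(a), max_vl):
        chunk_a = a[start:start + max_vl]
        chunk_b = b[start:start + max_vl]
        # models: add.vv v3, v1, v2 (N operations from one instruction)
        result.extend(x + y for x, y in zip(chunk_a, chunk_b))
        instructions += 1
    return result, instructions
```

Under these assumptions, two 200-element inputs need 200 scalar add instructions but only 4 vector add instructions, with identical results. The real ratio on a given machine depends on everything this model leaves out.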
