Berkeley COMPSCI 252 - Lec 20 - Vector

CS252 Graduate Computer Architecture
Lecture 20: Vector Processing => Multimedia
David E. Culler (many slides due to Christoforos E. Kozyrakis)
4/9/02

Vector Processors
• Initially developed for supercomputing applications; today also important for multimedia.
• Vector processors have high-level operations that work on linear arrays of numbers: "vectors"
  SCALAR (1 operation):   add r3, r1, r2       r3 = r1 + r2
  VECTOR (N operations):  vadd.vv v3, v1, v2   v3[i] = v1[i] + v2[i], for i = 0 .. vector length - 1

Properties of Vector Processors
• A single vector instruction implies lots of work (a whole loop)
  – Fewer instruction fetches
• Each result is independent of previous results
  – Multiple operations can be executed in parallel
  – Simpler design, high clock rate
  – The compiler (or programmer) ensures there are no dependencies
• Reduces branches and branch problems in pipelines
• Vector instructions access memory with a known pattern
  – Effective prefetching
  – Memory latency is amortized over a large number of elements
  – Can exploit a high-bandwidth memory system
  – No (data) caches required!

Styles of Vector Architectures
• Memory-memory vector processors
  – All vector operations are memory to memory
• Vector-register processors
  – All vector operations are between vector registers (except vector load and store)
  – The vector equivalent of load-store architectures
  – Includes all vector machines since the late 1980s
  – We assume vector-register machines for the rest of the lecture

Historical Perspective
• Mid-60s: fear that performance would stagnate
• SIMD processor arrays actively developed during the late 60s to mid 70s
  – bit-parallel machines for image processing: PEPE, STARAN, MPP
  – word-parallel machines for scientific computing: Illiac IV
• Cray develops fast scalar machines: CDC 6600, 7600
• CDC bets on vectors with the STAR-100
• Amdahl argues against vectors

Cray-1 Breakthrough
• Fast, simple scalar processor
  – 80 MHz!
  – single-phase clock, latches
• Exquisite electrical and mechanical design
• Semiconductor memory
• Vector register concept
  – vast simplification of the instruction set
  – reduced the necessary memory bandwidth
• Tight integration of vector and scalar units
• Piggy-backed off the 7600 stacklib
• Vectorizing compilers developed later
• Owned high-performance computing for a decade
  – what happened then?
  – VLIW competition

Components of a Vector Processor
• Scalar CPU: registers, datapaths, instruction fetch logic
• Vector registers
  – Fixed-length memory bank holding a single vector
  – Typically 8-32 vector registers, each holding 1 to 8 Kbits
  – At least 2 read ports and 1 write port
  – MM (multimedia): can be viewed as an array of 64b, 32b, 16b, or 8b elements
• Vector functional units (FUs)
  – Fully pipelined; a new operation can start every clock
  – Typically 2 to 8 FUs: integer and FP
  – Multiple datapaths (pipelines) per unit to process multiple elements per cycle
• Vector load-store units (LSUs)
  – Fully pipelined unit to load or store a vector
  – Multiple elements fetched/stored per cycle
  – May have multiple LSUs
• Cross-bar to connect FUs, LSUs, and registers
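
To make the register-file and vector-length semantics above concrete, here is a minimal C sketch of a vector register file, a VL register, and an element-wise vadd.vv. The value of MVL, the register count, and all identifiers are illustrative assumptions for the sketch, not the organization of the Cray-1 or any particular machine.

    /* Minimal C model of vector-register semantics: a register file of
       fixed-length vector registers, a vector-length (VL) register, and an
       element-wise vadd.vv.  MVL, NVREG, and all names are assumptions
       made for this sketch. */
    #include <stdio.h>

    #define MVL   64                      /* maximum vector length (elements) */
    #define NVREG  8                      /* number of vector registers       */

    static double vreg[NVREG][MVL];       /* vector register file (64b elems) */
    static int    vl = MVL;               /* vector-length (VL) register      */

    /* vadd.vv vd, vs1, vs2 : vd[i] = vs1[i] + vs2[i] for i = 0 .. VL-1 */
    static void vadd_vv(int vd, int vs1, int vs2)
    {
        for (int i = 0; i < vl; i++)
            vreg[vd][i] = vreg[vs1][i] + vreg[vs2][i];
    }

    int main(void)
    {
        vl = 10;                            /* e.g. vadd.vv with VL = 10      */
        for (int i = 0; i < vl; i++) {      /* fill V1 and V2                 */
            vreg[1][i] = i;
            vreg[2][i] = 100.0 * i;
        }
        vadd_vv(3, 1, 2);                   /* V3 = V1 + V2 (one instruction) */
        printf("V3[9] = %g\n", vreg[3][9]); /* prints V3[9] = 909             */
        return 0;
    }
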
Cray-1 Block Diagram
• Simple 16-bit register-register instructions; 32-bit with immediates
• Natural combinations of scalar and vector operations
• Scalar bit-vectors match the vector length
• Gather/scatter M-R
• Conditional merge

Basic Vector Instructions
Instr.   Operands   Operation                        Comment
VADD.VV  V1,V2,V3   V1 = V2 + V3                     vector + vector
VADD.SV  V1,R0,V2   V1 = R0 + V2                     scalar + vector
VMUL.VV  V1,V2,V3   V1 = V2 x V3                     vector x vector
VMUL.SV  V1,R0,V2   V1 = R0 x V2                     scalar x vector
VLD      V1,R1      V1 = M[R1 .. R1+63]              load, stride = 1
VLDS     V1,R1,R2   V1 = M[R1 .. R1+63*R2]           load, stride = R2
VLDX     V1,R1,V2   V1[i] = M[R1+V2[i]], i = 0..63   indexed ("gather")
VST      V1,R1      M[R1 .. R1+63] = V1              store, stride = 1
VSTS     V1,R1,R2   M[R1 .. R1+63*R2] = V1           store, stride = R2
VSTX     V1,R1,V2   M[R1+V2[i]] = V1[i], i = 0..63   indexed ("scatter")
+ all the regular scalar instructions (RISC style) …

Vector Memory Operations
• Load/store operations move groups of data between registers and memory
• Three types of addressing
  – Unit stride (fastest)
  – Non-unit (constant) stride
  – Indexed (gather-scatter)
    • Vector equivalent of register-indirect addressing
    • Good for sparse arrays of data
    • Increases the number of programs that vectorize
    • A compress/expand variant also exists
• Support for various combinations of data widths in memory
  – {.L, .W, .H, .B} x {64b, 32b, 16b, 8b}

Vector Code Example
Y[0:63] = Y[0:63] + a*X[0:63]

64-element SAXPY: scalar
        LD     R0,a          # load scalar a
        ADDI   R4,Rx,#512    # last address to load
  loop: LD     R2,0(Rx)      # load X[i]
        MULTD  R2,R0,R2      # a * X[i]
        LD     R6,0(Ry)      # load Y[i]
        ADDD   R6,R2,R6      # a * X[i] + Y[i]
        SD     R6,0(Ry)      # store into Y[i]
        ADDI   Rx,Rx,#8      # increment X index
        ADDI   Ry,Ry,#8      # increment Y index
        SUB    R20,R4,Rx     # compute remaining bound
        BNZ    R20,loop      # loop until done

64-element SAXPY: vector
        LD       R0,a        # load scalar a
        VLD      V1,Rx       # load vector X
        VMUL.SV  V2,R0,V1    # vector multiply
        VLD      V3,Ry       # load vector Y
        VADD.VV  V4,V2,V3    # vector add
        VST      Ry,V4       # store vector Y

Vector Length
• A vector register can hold some maximum number of elements for each data width (the maximum vector length, or MVL)
• What to do when the application vector length is not exactly MVL?
• A vector-length (VL) register controls the length of any vector operation, including vector loads and stores
  – E.g., vadd.vv with VL = 10 performs: for (i = 0; i < 10; i++) V1[i] = V2[i] + V3[i]
• VL can be anything from 0 to MVL
• How do you code an application where the vector length is not known until run time?

Strip Mining
• Suppose the application vector length > MVL
• Strip mining
  – Generate a loop that handles MVL elements per iteration
  – A set of operations on MVL elements is translated into a single vector instruction
• Example: vector SAXPY of N elements
  – The first loop handles (N mod MVL) elements; the remaining iterations handle MVL each

  VL = (N mod MVL);           /* set VL = N mod MVL           */
  for (i = 0; i < VL; i++)    /* 1st loop is a single set of  */
      Y[i] = A*X[i] + Y[i];   /* vector instructions          */
  low = (N mod MVL);
  VL = MVL;                   /* set VL to MVL                */
  for (i = low; i < N; i++)   /* 2nd loop requires N/MVL      */
      Y[i] = A*X[i] + Y[i];   /* sets of vector instructions  */

Optimization 1: Chaining
• Suppose:
      vmul.vv  V1,V2,V3
      vadd.vv  V4,V1,V5    # RAW hazard on V1
• Chaining
  – The vector register (V1) is treated not as a single entity but as a group of individual registers
  – Pipeline forwarding can then work on individual vector elements
• Flexible chaining: allow a vector to chain to any other active vector operation => requires more read/write ports
• [Timing diagram: without chaining, the vadd starts only after the vmul completes; with chaining, the two overlap]
• The Cray X-MP introduces memory …
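
As a rough illustration of why chaining matters, the sketch below computes completion times for the vmul/vadd pair under a simple model in which each functional unit produces one element per cycle; the pipeline start-up latencies are assumed values chosen for the example, not figures from the lecture.

    /* Back-of-the-envelope timing for the vmul.vv / vadd.vv pair above,
       assuming one element per FU per cycle; the pipeline depths are
       assumptions for illustration only. */
    #include <stdio.h>

    int main(void)
    {
        int n         = 64;   /* vector length (elements)           */
        int mul_start = 7;    /* multiply pipeline depth (assumed)  */
        int add_start = 6;    /* add pipeline depth (assumed)       */

        /* Unchained: the vadd waits for the vmul to write all of V1,
           so the two operations run back to back.                   */
        int unchained = (mul_start + n) + (add_start + n);

        /* Chained: the vadd consumes each element of V1 as soon as
           the vmul produces it, so only the start-up latencies are
           serialized.                                               */
        int chained = mul_start + add_start + n;

        printf("unchained: %d cycles\n", unchained);   /* 141 cycles */
        printf("chained:   %d cycles\n", chained);     /*  77 cycles */
        return 0;
    }
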

