Unformatted text preview:

Edgar GabrielCOSC 6385 Computer Architecture - Vector ProcessorsEdgar GabrielSpring 2011COSC 6385 – Computer ArchitectureEdgar GabrielVector Processors• Chapter F of the 4thedition (Chapter G of the 3rdedition)– Available in CD attached to the book – Anybody having problems to find it should contact me• Vector processors big in ‘70 and ‘80• Still used today– Vector machines: Earth Simulator, NEC SX9, Cray X1– Graphics cards– MMX, SSE, SSE2, SSE3 are to some extent ‘vector units’COSC 6385 – Computer ArchitectureEdgar GabrielMain concepts• Vector processors abstract operations on vectors, e.g. replace the following loopby• Some languages offer high-level support for these operations (e.g. Fortran90 or newer)for (i=0; i<n; i++) {a[i] = b[i] + c[i];}a = b + c; ADDV.D V10, V8, V6COSC 6385 – Computer ArchitectureEdgar GabrielMain concepts (II)• Advantages of vector instructions– A single instruction specifies a great deal of work– Since each loop iteration must not contain data dependence to other loop iterations• No need to check for data hazards between loop iterations• Only one check required between two vector instructions• Loop branches eliminatedCOSC 6385 – Computer ArchitectureEdgar GabrielBasic vector architecture• A modern vector processor contains– Regular, pipelined scalar units– Regular scalar registers– Vector units – (inventors of pipelining! )– Vector register: can hold a fixed number of entries (e.g. 64)– Vector load-store unitsCOSC 6385 – Computer ArchitectureEdgar GabrielComparison MIPS code vs. vector codeExample: Y=aX+Y for 64 elementsL.D F0, a /* load scalar a*/DADDIU R4, Rx, #512 /* last address */L: L.D F2, 0(Rx) /* load X(i) */MUL.D F2, F2, F0 /* calc. a times X(i)*/L.D F4, 0(Ry) /* load Y(i) */ADD.D F4, F4, F2 /* aX(I) + Y(i) */S.D F4, 0(Ry) /* store Y(i) */DADDIU Rx, Rx, #8 /* increment X*/DADDIU Ry, Ry, #8 /* increment Y */DSUBU R20, R4, Rx /* compute bound */BNEZ R20, LCOSC 6385 – Computer ArchitectureEdgar GabrielComparison MIPS code vs. vector code (II)Example: Y=aX+Y for 64 elementsL.D F0, a /* load scalar a*/LV V1, 0(Rx) /* load vector X */MULVS.D V2, V1, F0 /* vector scalar mult*/LV V3, 0(Ry) /* load vector Y */ADDV.D V4, V2, V3 /* vector add */SV V4, 0(Ry) /* store vector Y */COSC 6385 – Computer ArchitectureEdgar GabrielVector execution time• Convoy: set of vector instructions that could potentially begin execution in one clock cycle– A convoy must not contain structural or data hazards– Similar to VLIW– Initial assumption: a convoy must complete before another convoy can start execution• Chime: time unit to execute a convoy– Independent of the vector length– A sequence consisting of m convoys executes in m chimes– A sequence consisting of m convoys and vector length ntakes approximately mxn clock cyclesCOSC 6385 – Computer ArchitectureEdgar GabrielExampleLV V1, 0(Rx) /* load vector X */MULVS.D V2, V1, F0 /* vector scalar mult*/LV V3, 0(Ry) /* load vector Y */ADDV.D V4, V2, V3 /* vector add */SV V4, 0(Ry) /* store vector Y */• Convoys of the above code sequence:1. LV2. MULVS.D LV3. ADDV.D4. SV4 convoys 4 chimes to execute COSC 6385 – Computer ArchitectureEdgar GabrielOverhead• Start-up overhead of a pipeline: how many cycles does it take to fill the pipeline before the first result is available?Unit Start-upLoad/store 12Multiply 7Add 6Convoy Starting time First result Last resultLV0 12 11+nMULVS LV12+n 12+n+12 23+2nADDV24+2n 24+2n+6 29+3nSV30+3n 30+3n+12 41+4nCOSC 6385 – Computer ArchitectureEdgar GabrielPipelining – Metrics (I)Clocktime, time to finish one segment/sub-operationnumber of stages of the pipelinelength of the vectorStartup time in clocks, time after which the first result is available,length of the loop to achieve half of the maximum speedAssuming a simple loop such as:for (i=0; i<n; i++) {a[i] = b[i] + c[i];}cTmnS21NmS=COSC 6385 – Computer ArchitectureEdgar GabrielPipelining – Metrics (II)Number of operations per loop iterationtotal number of operations for the loop, withSpeed of the loop isFor we get optotalopnopoptotal*=)11())1((*+−=−+==nmTopnmTnoptimeopFcctotal∞→ncTopF =maxCOSC 6385 – Computer ArchitectureEdgar GabrielPipelining – Metrics (III)Because of the Definition of we now haveorand length of the loop required to achieve half of the theoretical peak performance of a pipeline is equal to the number of segments (stages) of the pipeline21NccTopFNmTop2121)11(max21==+−21121=+−Nm121−=mNCOSC 6385 – Computer ArchitectureEdgar GabrielPipelining – Metrics (IV)More general: is defined throughand leads toE.g. for you get the closer you would like to get to the maximum performance of your pipeline, the larger the iteration counter of your loop has to beαNccTopNmTopαα=+−)11(111−−=ααmN43=αmmN 3)1(343≈−=COSC 6385 – Computer ArchitectureEdgar GabrielVector length control• What happens if the length is not matching the length of the vector registers?• A vector-length register (VLR) contains the number of elements used within a vector register• Strip mining: split a large loop into loops less or equal the maximum vector length (MVL)COSC 6385 – Computer ArchitectureEdgar GabrielVector length control (II)low =0;VL = (n mod MVL);for (j=0; j < n/MVL; j++ ) {for (i=low; i < VL; i++ ) {Y(i) = a * X(i) + Y(i);}low += VL;VL = MVL;}COSC 6385 – Computer ArchitectureEdgar GabrielVector stride• Memory on vector machines typically organized in multiple banks– Allow for independent management of different memory addresses– Memory bank time an order of magnitude larger than CPU clock cycle• Example: assume 8 memory banks and 6 cycles of memory bank time to deliver a data item– Overlapping of multiple data requests by the hardwareCOSC 6385 – Computer ArchitectureEdgar GabrielCycle Bank1Bank2Bank3Bank4Bank5Bank6Bank7Bank80 X(0)1 Busy X(1)2 Busy Busy X(2)3 Busy Busy Busy X(3)4 Busy Busy Busy Busy X(4)5 Busy Busy Busy Busy Busy X(5)6 Busy Busy Busy Busy Busy X(6)7 Busy Busy Busy Busy Busy X(7)8 X(8) Busy Busy Busy Busy Busy9 Busy X(9) Busy Busy Busy Busy10 Busy Busy X(10) Busy Busy Busy11 Busy Busy Busy X(11) Busy Busy12 Busy Busy Busy Busy X(12) Busy13 Busy Busy Busy Busy Busy X(13)14 Busy Busy Busy Busy Busy X(14)COSC 6385 – Computer ArchitectureEdgar GabrielVector stride (II)• What happens if the code does not access subsequent elements


View Full Document

UH COSC 6385 - Vector Processors

Download Vector Processors
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Vector Processors and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Vector Processors 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?