
CPE 631 Lecture 24: Vector Processing
Aleksandar Milenković, milenka@ece.uah.edu
Electrical and Computer Engineering
University of Alabama in Huntsville

CPE 631 AM

Outline
- Properties of Vector Processing
- Components of a Vector Processor
- Vector Execution Time
- Real-world Problems: Vector Length and Stride
- Vector Optimizations: Chaining, Conditional Execution, Sparse Matrices

14 01 19 UAH CPE631

Why Vector Processors?
- Instruction-level parallelism (Ch. 3, 4)
- Deeper pipelines and wider superscalar machines to extract more parallelism require more register file ports, more registers, and more hazard interlock logic
- In dynamically scheduled machines, the instruction window, reorder buffer, and rename register files must grow to have enough capacity to keep relevant information about in-flight instructions
- It is difficult to build machines that support a large number of in-flight instructions => this limits the issue width and pipeline depth => this limits the amount of parallelism you can extract
- Commercial versions existed long before ILP machines

Vector Processing Definitions
- Vector: a set of scalar data items, all of the same type, stored in memory
- Vector processor: an ensemble of hardware resources, including vector registers, functional pipelines, processing elements, and register counters, for performing vector operations
- Vector processing: occurs when arithmetic or logical operations are applied to vectors
[Figure: SCALAR (1 operation), add r3, r1, r2, versus VECTOR (N operations), add.vv v3, v1, v2, applied across the vector length]

Properties of Vector Processors
1) A single vector instruction specifies lots of work
   - equivalent to executing an entire loop
   - fewer instructions to fetch and decode
2) Computation of each result in the vector is independent of the computation of other results in the same vector
   - deep pipeline without data hazards; high clock rate
3) Hardware checks for data hazards only between vector instructions (once per vector, not per vector element)
4) Memory is accessed with a known pattern
   - elements are all adjacent in memory => highly interleaved memory banks provide high bandwidth
   - access is initiated for the entire vector => high memory latency is amortised
   - no data caches are needed
5) Control hazards from the loop branches are reduced
   - nonexistent for one vector instruction

Properties of Vector Processors (cont'd)
- Vector operations: arithmetic (add, sub, mul, div), memory accesses, effective address calculations
- Multiple vector instructions can be in progress at the same time => more parallelism
- Applications that benefit
   - Large scientific and engineering applications (car crash simulations, weather forecasting)
   - Multimedia applications

Basic Vector Architectures
- Vector processor: ordinary pipelined scalar unit + vector unit
- Types of vector processors
   - Memory-memory processors: all vector operations are memory-to-memory (CDC)
   - Vector-register processors: all vector operations except load and store are among the vector registers (CRAY-1, CRAY-2, X-MP, Y-MP, NEC SX/2 and SX/3, Fujitsu)
- VMIPS: vector processor as an extension of the 5-stage MIPS processor

Components of a vector-register processor
- Vector registers: each vector register is a fixed-length bank holding a single vector
   - has at least 2 read ports and 1 write port
   - typically 8-32 vector registers, each holding 64-128 64-bit elements
   - VMIPS: 8 vector registers, each holding 64 elements (16 read ports, 8 write ports)
- Vector functional units (FUs): fully pipelined, start a new operation every clock
   - typically 4 to 8 FUs: FP add, FP mult, FP reciprocal (1/X), integer add, logical, shift; may have multiple units of the same type
   - VMIPS: 5 FUs (FP add/sub, FP mul, FP div, integer, logical)

Components of a vector-register processor (cont'd)
- Vector load-store units (LSUs): fully pipelined unit to load or store a vector; may have multiple LSUs
   - VMIPS: 1 VLSU, bandwidth is 1 word per cycle after an initial delay
- Scalar registers: single element for FP scalar or address
   - VMIPS: 32 GPRs, 32 FPRs; they are read out and latched at one input of the FUs
- Crossbar to connect FUs, LSUs, and registers
   - crossbar to connect read/write ports and FUs

VMIPS: Basic Structure
[Figure: block diagram with Main Memory, Vector Load-Store, Vector registers, FP add/subtract, FP multiply, FP divide, Integer, Logical, Scalar registers]
- 8 64-element vector registers
- 5 FUs; each unit is fully pipelined and can start a new operation on every clock cycle
- Load-store unit: fully pipelined
- Scalar registers

VMIPS Vector Instructions
Instr.    Operands    Operation              Comment
ADDV.D    V1,V2,V3    V1=V2+V3               vector + vector
ADDSV.D   V1,F0,V2    V1=F0+V2               scalar + vector
MULV.D    V1,V2,V3    V1=V2xV3               vector x vector
MULSV.D   V1,F0,V2    V1=F0xV2               scalar x vector
LV        V1,R1       V1=M[R1..R1+63]        load, stride=1
LVWS      V1,R1,R2    V1=M[R1..R1+63*R2]     load, stride=R2
LVI       V1,R1,V2    V1=M[R1+V2i], i=0..63  indirect ("gather")
CeqV.D    VM,V1,V2    VMASKi = (V1i==V2i)    compare, set mask
MTC1      VLR,R1      Vec. Len. Reg. = R1    set vector length
MFC1      VM,R1       R1 = Vec. Mask         set vector mask
See table G3 for the VMIPS vector instructions.

VMIPS Vector Instructions (cont'd)
Instr.    Operands    Operation    Comment
SUBV.D    V1,V2,V3    V1=V2-V3     vector - vector
SUBSV.D   V1,F0,V2    V1=F0-V2     scalar - vector
SUBVS.D   V1,V2,F0    V1=V2-F0     vector - scalar
DIVV.D    V1,V2,V3    V1=V2/V3     vector / vector
DIVSV.D   V1,F0,V2    V1=F0/V2     scalar / vector
DIVVS.D   V1,V2,F0    V1=V2/F0     vector / scalar
POP       R1,VM                    Count the 1s in the VM register
CVM                                Set the vector mask register to all 1s
See table G3 for the VMIPS vector instructions.

DAXPY (Y = a * X + Y): Scalar vs. Vector
Assuming vectors X, Y are length 64.

Vector code:
      L.D     F0,a         ; load scalar a
      LV      V1,Rx        ; load vector X
      MULVS   V2,V1,F0     ; vector-scalar mult.
      LV      V3,Ry        ; load vector Y
      ADDV.D  V4,V2,V3     ; add
      SV      Ry,V4        ; store the result

Scalar code:
      L.D     F0,a
      DADDIU  R4,Rx,#512   ; last address to load
loop: L.D     F2,0(Rx)     ; load X(i)
      MULT.D  F2,F0,F2     ; a*X(i)
      L.D     F4,0(Ry)     ; load Y(i)
      ADD.D   F4,F2,F4     ; a*X(i)+Y(i)
      S.D     F4,0(Ry)     ; store into Y(i)
      DADDIU  Rx,Rx,#8     ; increment index to X
      DADDIU  Ry,Ry,#8     ; increment index to Y
      DSUBU   R20,R4,Rx    ; compute bound
      BNEZ    R20,loop     ; check if done

- Operations: 578 (2+9*64) vs. 321 (1+5*64) (1.8X)
- Instructions: 578 (2+9*64) vs. 6 instructions (96X)
- Hazards: 64X fewer pipeline hazards

Vector Execution Time
- Time = f(vector length, data dependencies, structural hazards)
- Initiation rate: rate at which a FU consumes vector elements (= number of lanes; usually 1 or 2 on Cray T-90)
- Convoy: set of vector instructions that can …
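The DAXPY comparison above can be checked with a small model. The following is a minimal Python sketch, not VMIPS code: the function names (`daxpy_scalar`, `daxpy_vector`) and the constant `VLEN` are hypothetical, and each list operation merely stands in for one vector instruction. It assumes the slide's vector length of 64 and reproduces the instruction and operation counts quoted there.

```python
# Illustrative model of the DAXPY slide; all names here are hypothetical.
VLEN = 64  # vector length assumed on the slide


def daxpy_scalar(a, x, y):
    """One element per loop trip, like the scalar MIPS loop."""
    out = list(y)
    for i in range(len(x)):
        out[i] = a * x[i] + out[i]
    return out


def daxpy_vector(a, x, y):
    """Each statement stands in for one vector instruction."""
    v1 = list(x)                          # LV     V1,Rx
    v2 = [a * e for e in v1]              # MULVS  V2,V1,F0
    v3 = list(y)                          # LV     V3,Ry
    v4 = [p + q for p, q in zip(v2, v3)]  # ADDV.D V4,V2,V3
    return v4                             # SV     Ry,V4


# Counts quoted on the slide.
scalar_instructions = 2 + 9 * VLEN  # 2 setup + 9 per loop iteration
vector_operations = 1 + 5 * VLEN    # 1 scalar load + 5 vector instrs x 64 elements
vector_instructions = 6             # L.D, LV, MULVS, LV, ADDV.D, SV

x = [float(i) for i in range(VLEN)]
y = [1.0] * VLEN
same = daxpy_scalar(2.0, x, y) == daxpy_vector(2.0, x, y)
print(scalar_instructions, vector_operations, vector_instructions, same)
```

Both versions compute the same result; the difference the slide highlights is how many instructions must be fetched, decoded, and checked for hazards to get there.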

