U of I CS 433 - Vector Computation

CS433G: Computer System Organization
Luddy Harrison
Vector Computation
TigerSHARC Examples
Vectorization

What is Vector Computation?
- Vector computation is a simple form of SIMD computation
    - Single Instruction Multiple Data
- The data are packed into vectors
    - Tuples of a primitive data type
- The ALU operates directly on these tuples
- We will write e.g. 4x16 for such a type
    - The underlying primitive type is taken from context

Simplest Case: Vector / Vector
      A     B     C     D
    + E     F     G     H
    = A+E   B+F   C+G   D+H
Here the output type is the same as the two input types.
TigerSHARC: XSR1:0 = R4:3 + R7:6 (S)

Vector / Vector With Different Result Type
      A     B     C     D
    × E     F     G     H
    = A×E   B×F   C×G   D×H
4x16 × 4x16 → 4x32
YR3:0 = R7:6 * R5:4
Note quad output register YR3:0

Variation: Vector / Scalar
      A     B     C     D
    + X     X     X     X
    = A+X   B+X   C+X   D+X
Scalar: X
Vector: <A,B,C,D>

Variation: Reduction
    A + B + C + D  =  A+B+C+D

TigerSHARC Reduction

Sum Reduction Using PR Regs
    PR0 += SUM SR5:4

XY Parallelism in TigerSHARC
- We can (sort of) look at this as an additional “×2” vector capacity
- R3:0 = R5:4 * R7:6
- This does 4 multiply / adds on X and 4 on Y (8 total)
- As if < X5:4, Y5:4 > were one vector

XY Operations on TS
[Figure: blocks labeled X and Y]

XY Parallelism
- This view however is difficult to maintain in light of loads and stores
- XR5:4 = [ J2 ] ; YR5:4 = [ K2 ]
- If K2 = J2 + 2 then this loads the 4-word vector into <XR5:4, YR5:4>

Using XY to Implement Vectors
[Figure: pointers labeled J2 and K2]
What are the problems here (there are two of them)?

Vector Computation and Memory
- The structure of in-register vector operations mirrors closely the structure of the in-memory storage of the input and output vectors
- This is primarily what makes hand or automatic vectorization difficult
- The XY feature of TigerSHARC is easiest to use on separate vector operations
- A = B + C (vector op) || D = E + F (vector op)

Automatic Vectorization
- Loop Unrolling
    - To provide more than one vector element per iteration
- Alignment
    - To satisfy load / store alignment restrictions
- Vector register allocation
    - To map vectors into the register set
- This is a very incomplete list, but it is something like a minimum requirement

Loop Unrolling
    for (i=0; i<100; ++i)
    {
        a[i] = b[i] + c[i];
    }

    for (i=0; i<100; i += 4)
    {
        a[i+0] = b[i+0] + c[i+0];
        a[i+1] = b[i+1] + c[i+1];
        a[i+2] = b[i+2] + c[i+2];
        a[i+3] = b[i+3] + c[i+3];
    }

Alignment: Analysis
    void f(int *a, int *b, int *c)
    {
        for (int i=0; i<100; ++i)
        {
            a[i] = b[i] + c[i];
        }
    }

    void f(int *a, int *b, int *c)
    {
        for (int i=0; i<100; i += 4)
        {
            a[i+0] = b[i+0] + c[i+0];
            a[i+1] = b[i+1] + c[i+1];
            a[i+2] = b[i+2] + c[i+2];
            a[i+3] = b[i+3] + c[i+3];
        }
    }
Can we load c[i+3] : c[i+0] using a quad load?

Alignment
    void f(int *a, int *b, int *c)
    {
        for (int i=0; i<100; ++i)
        {
            a[i] = b[i-1] + c[i];
        }
    }
Why is this example difficult?
What will happen (concerning alignment) when we unroll?
How can we fix this?

Vector Register Allocation
    V1 = V2 + V3
    V4 = V5 + V6    (Vector ADD)
    V7 = V2 * V3
    V8 = V5 * V9    (Vector MUL)
Should we allocate <V3, V6> to a vector reg? Or <V3, V9>? If V6 and V9 are simultaneously alive there is a problem!

Vector Register Allocation
    V1 = V2 + V3
    V4 = V5 + V6    (Vector ADD)
    V7 = V2 * V6
    V8 = V5 * V3    (Vector MUL)
Should V3 go into the HIGH half (odd-numbered) register for the ADD, or the LOW half (even-numbered) register for the
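
Returning to the vector / vector, vector / scalar, reduction, and loop-unrolling slides: these shapes can be modeled in portable C. What follows is only a sketch under stated assumptions, not TigerSHARC code. The names LANES, vadd4, vadd4_scalar, vsum4, is_quad_aligned, and add_arrays are hypothetical and do not come from the slides. The lanes are plain int values rather than packed 16-bit elements. The cleanup loop for trip counts that are not a multiple of 4 is an addition (the slides use 100, which needs none). The alignment test assumes that a quad load needs an address that is a multiple of 4 words, which the slides do not state.

```c
#include <stddef.h>
#include <stdint.h>

enum { LANES = 4 };  /* assumed vector width, matching the 4x unrolling above */

/* Vector / vector: every result lane is the sum of the matching input lanes. */
static void vadd4(int *dst, const int *x, const int *y)
{
    for (int k = 0; k < LANES; ++k)
        dst[k] = x[k] + y[k];
}

/* Vector / scalar: the scalar s is applied to every lane of x. */
static void vadd4_scalar(int *dst, const int *x, int s)
{
    for (int k = 0; k < LANES; ++k)
        dst[k] = x[k] + s;
}

/* Reduction: all lanes of x collapse into one scalar sum. */
static int vsum4(const int *x)
{
    int sum = 0;
    for (int k = 0; k < LANES; ++k)
        sum += x[k];
    return sum;
}

/* Assumed alignment rule: a quad access must start on a boundary of
   LANES * sizeof(int) bytes. Returns 1 if p meets that rule, else 0. */
static int is_quad_aligned(const int *p)
{
    return ((uintptr_t)p % (LANES * sizeof(int))) == 0;
}

/* a[i] = b[i] + c[i] for 0 <= i < n, taken LANES elements at a time.
   The second loop handles leftovers when n is not a multiple of LANES. */
static void add_arrays(int *a, const int *b, const int *c, size_t n)
{
    size_t i = 0;
    for (; i + LANES <= n; i += LANES)
        vadd4(&a[i], &b[i], &c[i]);
    for (; i < n; ++i)
        a[i] = b[i] + c[i];
}
```

With this model, a pointer such as &b[i-1] from the last alignment example can never pass is_quad_aligned at the same time as &c[i] when b and c themselves start on a quad boundary, which is one way to see why that loop is harder to vectorize than the one before it.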

