U of I CS 433 - Vector Computation - D2407016

Home> Schools> University of Illinois> Computer Science (CS) > CS 433> Vector Computation

DOC PREVIEW

U of I CS 433 - Vector Computation

School name University of Illinois

Course Cs 433- Computer System Organization

Pages 19

This preview shows page 1-2-3-4-5-6 out of 19 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 19 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 19 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 19 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 19 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 19 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 19 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 19 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

CS433G: Computer System OrganizationWhat is Vector Computation?Simplest Case: Vector / VectorVector / Vector With Different Result TypeVariation: Vector / ScalarVariation: ReductionTigerSHARC ReductionSum Reduction Using PR RegsXY Parallelism in TigerSHARCXY Operations on TSXY ParallelismUsing XY to Implement VectorsVector Computation and MemoryAutomatic VectorizationLoop UnrollingAlignment: AnalysisAlignmentVector Register AllocationVector Register AllocationCS433G: Computer System OrganizationLuddy HarrisonVector ComputationTigerSHARC Examples VectorizationWhat is Vector Computation?z Vector computation is a simple form of SIMD computationz Single Instruction Multiple Dataz The data are packed into vectorsz Tuples of a primitive data typez The ALU operates directly on these tuplesz We will write e.g. 4x16 for such a typez The underlying primitive type is taken from contextSimplest Case: Vector / VectorA B C DE F G H++++A+E B+F C+G D+H== = =Here the output type is the same as the two input types.XSR1:0 = R4:3 + R7:6 (S)TigerSHARCVector / Vector With Different Result TypeA B C DE F G HXXXXA×E B×F C×G D×H====4x16 × 4x16 → 4x32YR3:0 = R7:6 * R5:4Note quad output register YR3:0Variation: Vector / ScalarA B C DX X X X++++A+X B+X C+X D+X== = =Scalar: XVector <A,B,C,D>Variation: ReductionA B C D+++A+B+C+DTigerSHARC ReductionSum Reduction Using PR RegsPR0 += SUM SR5:4XY Parallelism in TigerSHARCz We can (sort of) look at this as an additional “×2” vector capacityz R3:0 = R5:4 * R7:6z This does 4 multiply / adds on X and 4 on Y (8 total)z As if < X5:4, Y5:4 > were one vectorXY Operations on TSXYXY Parallelismz This view however is difficult to maintain in light of loads and storesz XR5:4 = [ J2 ] ; YR5:4 = [ K2 ]z If K2 = J2 + 2 then this loads the 4-word vector into <XR5:4, YR5:4>Using XY to Implement VectorsJ2K2What are the problems here (there are two of them)?Vector Computation and Memoryz The structure of in-register vector operations mirrors closely the structure of the in-memory storage of the input and output vectorsz This is primarily what makes hand or automatic vectorization difficultz The XY feature of TigerSHARC is easiest to use on separate vector operationsz A = B + C (vector op) || D = E + F (vector op)Automatic Vectorizationz Loop Unrollingz To provide more than one vector element per iterationz Alignmentz To satisfy load / store alignment restrictionsz Vector register allocationz To map vectors into the register setz This is a very incomplete list, but it is something like a minimum requirementLoop Unrollingfor (i=0; i<100; ++i){a[i] = b[i] + c[i];}for (i=0; i<100; i += 4){a[i+0] = b[i+0] + c[i+0];a[i+1] = b[i+1] + c[i+1];a[i+2] = b[i+2] + c[i+2];a[i+3] = b[i+3] + c[i+3];}Alignment: Analysisvoid f(int *a, int *b, int *c){for (i=0; i<100; ++i){a[i] = b[i] + c[i];}}void f(int *a, int *b, int *c){for (i=0; i<100; i += 4){a[i+0] = b[i+0] + c[i+0];a[i+1] = b[i+1] + c[i+1];a[i+2] = b[i+2] + c[i+2];a[i+3] = b[i+3] + c[i+3];}}Can we loadc[i+3] : c[i+0]using a quad load?Alignmentvoid f(int *a, int *b, int *c){for (i=0; i<100; ++i){a[i] = b[i-1] + c[i];}}Why is this example difficult?What will happen (concerning alignment) when we unroll?How can we fix this?Vector Register AllocationV1 = V2 + V3V4 = V5 + V6V7 = V2 * V3V8 = V5 * V9Vector ADDVector MULShould we allocate <V3, V6> to a vector reg? Or <V3, V9>? If V6 and V9 are simultaneously alive there is a problem!Vector Register AllocationV1 = V2 + V3V4 = V5 + V6V7 = V2 * V6V8 = V5 * V3Vector ADDVector MULShould V3 go into the HIGH half (odd-numbered) register for the ADD, or the LOW half (even-numbered) register for the

View Full Document


School:
Email:
New Password:
Confirm Password:

This preview shows page 1-2-3-4-5-6 out of 19 pages.

U of I CS 433 - Vector Computation

Sign up for free to view:

Please select your school