U of I CS 433 - Computer System Organization

This preview shows page 1-2-3-4-5-6 out of 19 pages.


CS433G: Computer System Organization
    What is Vector Computation?
    Simplest Case: Vector / Vector
    Vector / Vector With Different Result Type
    Variation: Vector / Scalar
    Variation: Reduction
    TigerSHARC Reduction
    Sum Reduction Using PR Regs
    XY Parallelism in TigerSHARC
    XY Operations on TS
    XY Parallelism
    Using XY to Implement Vectors
    Vector Computation and Memory
    Automatic Vectorization
    Loop Unrolling
    Alignment: Analysis
    Alignment
    Vector Register Allocation
    Slide 19

CS433G: Computer System Organization
Luddy Harrison
    Vector Computation
    TigerSHARC Examples
    Vectorization

What is Vector Computation?
    Vector computation is a simple form of SIMD computation
        Single Instruction Multiple Data
    The data are packed into vectors
        Tuples of a primitive data type
        The ALU operates directly on these tuples
    We will write e.g. 4x16 for such a type
        The underlying primitive type is taken from context

Simplest Case: Vector / Vector
          A     B     C     D
      +   E     F     G     H
      =  A+E   B+F   C+G   D+H
    Here the output type is the same as the two input types.
    TigerSHARC: XSR1:0 = R4:3 + R7:6 (S)

Vector / Vector With Different Result Type
          A     B     C     D
      ×   E     F     G     H
      =  A×E   B×F   C×G   D×H
    4x16 × 4x16 → 4x32
    YR3:0 = R7:6 * R5:4
    Note the quad output register YR3:0

Variation: Vector / Scalar
          A     B     C     D
      +   X     X     X     X
      =  A+X   B+X   C+X   D+X
    Scalar: X
    Vector: <A,B,C,D>

Variation: Reduction
          A  +  B  +  C  +  D
      =  A+B+C+D

TigerSHARC Reduction

Sum Reduction Using PR Regs
    PR0 += SUM SR5:4

XY Parallelism in TigerSHARC
    We can (sort of) look at this as an additional "×2" vector capacity
    R3:0 = R5:4 * R7:6
    This does 4 multiply / adds on X and 4 on Y (8 total)
    As if < X5:4, Y5:4 > were one vector

XY Operations on TS
    (diagram: X, Y)

XY Parallelism
    This view, however, is difficult to maintain in light of loads and stores
    XR5:4 = [ J2 ] ; YR5:4 = [ K2 ]
    If K2 = J2 + 2 then this loads the 4-word vector into <XR5:4, YR5:4>

Using XY to Implement Vectors
    (diagram: J2, K2)
    What are the problems here (there are two of them)?

Vector Computation and Memory
    The structure of in-register vector operations closely mirrors the structure of the in-memory storage of the input and output vectors
    This is primarily what makes hand or automatic vectorization difficult
    The XY feature of TigerSHARC is easiest to use on separate vector operations
        A = B + C (vector op) || D = E + F (vector op)

Automatic Vectorization
    Loop unrolling
        To provide more than one vector element per iteration
    Alignment
        To satisfy load / store alignment restrictions
    Vector register allocation
        To map vectors into the register set
    This is a very incomplete list, but it is something like a minimum requirement

Loop Unrolling
    Original loop:

        for (i=0; i<100; ++i)
        {
            a[i] = b[i] + c[i];
        }

    Unrolled by four:

        for (i=0; i<100; i += 4)
        {
            a[i+0] = b[i+0] + c[i+0];
            a[i+1] = b[i+1] + c[i+1];
            a[i+2] = b[i+2] + c[i+2];
            a[i+3] = b[i+3] + c[i+3];
        }

Alignment: Analysis
    Original function:

        void f(int *a, int *b, int *c)
        {
            int i;
            for (i=0; i<100; ++i)
            {
                a[i] = b[i] + c[i];
            }
        }

    Unrolled by four:

        void f(int *a, int *b, int *c)
        {
            int i;
            for (i=0; i<100; i += 4)
            {
                a[i+0] = b[i+0] + c[i+0];
                a[i+1] = b[i+1] + c[i+1];
                a[i+2] = b[i+2] + c[i+2];
                a[i+3] = b[i+3] + c[i+3];
            }
        }

    Can we load c[i+3] : c[i+0] using a quad load?

Alignment

        void f(int *a, int *b, int *c)
        {
            int i;
            for (i=0; i<100; ++i)
            {
                a[i] = b[i-1] + c[i];
            }
        }

    Why is this example difficult?
    What will happen (concerning alignment) when we unroll?
    How can we fix this?

Vector Register Allocation
    Vector ADD:
        V1 = V2 + V3
        V4 = V5 + V6
    Vector MUL:
        V7 = V2 * V3
        V8 = V5 * V9
    Should we allocate <V3, V6> to a vector reg? Or <V3, V9>?
    If V6 and V9 are simultaneously live there is a problem!

Vector Register Allocation
    Vector ADD:
        V1 = V2 + V3
        V4 = V5 + V6
    Vector MUL:
        V7 = V2 * V6
        V8 = V5 * V3
    Should V3 go into the HIGH half (odd-numbered) register for the ADD, or the LOW half (even-numbered) register for the

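For the Alignment slides, one possible approach (an assumption here; the slides pose the question without stating an answer) is to read b only in aligned groups of four and carry the last element of each group into the next iteration, so that the shifted operand <b[i-1], b[i], b[i+1], b[i+2]> is assembled in registers instead of being loaded from a misaligned address. The sketch below emulates this in plain C. The names can_quad_load and shifted_add are hypothetical, a quad is assumed to be four 32-bit words on a 16-byte boundary, memcpy stands in for a quad load, and the caller is assumed to guarantee that b[-1] is readable and that a does not overlap b or c.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define QUAD 4  /* elements per quad access (assumed: four 32-bit words) */

/* Hypothetical check: could p[i+3] : p[i+0] be fetched with one quad load?
 * True only when &p[i] lies on a quad (16-byte) boundary. */
static int can_quad_load(const int32_t *p, size_t i)
{
    return ((uintptr_t)(const void *)&p[i] % (QUAD * sizeof(int32_t))) == 0;
}

/* Computes a[i] = b[i-1] + c[i] for 0 <= i < n.
 * b is read only at offsets i = 0, 4, 8, ... so every group read from b
 * starts at the same offsets as the groups of a and c. */
static void shifted_add(int32_t *a, const int32_t *b, const int32_t *c, size_t n)
{
    int32_t carry = b[-1];  /* element preceding the first group */
    size_t i = 0;

    for (; i + QUAD <= n; i += QUAD) {
        int32_t cur[QUAD], sh[QUAD];

        memcpy(cur, &b[i], sizeof cur);  /* stands in for one quad load */

        /* Shift by one lane: <carry, cur0, cur1, cur2>. */
        sh[0] = carry;
        sh[1] = cur[0];
        sh[2] = cur[1];
        sh[3] = cur[2];
        carry = cur[3];  /* reused as b[i-1] of the next group */

        for (int k = 0; k < QUAD; ++k)
            a[i + k] = sh[k] + c[i + k];
    }

    /* Remainder loop for trip counts that are not a multiple of four. */
    for (; i < n; ++i) {
        a[i] = carry + c[i];
        carry = b[i];
    }
}
```

The remainder loop matters because unrolling by four only covers trip counts divisible by four; the slide's bound of 100 happens to be one.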
