U of I CS 433 - Computer System Organization - D2458733

Home> Schools> University of Illinois> Computer Science (CS) > CS 433> Computer System Organization

DOC PREVIEW

U of I CS 433 - Computer System Organization

School name University of Illinois

Course Cs 433- Computer System Organization

Pages 19

This preview shows page 1-2-3-4-5-6 out of 19 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 19 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 19 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 19 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 19 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 19 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 19 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 19 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

CS433G: Computer System OrganizationWhat is Vector Computation?Simplest Case: Vector / VectorVector / Vector With Different Result TypeVariation: Vector / ScalarVariation: ReductionTigerSHARC ReductionSum Reduction Using PR RegsXY Parallelism in TigerSHARCXY Operations on TSXY ParallelismUsing XY to Implement VectorsVector Computation and MemoryAutomatic VectorizationLoop UnrollingAlignment: AnalysisAlignmentVector Register AllocationSlide 19CS433G: Computer System OrganizationLuddy HarrisonVector ComputationTigerSHARC Examples VectorizationWhat is Vector Computation?Vector computation is a simple form of SIMD computationSingle Instruction Multiple DataThe data are packed into vectorsTuples of a primitive data typeThe ALU operates directly on these tuplesWe will write e.g. 4x16 for such a typeThe underlying primitive type is taken from contextSimplest Case: Vector / VectorA B C DE F G H+ +++A+E B+F C+G D+H= = = =Here the output type is the same as the two input types.XSR1:0 = R4:3 + R7:6 (S)TigerSHARCVector / Vector With Different Result TypeA B C DE F G HX XXXA×E B×F C×G D×H== ==4x16 × 4x16 → 4x32YR3:0 = R7:6 * R5:4Note quad output register YR3:0Variation: Vector / ScalarA B C DX X X X+ +++A+X B+X C+X D+X= = = =Scalar: XVector <A,B,C,D>Variation: ReductionA B C D+ ++A+B+C+DTigerSHARC ReductionSum Reduction Using PR RegsPR0 += SUM SR5:4XY Parallelism in TigerSHARCWe can (sort of) look at this as an additional “×2” vector capacityR3:0 = R5:4 * R7:6This does 4 multiply / adds on X and 4 on Y (8 total)As if < X5:4, Y5:4 > were one vectorXY Operations on TSXYXY ParallelismThis view however is difficult to maintain in light of loads and storesXR5:4 = [ J2 ] ; YR5:4 = [ K2 ]If K2 = J2 + 2 then this loads the 4-word vector into <XR5:4, YR5:4>Using XY to Implement VectorsJ2K2What are the problems here (there are two of them)?Vector Computation and MemoryThe structure of in-register vector operations mirrors closely the structure of the in-memory storage of the input and output vectorsThis is primarily what makes hand or automatic vectorization difficultThe XY feature of TigerSHARC is easiest to use on separate vector operationsA = B + C (vector op) || D = E + F (vector op)Automatic VectorizationLoop UnrollingTo provide more than one vector element per iterationAlignmentTo satisfy load / store alignment restrictionsVector register allocationTo map vectors into the register setThis is a very incomplete list, but it is something like a minimum requirementLoop Unrollingfor (i=0; i<100; ++i){ a[i] = b[i] + c[i];}for (i=0; i<100; i += 4){ a[i+0] = b[i+0] + c[i+0]; a[i+1] = b[i+1] + c[i+1]; a[i+2] = b[i+2] + c[i+2]; a[i+3] = b[i+3] + c[i+3];}Alignment: Analysisvoid f(int *a, int *b, int *c){ for (i=0; i<100; ++i) { a[i] = b[i] + c[i]; }}void f(int *a, int *b, int *c){ for (i=0; i<100; i += 4) { a[i+0] = b[i+0] + c[i+0]; a[i+1] = b[i+1] + c[i+1]; a[i+2] = b[i+2] + c[i+2]; a[i+3] = b[i+3] + c[i+3]; }}Can we load c[i+3] : c[i+0]using a quad load?Alignmentvoid f(int *a, int *b, int *c){ for (i=0; i<100; ++i) { a[i] = b[i-1] + c[i]; }}Why is this example difficult?What will happen (concerning alignment) when we unroll?How can we fix this?Vector Register AllocationV1 = V2 + V3V4 = V5 + V6V7 = V2 * V3V8 = V5 * V9Vector ADDVector MULShould we allocate <V3, V6> to a vector reg? Or <V3, V9>? If V6 and V9 are simultaneously alive there is a problem!Vector Register AllocationV1 = V2 + V3V4 = V5 + V6V7 = V2 * V6V8 = V5 * V3Vector ADDVector MULShould V3 go into the HIGH half (odd-numbered) register for the ADD, or the LOW half (even-numbered) register for the

View Full Document