CS433G: Computer System OrganizationWhat is Vector Computation?Simplest Case: Vector / VectorVector / Vector With Different Result TypeVariation: Vector / ScalarVariation: ReductionTigerSHARC ReductionSum Reduction Using PR RegsXY Parallelism in TigerSHARCXY Operations on TSXY ParallelismUsing XY to Implement VectorsVector Computation and MemoryAutomatic VectorizationLoop UnrollingAlignment: AnalysisAlignmentVector Register AllocationSlide 19CS433G: Computer System OrganizationLuddy HarrisonVector ComputationTigerSHARC Examples VectorizationWhat is Vector Computation?Vector computation is a simple form of SIMD computationSingle Instruction Multiple DataThe data are packed into vectorsTuples of a primitive data typeThe ALU operates directly on these tuplesWe will write e.g. 4x16 for such a typeThe underlying primitive type is taken from contextSimplest Case: Vector / VectorA B C DE F G H+ +++A+E B+F C+G D+H= = = =Here the output type is the same as the two input types.XSR1:0 = R4:3 + R7:6 (S)TigerSHARCVector / Vector With Different Result TypeA B C DE F G HX XXXA×E B×F C×G D×H== ==4x16 × 4x16 → 4x32YR3:0 = R7:6 * R5:4Note quad output register YR3:0Variation: Vector / ScalarA B C DX X X X+ +++A+X B+X C+X D+X= = = =Scalar: XVector <A,B,C,D>Variation: ReductionA B C D+ ++A+B+C+DTigerSHARC ReductionSum Reduction Using PR RegsPR0 += SUM SR5:4XY Parallelism in TigerSHARCWe can (sort of) look at this as an additional “×2” vector capacityR3:0 = R5:4 * R7:6This does 4 multiply / adds on X and 4 on Y (8 total)As if < X5:4, Y5:4 > were one vectorXY Operations on TSXYXY ParallelismThis view however is difficult to maintain in light of loads and storesXR5:4 = [ J2 ] ; YR5:4 = [ K2 ]If K2 = J2 + 2 then this loads the 4-word vector into <XR5:4, YR5:4>Using XY to Implement VectorsJ2K2What are the problems here (there are two of them)?Vector Computation and MemoryThe structure of in-register vector operations mirrors closely the structure of the in-memory storage of the input and output vectorsThis is primarily what makes hand or automatic vectorization difficultThe XY feature of TigerSHARC is easiest to use on separate vector operationsA = B + C (vector op) || D = E + F (vector op)Automatic VectorizationLoop UnrollingTo provide more than one vector element per iterationAlignmentTo satisfy load / store alignment restrictionsVector register allocationTo map vectors into the register setThis is a very incomplete list, but it is something like a minimum requirementLoop Unrollingfor (i=0; i<100; ++i){ a[i] = b[i] + c[i];}for (i=0; i<100; i += 4){ a[i+0] = b[i+0] + c[i+0]; a[i+1] = b[i+1] + c[i+1]; a[i+2] = b[i+2] + c[i+2]; a[i+3] = b[i+3] + c[i+3];}Alignment: Analysisvoid f(int *a, int *b, int *c){ for (i=0; i<100; ++i) { a[i] = b[i] + c[i]; }}void f(int *a, int *b, int *c){ for (i=0; i<100; i += 4) { a[i+0] = b[i+0] + c[i+0]; a[i+1] = b[i+1] + c[i+1]; a[i+2] = b[i+2] + c[i+2]; a[i+3] = b[i+3] + c[i+3]; }}Can we load c[i+3] : c[i+0]using a quad load?Alignmentvoid f(int *a, int *b, int *c){ for (i=0; i<100; ++i) { a[i] = b[i-1] + c[i]; }}Why is this example difficult?What will happen (concerning alignment) when we unroll?How can we fix this?Vector Register AllocationV1 = V2 + V3V4 = V5 + V6V7 = V2 * V3V8 = V5 * V9Vector ADDVector MULShould we allocate <V3, V6> to a vector reg? Or <V3, V9>? If V6 and V9 are simultaneously alive there is a problem!Vector Register AllocationV1 = V2 + V3V4 = V5 + V6V7 = V2 * V6V8 = V5 * V3Vector ADDVector MULShould V3 go into the HIGH half (odd-numbered) register for the ADD, or the LOW half (even-numbered) register for the
View Full Document