Notes on Homework 1
Summary of SSE intrinsics
Example: multiplying 2x2 matrices
Other Issues

02/11/2009 CS267 Lecture 7

Notes on Homework 1
• Must write SIMD code to get past 50% of peak!

Summary of SSE intrinsics
Vector data type:
• __m128d
Load and store operations:
• _mm_load_pd
• _mm_store_pd
• _mm_loadu_pd
• _mm_storeu_pd
Load and broadcast across vector:
• _mm_load1_pd
Arithmetic:
• _mm_add_pd
• _mm_mul_pd

Example: multiplying 2x2 matrices

    __m128d c1 = _mm_loadu_pd( C+0*lda );  //load unaligned block in C
    __m128d c2 = _mm_loadu_pd( C+1*lda );
    for( int i = 0; i < 2; i++ )
    {
        __m128d a  = _mm_load_pd( A+i*lda );      //load aligned i-th column of A
        __m128d b1 = _mm_load1_pd( B+i+0*lda );   //load and broadcast i-th row of B
        __m128d b2 = _mm_load1_pd( B+i+1*lda );
        c1 = _mm_add_pd( c1, _mm_mul_pd( a, b1 ) );  //rank-1 update
        c2 = _mm_add_pd( c2, _mm_mul_pd( a, b2 ) );
    }
    _mm_storeu_pd( C+0*lda, c1 );  //store unaligned block in C
    _mm_storeu_pd( C+1*lda, c2 );

Other Issues
• Checking the efficiency of the compiler's output helps
    • Use the -S option to see the generated assembly code
    • Inner loop should consist mostly of ADDPD and MULPD ops
    • ADDSD and MULSD imply scalar computations
• Consider using another compiler
    • Options are PGI, PathScale and GNU
    • I found it easier to do with the GNU compiler
• Look through Goto and van de Geijn’s
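
The 2x2 example above can be wrapped into a compilable function roughly as follows. This is only a sketch under stated assumptions: the function names mm2x2_sse and mm2x2_ref are hypothetical (not from the lecture), the matrices are taken to be column-major with leading dimension lda, the target is an x86 machine with SSE2, and A is assumed to be 16-byte aligned with an even lda so that the aligned load is legal. A scalar reference routine is included for checking the result.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: __m128d and the _mm_*_pd intrinsics */
#include <stdint.h>

/* Hypothetical helper: C += A*B for one 2x2 block.
   Column-major storage with leading dimension lda (assumption). */
static void mm2x2_sse(int lda, const double *A, const double *B, double *C)
{
    /* _mm_load_pd requires 16-byte aligned addresses:
       A itself must be aligned and lda must be even. */
    assert((uintptr_t)A % 16 == 0 && lda % 2 == 0);

    __m128d c1 = _mm_loadu_pd(C + 0 * lda);  /* column 0 of the C block */
    __m128d c2 = _mm_loadu_pd(C + 1 * lda);  /* column 1 of the C block */

    for (int i = 0; i < 2; i++) {
        __m128d a  = _mm_load_pd(A + i * lda);       /* i-th column of A   */
        __m128d b1 = _mm_load1_pd(B + i + 0 * lda);  /* broadcast B(i,0)   */
        __m128d b2 = _mm_load1_pd(B + i + 1 * lda);  /* broadcast B(i,1)   */
        c1 = _mm_add_pd(c1, _mm_mul_pd(a, b1));      /* rank-1 update      */
        c2 = _mm_add_pd(c2, _mm_mul_pd(a, b2));
    }

    _mm_storeu_pd(C + 0 * lda, c1);
    _mm_storeu_pd(C + 1 * lda, c2);
}

/* Scalar reference with the same layout, for checking the SIMD version. */
static void mm2x2_ref(int lda, const double *A, const double *B, double *C)
{
    for (int j = 0; j < 2; j++)
        for (int k = 0; k < 2; k++)
            for (int i = 0; i < 2; i++)
                C[i + j * lda] += A[i + k * lda] * B[k + j * lda];
}
```

Compiling a file containing this with the GNU compiler and the -S option (for example gcc -O2 -msse2 -S) should show ADDPD and MULPD in the body of mm2x2_sse, while the reference routine may compile to ADDSD and MULSD, depending on the compiler and optimization level.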