A benchmark for sparse matrix-vector multiplication
● Hormozd Gahvari and Mark Hoemmen {hormozd|mhoemmen}@eecs
● http://mhoemmen.arete.cc/Report/
● Research made possible by: NSF, Argonne National Lab, a gift from Intel, National Energy Research Scientific Computing Center, and Tyler Berry [email protected]

Topics for today:
● Sparse matrix-vector multiplication (SMVM) and the Sparsity optimization
● Preexisting SMVM benchmarks vs. ours
● Results: Performance predictors
● Test case: Desktop SIMD

Sparse matrix-vector multiplication
● Sparse vs.
dense matrix * vector
  – Dense: Can take advantage of temporal, spatial locality (BLAS level 2, 3)
  – Sparse: “Stream through” matrix one value at a time
  – Index arrays: Lose locality
● Compressed sparse row (CSR) format

Register block optimization
● Many matrices have small blocks
  – FEM matrices especially
  – 2x2, 3x3, 6x6 common
● Register blocking: Like unrolling a loop (circumvent latencies)
● Sparsity:
  – Automatic heuristic optimal block size selection

SMVM benchmarks: Three strategies
1) Actually do SMVM with test cases
2) Simpler ops “simulating” SMVM
3) Analytical / heuristic model

1) Actually do SMVM
● SparseBench: Iterative Krylov solvers
  – Tests other things besides SMVM!
● SciMark 2.0:
  – Fixed problem size
  – Uses unoptimized CSR (no reg. blocks)
● Doesn't capture potential performance with many types of matrices
● Register blocking: Large impact (will see)

2) Microbenchmarks “simulating” SMVM
● Goal: capture SMVM behavior with a simple set of operations
● STREAM http://www.streambench.org/
  – “Sustained memory bandwidth”
  – Copy, Scale, Add, Triad
  – Triad: like dense level-1 BLAS DAXPY
● Rich Vuduc's indirect indexed variants
  – Resemble sparse matrix addressing
  – Still not predictive

3) Analytical models of SMVM performance
● Account for miss rates, latencies, and bandwidths
● Sparsity: bounds as heuristic to predict best block dimensions for a machine
● Upper and lower bounds not tight, so difficult to use for performance prediction
● Sparsity's goal: optimization, not performance prediction

Our SMVM benchmark
● Do SMVM with BSR matrix: randomly scattered blocks
  – BSR format: Typically less structured matrices anyway
● “Best” block size, 1x1
  – Characterize different matrix types
  – Take advantage of potential optimizations (unlike current benchmarks), but in a general way

Dense matrix in sparse format
● Test this with optimal block size:
  – To show that fill doesn't affect performance much
  – Fill: affects locality of accesses to source vector

Data set sizing
● Size vectors to fit in largest cache, matrix out of cache
  – Tests “streaming in” of matrix
values
  – Natural scaling to machine parameters!
● “Inspiration”: SPECfp92 (small enough that manufacturers could size caches to fit all data) vs. SPECfp95 (data sizes increased)
  – Fill now machine-dependent:
● Tests show fill (locality of source vector accesses) has little effect

Results: “Best” block size
● Highest Mflop/s value for the block sizes tested, for:
  – Sparse matrix (fill chosen as above)
  – Dense matrix in sparse format (4096 x 4096)
● Compare with Mflop/s for STREAM Triad (a[i] = b[i] + s * c[i])

Rank processors according to benchmarks:
● For optimized (best block size) SMVM:
  – Peak memory bandwidth is a good predictor of the Itanium 2 / P4 / PM relationship
  – STREAM mispredicts these
● STREAM:
  – Better predicts unoptimized (1x1) SMVM
  – Peak bandwidth no longer helpful

Our benchmark: Useful performance indicator
● Comparison with results for “real-life” matrices:
  – Works well for FEM matrices
  – Not always as well for non-FEM matrices
  – More wasted space in the block data structure: directly proportional to slowdown

Comparison of Benchmark with Real Matrices
● The following two graphs show the MFLOP rate of matrices generated by our benchmark vs. matrices from the BeBOP group and a dense matrix in sparse format
● Plots compare by block size; matrix “number” is given in parentheses. Matrices 2-9 are FEM matrices.
● A comprehensive list of the BeBOP test suite matrices can be found in Vuduc et al., “Performance Optimizations and Bounds for Sparse Matrix-Vector Multiply,” 2002.

Comparison Conclusions
● Our benchmark does a good job modeling real data
● Dense matrix in sparse format looks good on the Ultra 3, but is noticeably inferior to our benchmark for large block sizes on the Itanium 2

Evaluating SIMD instructions
● SMVM benchmark:
  – Tool to evaluate architectural features
  – e.g.: Desktop SIMD floating-point
● SSE-2 ISA:
  – Pentium 4, M; AMD Opteron
  – Parallel ops on 2 floating-point doubles
● {ADD|MUL|DIV}PD: arithmetic
● MOVAPD: load aligned pair

Vectorizing DAXPY
● Register block: small dense matrix * vector
● Dep.
on matrix data ordering:
  – Column-major (Fortran-style):
    ● Need scalar * vector operation
  – Row-major (C-style):
    ● Need “reduce” (dot product)

Sparsity register block layout
● Row-major order within block
  – Vs. Sparse BLAS proposal (col-major)!
  – Vector reductions change associativity (results may differ from scalar version, due to roundoff)
● We chose to keep it for now
  – Can't just switch algorithm: orientation affects stride of vector loads
  – Need a good vector reduction

Vector reduce
● e.g., C. Kozyrakis' recent UC Berkeley Ph.D. thesis on multimedia vector ops
● “vhalf” instruction:
  – Copy lower half of src vector reg. --> upper half of dest.
● Iterate (vhalf, vector add) to reduce.

SSE-2 has “vhalf”!
# Sum the 2 elements
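The assembly listing for this slide is cut off in this extract. As a rough sketch of the same idea (not the authors' code), here is the two-element reduce and a row-major 2x2 register-block multiply in C with SSE-2 intrinsics; `_mm_unpackhi_pd` plays the role of the slides' hypothetical “vhalf” (it moves the upper element down rather than the lower element up, but the reduction is equivalent):

```c
#include <emmintrin.h>  /* SSE-2 intrinsics */

/* Horizontal sum of the two doubles in an SSE-2 register.
   One register copy, one shuffle, one add: the (vhalf, add) pattern. */
static double hsum2(__m128d v)
{
    __m128d hi = _mm_unpackhi_pd(v, v); /* [v1, v1] */
    __m128d s  = _mm_add_sd(v, hi);     /* low element = v0 + v1 */
    return _mm_cvtsd_f64(s);            /* extract low element */
}

/* Row-major 2x2 register block times x[0..1], accumulated into y[0..1]:
   each block row becomes a MULPD followed by the reduce above.
   Names and layout are illustrative, not the benchmark's actual code. */
static void block2x2_mult(const double a[4], const double x[2], double y[2])
{
    __m128d vx = _mm_loadu_pd(x); /* unaligned load; MOVAPD would need 16-byte alignment */
    y[0] += hsum2(_mm_mul_pd(_mm_loadu_pd(a),     vx));
    y[1] += hsum2(_mm_mul_pd(_mm_loadu_pd(a + 2), vx));
}
```

Note how the row-major layout turns each block row into one packed multiply plus a reduction, which is exactly why the deck needs a good “vhalf”-style reduce.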
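For reference, the two kernel styles the deck compares — unoptimized CSR SMVM and a register-blocked (BSR) variant — can be sketched as follows. This is a minimal illustration under our own naming (ptr/ind/val and a 2x2 block size are assumptions), not the benchmark's actual code:

```c
/* y = A*x for A in compressed sparse row (CSR) format:
   val[k] holds nonzeros, ind[k] their column indices,
   ptr[i]..ptr[i+1] the nonzero range of row i.  The matrix streams
   through once; x is accessed indirectly through ind (lost locality). */
void smvm_csr(int m, const int *ptr, const int *ind,
              const double *val, const double *x, double *y)
{
    for (int i = 0; i < m; i++) {
        double t = 0.0;
        for (int k = ptr[i]; k < ptr[i + 1]; k++)
            t += val[k] * x[ind[k]];
        y[i] = t;
    }
}

/* Same product with 2x2 register blocks (BSR, row-major blocks):
   one column index per block and a fully unrolled 2x2 multiply,
   illustrating the register-blocking optimization Sparsity applies. */
void smvm_bsr2x2(int mb, const int *bptr, const int *bind,
                 const double *bval, const double *x, double *y)
{
    for (int i = 0; i < mb; i++) {
        double y0 = 0.0, y1 = 0.0;
        for (int k = bptr[i]; k < bptr[i + 1]; k++) {
            const double *b = bval + 4 * k;      /* 2x2 block, row-major */
            double x0 = x[2 * bind[k]];
            double x1 = x[2 * bind[k] + 1];
            y0 += b[0] * x0 + b[1] * x1;
            y1 += b[2] * x0 + b[3] * x1;
        }
        y[2 * i]     = y0;
        y[2 * i + 1] = y1;
    }
}
```

The blocked loop keeps x0, x1, y0, y1 in registers and amortizes one column index over four values — the unroll-like effect the slides credit for register blocking's large impact.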