A benchmark for sparse matrix-vector multiplication
● Hormozd Gahvari and Mark Hoemmen {hormozd|mhoemmen}@eecs
● http://mhoemmen.arete.cc/Report/
● Research made possible by: NSF, Argonne National Lab, a gift from Intel, National Energy Research Scientific Computing Center, and Tyler Berry [email protected]

Topics for today:
● Sparse matrix-vector multiplication (SMVM) and the Sparsity optimization
● Preexisting SMVM benchmarks vs. ours
● Results: Performance predictors
● Test case: Desktop SIMD

Sparse matrix-vector multiplication
● Sparse vs.
dense matrix * vector
  – Dense: Can take advantage of temporal, spatial locality (BLAS level 2, 3)
  – Sparse: “Stream through” matrix one value at a time
  – Index arrays: Lose locality
● Compressed sparse row (CSR) format

Register block optimization
● Many matrices have small blocks
  – FEM matrices especially
  – 2x2, 3x3, 6x6 common
● Register blocking: Like unrolling a loop (circumvent latencies)
● Sparsity:
  – Automatic heuristic optimal block size selection

SMVM benchmarks: Three strategies
1) Actually do SMVM with test cases
2) Simpler ops “simulating” SMVM
3) Analytical / heuristic model

1) Actually do SMVM
● SparseBench: Iterative Krylov solvers
  – Tests other things besides SMVM!
● SciMark 2.0:
  – Fixed problem size
  – Uses unoptimized CSR (no reg. blocks)
● Doesn't capture potential performance with many types of matrices
● Register blocking: Large impact (will see)

2) Microbenchmarks “simulating” SMVM
● Goal: capture SMVM behavior with a simple set of operations
● STREAM http://www.streambench.org/
  – “Sustained memory bandwidth”
  – Copy, Scale, Add, Triad
  – Triad: like dense level-1 BLAS DAXPY
● Rich Vuduc's indirect indexed variants
  – Resemble sparse matrix addressing
  – Still not predictive

3) Analytical models of SMVM performance
● Account for miss rates, latencies, and bandwidths
● Sparsity: bounds as heuristic to predict best block dimensions for a machine
● Upper and lower bounds not tight, so difficult to use for performance prediction
● Sparsity's goal: optimization, not performance prediction

Our SMVM benchmark
● Do SMVM with BSR matrix: randomly scattered blocks
  – BSR format: Typically less structured matrices anyway
● “Best” block size, 1x1
  – Characterize different matrix types
  – Take advantage of potential optimizations (unlike current benchmarks), but in a general way

Dense matrix in sparse format
● Test this with optimal block size:
  – To show that fill doesn't affect performance much
  – Fill: affects locality of accesses to source vector

Data set sizing
● Size vectors to fit in largest cache, matrix out of cache
  – Tests “streaming in” of matrix
values
  – Natural scaling to machine parameters!
● “Inspiration”: SPECfp92 (small enough that manufacturers could size caches to fit all data) vs. SPECfp95 (data sizes increased)
  – Fill now machine-dependent:
● Tests show fill (locality of source vector accesses) has little effect

Results: “Best” block size
● Highest Mflop/s value for the block sizes tested, for:
  – Sparse matrix (fill chosen as above)
  – Dense matrix in sparse format (4096 x 4096)
● Compare with Mflop/s for STREAM Triad (a[i] = b[i] + s * c[i])

Rank processors according to benchmarks:
● For optimized (best block size) SMVM:
  – Peak memory bandwidth is a good predictor of the Itanium 2 / P4 / PM relationship
  – STREAM mispredicts these
● STREAM:
  – Better predicts unoptimized (1x1) SMVM
  – Peak bandwidth no longer helpful

Our benchmark: Useful performance indicator
● Comparison with results for “real-life” matrices:
  – Works well for FEM matrices
  – Not always as well for non-FEM matrices
  – More wasted space in the block data structure: directly proportional to slowdown

Comparison of Benchmark with Real Matrices
● The following two graphs show the MFLOP rate of matrices generated by our benchmark vs. matrices from the BeBOP group and a dense matrix in sparse format
● Plots compare by block size; matrix “number” is given in parentheses. Matrices 2-9 are FEM matrices.
● A comprehensive list of the BeBOP test suite matrices can be found in Vuduc et al., “Performance Optimizations and Bounds for Sparse Matrix-Vector Multiply,” 2002.

Comparison Conclusions
● Our benchmark does a good job modeling real data
● Dense matrix in sparse format looks good on the Ultra 3, but is noticeably inferior to our benchmark for large block sizes on the Itanium 2

Evaluating SIMD instructions
● SMVM benchmark:
  – Tool to evaluate architectural features
  – e.g.: Desktop SIMD floating-point
● SSE-2 ISA:
  – Pentium 4, M; AMD Opteron
  – Parallel ops on 2 floating-point doubles
● {ADD|MUL|DIV}PD: arithmetic
● MOVAPD: load aligned pair

Vectorizing DAXPY
● Register block: small dense matrix * vector
● Dep.
on matrix data ordering:
  – Column-major (Fortran-style):
    ● Need scalar * vector operation
  – Row-major (C-style):
    ● Need “reduce” (dot product)

Sparsity register block layout
● Row-major order within block
  – Vs. Sparse BLAS proposal (col-major)!
  – Vector reductions change associativity (results may differ from scalar version, due to roundoff)
● We chose to keep it for now
  – Can't just switch algorithm: orientation affects stride of vector loads
  – Need a good vector reduction

Vector reduce
● e.g., C. Kozyrakis' recent UC Berkeley Ph.D. thesis on multimedia vector ops
● “vhalf” instruction:
  – Copy lower half of src vector reg. --> upper half of dest.
● Iterate (vhalf, vector add) to reduce.

SSE-2 has “vhalf”!
# Sum the 2 elements
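The assembly listing for this slide is cut off in this extract. As a rough sketch of the same idea (not the authors' code), here is the two-element reduce and a row-major 2x2 register-block multiply in C with SSE-2 intrinsics; `_mm_unpackhi_pd` plays the role of the slides' hypothetical “vhalf” (it moves the upper element down rather than the lower element up, but the reduction is equivalent):

```c
#include <emmintrin.h>  /* SSE-2 intrinsics */

/* Horizontal sum of the two doubles in an SSE-2 register.
   One register copy, one shuffle, one add: the (vhalf, add) pattern. */
static double hsum2(__m128d v)
{
    __m128d hi = _mm_unpackhi_pd(v, v); /* [v1, v1] */
    __m128d s  = _mm_add_sd(v, hi);     /* low element = v0 + v1 */
    return _mm_cvtsd_f64(s);            /* extract low element */
}

/* Row-major 2x2 register block times x[0..1], accumulated into y[0..1]:
   each block row becomes a MULPD followed by the reduce above.
   Names and layout are illustrative, not the benchmark's actual code. */
static void block2x2_mult(const double a[4], const double x[2], double y[2])
{
    __m128d vx = _mm_loadu_pd(x); /* unaligned load; MOVAPD would need 16-byte alignment */
    y[0] += hsum2(_mm_mul_pd(_mm_loadu_pd(a),     vx));
    y[1] += hsum2(_mm_mul_pd(_mm_loadu_pd(a + 2), vx));
}
```

Note how the row-major layout turns each block row into one packed multiply plus a reduction, which is exactly why the deck needs a good “vhalf”-style reduce.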
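For reference, the two kernel styles the deck compares — unoptimized CSR SMVM and a register-blocked (BSR) variant — can be sketched as follows. This is a minimal illustration under our own naming (ptr/ind/val and a 2x2 block size are assumptions), not the benchmark's actual code:

```c
/* y = A*x for A in compressed sparse row (CSR) format:
   val[k] holds nonzeros, ind[k] their column indices,
   ptr[i]..ptr[i+1] the nonzero range of row i.  The matrix streams
   through once; x is accessed indirectly through ind (lost locality). */
void smvm_csr(int m, const int *ptr, const int *ind,
              const double *val, const double *x, double *y)
{
    for (int i = 0; i < m; i++) {
        double t = 0.0;
        for (int k = ptr[i]; k < ptr[i + 1]; k++)
            t += val[k] * x[ind[k]];
        y[i] = t;
    }
}

/* Same product with 2x2 register blocks (BSR, row-major blocks):
   one column index per block and a fully unrolled 2x2 multiply,
   illustrating the register-blocking optimization Sparsity applies. */
void smvm_bsr2x2(int mb, const int *bptr, const int *bind,
                 const double *bval, const double *x, double *y)
{
    for (int i = 0; i < mb; i++) {
        double y0 = 0.0, y1 = 0.0;
        for (int k = bptr[i]; k < bptr[i + 1]; k++) {
            const double *b = bval + 4 * k;      /* 2x2 block, row-major */
            double x0 = x[2 * bind[k]];
            double x1 = x[2 * bind[k] + 1];
            y0 += b[0] * x0 + b[1] * x1;
            y1 += b[2] * x0 + b[3] * x1;
        }
        y[2 * i]     = y0;
        y[2 * i + 1] = y1;
    }
}
```

The blocked loop keeps x0, x1, y0, y1 in registers and amortizes one column index over four values — the unroll-like effect the slides credit for register blocking's large impact.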