Unformatted text preview:

Sparse Matrices and Optimized Parallel Implementations Stan Tomov Innovative Computing Laboratory Computer Science Department The University of Tennessee March 25 2009 CS 594 Lecture Notes 03 25 2009 Slide 1 32 Topics Projection in Scientific Computing Sparse matrices parallel implementations PDEs Numerical solution Tools etc Iterative Methods Slide 2 32 Outline Part I Discussion Part II Sparse matrix computations Part III Reordering algorithms and parallelization Slide 3 32 Part I Discussion Slide 4 32 Orthogonalization We can orthonormalize non orthogonal basis How Other approaches QR using Householder transformation as in LAPACK Cholesky or and SVD on normal equations as in homeworks 7 and 8 Hybrid CPU GPU NVIDIA Quadro FX 5600 computation as in Homework 9 CPU computation AMD Opteron tm Processor 265 1 8 Ghz 1 GB cache 140 2 5 128 vectors 120 2 80 QR it svd QR it Chol svd 60 Gflop s Gflop s 100 1 5 MGS QR it svd QR it Chol svd 1 40 0 5 20 0 0 1 2 4 8 16 32 64 Vector size x 1 000 128 256 512 1 2 4 8 16 32 64 Vector size x 1 000 128 256 512 Slide 5 32 What if the basis is not orthonormal If we do not want to orthonormalize u P u c1 x1 c2 x2 cm xm Multiply by x1 xm to get u x1 c1 x1 x1 c2 x2 x1 cm xm x1 u xm c1 x1 xm c2 x2 xm cm xm xm These are the so called Petrov Galerkin conditions We saw examples of their use in optimization and PDE discretization e g FEM Slide 6 32 What if the basis is not orthonormal If we do not want to orthonormalize e g in FEM u P u c1 1 c2 2 c7 7 Multiply by 1 7 to get a 7x7 system a c1 1 c2 2 c7 7 i F i for i 1 7 Two examples of basis functions i The more i overlap the denser the resulting matrix Spectral element methods high order FEM Image taken from http www urel feec vutbr cz raida Slide 7 32 Stencil Computations K Datta S Kamil S Williams L Oliker J Shaft K Yelick Optimization and Performance Modeling of Stencil Computations on Modern Microprocessors SIAM Review 2008 http bebop cs berkeley edu pubs datta2008 stencil sirev pdf Slide 8 32 Part II Sparse matrix computations Slide 9 32 Sparse matrices Sparse matrix substantial part of the coefficients is zero Naturally arise from PDE discretizations finite differences FEM etc we saw examples in the 5 point finite difference operator 1 D piece wise linear FEM 10 Vertices are indexed eg 5 6 7 2 Row 6 will have 5 nonzero elements Row 3 for example will have 3 nonzeros A6 2 A6 5 A6 6 A6 7 and A6 10 A3 2 A3 3 A3 4 Slide 10 32 Sparse matrices In general 100 115 Degrees of freedom DOF associated for ex with vertices or edges faces etc are indexed A basis function is associated with every DOF unknown 332 10 A Petrov Galerkin condition equation is derived for every basis function representing a row in the resulting system 201 35 Only a few elements per row will be nonzero as the basis functions have local support eg row 10 using continuous piecewise linear FEM will have 6 nonzeroes A10 10 A10 35 A10 100 A10 332 A10 115 A10 201 physical intuition behind PDEs describe changes in physical processes describing discretizing these changes numerically based only on local neighboring information results in sparse matrices eg what happens at 10 is described by the physical state at 10 and the neighboring 35 201 115 100 and 332 Slide 11 32 Sparse matrices Can we take advantage of this sparse structure To solve for example very large problems To solve them efficiently Yes There are algorithms Linear solvers and preconditioners to cover some in the last 2 lectures Efficient data storage and implementation next Slide 12 32 Sparse matrix formats It pays to avoid storing the zeros Common sparse storage formats AIJ Compressed row column storage CRS CCS Compressed diagonal storage CDS for more see the Templates book http www netlib org linalg html templates node90 html SECTION00931000000000000000 Blocked versions why Slide 13 32 AIJ Stored in 3 arrays The same length No order implied I J AIJ 1 1 2 2 3 3 4 4 1 2 1 3 2 4 3 5 1 2 3 4 5 6 7 8 Slide 14 32 CRS Stored in 3 arrays J and AIJ the same length I representing rows is compressed I J AIJ 1 3 5 7 9 1 2 1 3 2 4 3 5 1 2 3 4 5 6 7 8 array I think of it as pointers to where next row starts CCS similar but J is compressed Slide 15 32 CDS For matrices with non zeros along sub diogonals Sub diagonal index subdiogonals 1 0 1 Slide 16 32 Performance Mat vec product Notoriously bad for running at just a fraction of the performance peak Why Consider Mat vec product for matrix in CRS for i 1 n for j I i I i 1 1 y i AIJ j x J j Slide 17 32 Performance Mat vec product Notoriously bad for running at just a fraction of the performance peak Why Consider Mat vec product for matrix in CRS for i 1 n for j I i I i 1 1 y i AIJ j x J j Irregular indirect memory access for x result in cache trashing performance often 100 115 10 peak 332 10 201 35 Slide 18 32 Performance Mat vec product Performance of mat vec products of various sizes on a 2 4 GHz Pentium 4 An example from Gahvari et al http bebop cs berkeley edu pubs gahvari2007 spmvbench spec pdf Slide 19 32 Performance Mat vec product How to improve the performance A common technique as done for dense linear algebra is blocking register cache next Index reordering in Part II Exploit special matrix structure eg symmetry bands other structures Slide 20 32 Block Compressed Row Storage BCRS Example of using 2x2 blocks BI BJ AIJ 1 3 4 1 2 3 1 0 0 4 2 3 5 0 6 7 8 9 Reduced storage for indexes Drawback add 0s What block size to choose BCRS for register blocking Discussion Slide 21 32 BCRS Slide 22 32 Cache blocking Improve cache reuse for x in Ax by splitting A into a set of sparse matrices e g Sparse matrix and its splitting For more info check SPARSITY An Optimization Framework for Sparse Matrix Kernels International Journal of High Performance Computing Applications 18 1 pp 135 158 February 2004 Eun Jin Im K Yelick R Vuduc Slide 23 32 Part III Reordering algorithms and Parallelization Slide 24 32 Reorder to preserve locality 100 115 332 10 201 35 eg Cuthill McKee Ordering start from arbitrary node say 10 and reorder 10 becomes 0 neighbours are ordered next to become 1 2 3 4 5 denote this as level …


View Full Document

UTK CS 594 - Sparse Matrices and Optimized Parallel Implementations

Documents in this Course
Load more
Loading Unlocking...
Login

Join to view Sparse Matrices and Optimized Parallel Implementations and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Sparse Matrices and Optimized Parallel Implementations and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?