Berkeley COMPSCI C267 - Sparse Matrix Methods on High Performance Computers - D2320617

Home> Schools> University of California, Berkeley> Computer Science (COMPSCI) > COMPSCI C267> Sparse Matrix Methods on High Performance Computers

DOC PREVIEW

Berkeley COMPSCI C267 - Sparse Matrix Methods on High Performance Computers

School name University of California, Berkeley

Course Compsci C267- Applications of Parallel Computers

Pages 50

This preview shows page 1-2-3-24-25-26-27-48-49-50 out of 50 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 50 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 50 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 50 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 50 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 50 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 50 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 50 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 50 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 50 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 50 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 50 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

Sparse Matrix Methods on High Performance Computers X Sherry Li xsli lbl gov http crd lbl gov xiaoye CS267 EngC233 Applications of Parallel Computers March 16 2010 Sparse linear solvers for unstructured matrices Solving a system of linear equations Ax b Iterative methods A is not changed read only Key kernel sparse matrix vector multiply Easier to optimize and parallelize Low algorithmic complexity but may not converge for hard problems Direct methods A is modified factorized Harder to optimize and parallelize Numerically robust but higher algorithmic complexity Often use direct method to precondition iterative method Increasingly more interest on hybrid methods CS267 2 Lecture Plan Direct methods Sparse factorization Sparse compressed formats Deal with many graph algorithms directed undirected graphs paths elimination trees depth first search heuristics for NP hard problems cliques graph partitioning cache friendly dense matrix kernels and more Preconditioners Incomplete factorization Hybrid method Domain decomposition CS267 3 Available sparse factorization codes Survey of different types of factorization codes http crd lbl gov xiaoye SuperLU SparseDirectSurvey pdf LLT s p d LDLT symmetric indefinite LU nonsymmetric QR least squares Sequential shared memory multicore distributed memory out of core Distributed memory codes usually MPI based SuperLU DIST Li Demmel Grigori accessible from PETSc Trilinos MUMPS PasTiX WSMP CS267 4 Review of Gaussian Elimination GE First step of GE A v wT 1 B v 0 I 0 wT C v w T C B Repeats GE on C Results in LU factorization A LU L lower triangular with unit diagonal U upper triangular Then x is obtained by solving two triangular systems with L and U CS267 5 Sparse GE Sparse matrices are ubiquitous Example A of dimension 106 10 100 nonzeros per row Nonzero costs flops and memory Scalar algorithm 3 nested loops Can re arrange loops to get different variants left looking right looking for i 1 to n column scale A i for k i 1 to n s t A i k 0 for j i 1 to n s t A j i 0 A j k A j k A j i A i k 1 U 2 3 4 L 5 6 7 Typical fill ratio 10x for 2D problems 30 50x for 3D problems CS267 6 Early Days Envelope Profile solver Define bandwidth for each row or column A little more sophisticated than band solver Use Skyline storage SKS Lower triangle stored row by row Upper triangle stored column by column In each row column first nonzero defines a profile All entries within the profile some may be zeros are stored All fill ins are confined in the profile A good ordering would be based on bandwidth reduction E g reverse Cuthill McKee CS267 7 RCM ordering Breadth first search numbering by levels then reverse CS267 8 Is Profile Solver Good Enough Example 3 orderings natural RCM Minimumdegree Envelop size sum of bandwidths After LU envelop would be entirelyEnv filled 61066 Env 31775 CS267 Env 22320 NNZ L MD 12259 9 A General Data Structure Compressed Column Storage CCS Also known as Harwell Boeing format Store nonzeros columnwise contiguously 3 arrays Storage NNZ reals NNZ N 1 integers Efficient for columnwise algorithms nzval rowind colptr 1 c 2 d e 3 k a 4 h 1 2 c d e a b 3 k 4 f 5 h i l 6 g j 7 b f 5 i l 6 g j 7 1 3 2 3 4 3 7 1 4 6 2 4 5 67 6 5 6 7 1 3 6 8 11 16 17 20 Templates for the Solution of Linear Systems Building Blocks for Iterative Methods R Barrett et al CS267 10 General Sparse Solver Use blocked CRS or CCS and any ordering method Leave room for fill ins symbolic factorization Exploit supernodal dense structures in the factors Can use Level 3 BLAS Reduce inefficient indirect addressing scatter gather Reduce graph traversal time using a coarser graph CS267 11 Numerical Stability Need for Pivoting One step of GE A v wT 1 B v 0 I 0 wT C v w T C B If is small some entries in B may be lost from addition Pivoting swap the current diagonal entry with a larger entry from the other part of the matrix C Goal prevent CS267 from getting too large 12 Dense versus Sparse GE Dense GE Pr A Pc LU Pr and Pc are permutations chosen to maintain stability Partial pivoting suffices in most cases Pr A LU Sparse GE Pr A Pc LU Pr and Pc are chosen to maintain stability and preserve sparsity and increase parallelism Dynamic pivoting causes dynamic structural change Alternatives threshold pivoting static x s x pivoting x b CS267 x x x 13 Algorithmic Issues in Sparse GE Minimize number of fill ins maximize parallelism Sparsity structure of L U depends on that of A which can be changed by row column permutations vertex re labeling of the underlying graph Ordering combinatorial algorithms NP complete to find optimum Yannakis 83 use heuristics Predict the fill in positions in L U Symbolic factorization combinatorial algorithms Perform factorization and triangular solutions Numerical algorithms F P operations only on nonzeros How and when to pivot Usually dominate the total runtime CS267 14 Ordering RCM is good for profile solver General unstructured methods Minimum degree locally greedy Nested dissection divided conquer suitable for parallelism CS267 15 Ordering Minimum Degree 1 3 Local greedy minimize upper bound on fill in 1 i j k l x x x x x x i j x k x l x 1 i Eliminate 1 i j 1 k j k l x x x x x x i j x k x i j l k l x Eliminate 1 l CS267 16 Minimum Degree Ordering 2 3 At each step Eliminate the vertex with the smallest degree Update degrees of the neighbors Greedy principle do the best locally Best for modest size problems Hard to parallelize Straightforward implementation is slow and requires too much memory Newly added edges are more than eliminated vertices CS267 17 Minimum Degree Ordering 3 3 Use quotient graph as a compact representation George Liu 78 Collection of cliques resulting from the eliminated vertices affects the degree of an uneliminated vertex Represent each connected component in the eliminated subgraph by a single supervertex Storage required to implement QG model is bounded by size of A Large body of literature on implementation variants Tinney Walker 67 George Liu 79 Liu 85 Amestoy Davis Duff 94 Ashcraft 95 Duff Reid 95 et al CS267 18 Nested Dissection Ordering 1 3 Model problem discretized system Ax b from certain PDEs e g 5 point stencil on k x k grid N k 2 Theorem ND ordering gave optimal complexity in exact arithmetic George 73 Hoffman Martin Rose Eisenstat Schultz and Sherman 2D kxk N grids O N logN memory O N3 2 operations 3D kxkxk N grids O N4 3 memory O N2 operations CS267 19 ND Ordering 2 3 Generalized nested dissection Lipton Rose Tarjan 79 Global graph

View Full Document

Berkeley COMPSCI C267 - Sparse Matrix Methods on High Performance Computers

Sign up for free to view:

This document and 3 million+ documents and flashcards
High quality study guides, lecture notes, practice exams
Course Packets handpicked by editors offering a comprehensive review of your courses
Better Grades Guaranteed


School:
Email:
New Password:
Confirm Password:

This preview shows page 1-2-3-24-25-26-27-48-49-50 out of 50 pages.

Berkeley COMPSCI C267 - Sparse Matrix Methods on High Performance Computers

Sign up for free to view:

Please select your school