Sparse Direct Solvers on High Performance Computers

Contents:
• Review of Gaussian Elimination (GE)
• Sparse GE
• Compressed Column Storage (CCS)
• Numerical Stability: Need for Pivoting
• Dense versus Sparse GE
• Algorithmic Issues in Sparse GE
• Numerical Pivoting
• Static Pivoting via Weighted Bipartite Matching
• Numerical Accuracy: GESP versus GEPP
• Structural Gaussian Elimination - Symmetric Case
• Minimum Degree Ordering (1/2), (2/2)
• Nested Dissection Ordering (1/2), (2/2)
• Ordering for LU (unsymmetric)
• Structural Gaussian Elimination - Unsymmetric Case
• Results of Markowitz Ordering
• High Performance Issues: Reduce Cost of Memory Access & Communication
• Blocking in Sparse GE
• Speedup Over Un-blocked Code
• Parallel Task Scheduling for SMPs (in SuperLU_MT)
• Parallelism from Separator Tree
• Parallel Symbolic Factorization [Grigori/Demmel/Li '06]
• Matrix Distribution on Large Distributed-memory Machine
• 2D Block Cyclic Layout for Sparse L and U (in SuperLU_DIST)
• Scalability and Isoefficiency Analysis
• Scalability
• Irregular Matrices
• Adoptions of SuperLU
• Summary
• Open Problems
• Model Problem
• Superfast Factorization: Exploit Low-rank Property
• Results for the Model Problem
• Research Issues
• Application 1: Quantum Mechanics
• SuperLU_DIST as Preconditioner
• One Block Timings on IBM SP
• Application 2: Accelerator Cavity Design
• DDS47, Linear Elements
• Largest Eigen Problem Solved So Far
X. Sherry Li ([email protected])
http://crd.lbl.gov/~xiaoye
CS267: Applications of Parallel Computers
April 25, 2006

Slide 2: Review of Gaussian Elimination (GE)
• Solving a system of linear equations Ax = b
• First step of GE:

      A = [ α  wᵀ ] = [  1    0 ] [ α  wᵀ ]
          [ v   B ]   [ v/α   I ] [ 0   C ] ,   where C = B - (1/α)·v·wᵀ

• Repeat GE on C
• The result is the LU factorization A = LU
  - L is lower triangular with unit diagonal, U is upper triangular
• Then x is obtained by solving two triangular systems with L and U

Slide 3: Sparse GE
• Sparse systems are ubiquitous
  - Example: A of dimension 10^5, with only 10-100 nonzeros per row
• Goal: store only the nonzeros and perform operations only on the nonzeros
• Fill-in: an originally zero entry a_ij becomes nonzero in L or U
• Example orderings: natural order gives nonzeros = 233; minimum degree order gives nonzeros = 207

Slide 4: Compressed Column Storage (CCS)
• Also known as the Harwell-Boeing format
• Store nonzeros columnwise, contiguously
• Three arrays: nzval, rowind, colptr. For the 7×7 example matrix

      [ 1  .  .  a  .  .  . ]
      [ .  2  .  .  b  .  . ]
      [ c  d  3  .  .  .  . ]
      [ .  e  .  4  f  .  . ]
      [ .  .  .  .  5  .  g ]
      [ .  .  .  h  i  6  j ]
      [ .  .  k  .  l  .  7 ]

  the arrays are

      nzval  = [ 1 c 2 d e 3 k a 4 h b f 5 i l 6 g j 7 ]
      rowind = [ 1 3 2 3 4 3 7 1 4 6 2 4 5 6 7 6 5 6 7 ]
      colptr = [ 1 3 6 8 11 16 17 20 ]

• Storage: NNZ reals, NNZ + N + 1 integers
• Efficient for columnwise algorithms
• Ref: "Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods", R. Barrett et al.

Slide 5: Numerical Stability: Need for Pivoting
• One step of GE:

      A = [ α  wᵀ ] = [  1    0 ] [ α  wᵀ ]
          [ v   B ]   [ v/α   I ] [ 0   C ] ,   where C = B - (1/α)·v·wᵀ

• If α is small, some entries of B may be lost in the update
• Pivoting: swap the current diagonal entry with a larger entry from the rest of the matrix
• Goal: prevent the entries of L and U from getting too large

Slide 6: Dense versus Sparse GE
• Dense GE: Pr·A·Pc = LU
  - Pr and Pc are permutations chosen to maintain stability
  - Partial pivoting suffices in most cases: Pr·A = LU
• Sparse GE: Pr·A·Pc = LU
  - Pr and Pc are chosen to maintain stability and to preserve sparsity

Slide 7: Algorithmic Issues in Sparse GE
• Minimize the number of fill-ins, maximize parallelism
  - The sparsity structure of L & U depends on that of A, which can be changed by row/column permutations (vertex relabeling of the underlying graph)
  - Ordering (combinatorial algorithms; NP-complete to find the optimum [Yannakakis '83]; use heuristics)
• Predict the fill-in positions in L & U
  - Symbolic factorization (combinatorial algorithms)
• Perform the factorization and triangular solutions
  - Numerical algorithms (floating-point operations only on nonzeros)
  - How and when to pivot?
  - Usually dominate the total runtime

Slide 8: Numerical Pivoting
• The goal of pivoting is to control element growth in L & U, for stability
  - For sparse factorizations, the pivoting rule is often relaxed to trade stability for better sparsity and parallelism (e.g., threshold pivoting, static pivoting)
• Partial pivoting, used in sequential SuperLU (GEPP)
  - Can force diagonal pivoting (using a diagonal threshold)
  - Hard to implement scalably for sparse factorization
• Static pivoting, used in SuperLU_DIST (GESP)
  - Before factoring, scale and permute A to maximize the diagonal: Pr·Dr·A·Dc = A'
  - Pr is found by a weighted bipartite matching algorithm on G(A)
  - During the factorization A' = LU, replace tiny pivots by √ε·‖A‖, without changing the data structures for L & U
  - If needed, use a few steps of iterative refinement to improve the first solution
  - Quite stable in practice

Slide 9: Static Pivoting via Weighted Bipartite Matching
• Maximize the diagonal entries: sum, or product (sum of logs)
• Hungarian algorithm or the like (MC64): O(n·(m+n)·log n)
• Auction algorithm (more parallel): O(n·m·log(n·C))
  - J. Riedy's dissertation (expected Dec. 2006?)
• [Figure: bipartite graph G(A) of rows and columns, and the matrix A with a row permutation moving large entries to the diagonal]

Slide 10: Numerical Accuracy: GESP versus GEPP
• [Figure: accuracy comparison of GESP and GEPP]

Slide 11: Structural Gaussian Elimination - Symmetric Case
• Undirected graph model
• After a vertex is eliminated, all its neighbors become a clique
• The edges of the clique are the potential fills (an upper bound!)
• [Figure: eliminating vertex 1 from a graph with neighbors i, j, k, and the corresponding fill in the matrix]

Slide 12: Minimum Degree Ordering (1/2)
• Greedy approach: do the best locally
  - Best for modest-size problems
  - Hard to parallelize
• At each step:
  - Eliminate the vertex with the smallest degree
  - Update the degrees of its neighbors
• A straightforward implementation is slow and requires too much memory
  - Newly
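The elimination step described on the minimum degree slide (remove the smallest-degree vertex, then turn its neighborhood into a clique) can be sketched as follows. This is a minimal illustration, not SuperLU's ordering code: `minimum_degree_order` is a hypothetical helper name, and production orderings use quotient graphs and approximate degrees precisely to avoid the slow, memory-hungry behavior the slide warns about.

```python
# Sketch of greedy minimum degree ordering on an undirected graph
# stored as {vertex: set of neighbors}. Hypothetical helper, for
# illustration only; not how SuperLU/AMD implement it.

def minimum_degree_order(adj):
    """Return an elimination order chosen greedily by smallest degree."""
    # Work on a copy so the caller's graph is untouched.
    g = {v: set(nbrs) for v, nbrs in adj.items()}
    order = []
    while g:
        # Eliminate the vertex with the fewest remaining neighbors.
        v = min(g, key=lambda u: len(g[u]))
        nbrs = g.pop(v)
        # Eliminating v makes its neighbors a clique; the newly added
        # edges are the potential fill-ins, and neighbor degrees change.
        for u in nbrs:
            g[u].discard(v)
            g[u] |= (nbrs - {u})
        order.append(v)
    return order

# A path graph 1-2-3-4-5 is eliminated endpoint-first with no fill:
# minimum_degree_order({1: {2}, 2: {1, 3}, 3: {2, 4},
#                       4: {3, 5}, 5: {4}})  # → [1, 2, 3, 4, 5]
```

Eliminating a path in this order produces no fill at all, which is why minimum degree does well on banded and near-tree structures; the explicit clique formation is what makes naive implementations costly on denser graphs.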
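The Compressed Column Storage layout from the CCS slide can be exercised with a small sparse matrix-vector product. This is an illustrative sketch under the slide's conventions (1-based `rowind`/`colptr`, as in the Harwell-Boeing format); `ccs_matvec` is a hypothetical name, not a SuperLU routine.

```python
# Sparse y = A*x for a square n-by-n matrix stored in Compressed
# Column Storage (CCS) with the 1-based indexing used on the slide.
# Illustrative sketch only.

def ccs_matvec(n, nzval, rowind, colptr, x):
    """n: matrix dimension; nzval/rowind/colptr as on the CCS slide."""
    y = [0.0] * n
    for j in range(n):                          # loop over columns
        # Column j's nonzeros occupy positions colptr[j] .. colptr[j+1]-1
        # (1-based), converted here to 0-based Python indices.
        for p in range(colptr[j] - 1, colptr[j + 1] - 1):
            i = rowind[p] - 1                   # row of this nonzero
            y[i] += nzval[p] * x[j]
    return y

# Example: A = [[1, 2], [0, 3]] stored columnwise, applied to x = [1, 1].
# Column 1 holds entry (1,1)=1; column 2 holds (1,2)=2 and (2,2)=3.
print(ccs_matvec(2, [1.0, 2.0, 3.0], [1, 1, 2], [1, 2, 4], [1.0, 1.0]))
# → [3.0, 3.0]
```

The columnwise loop order touches `nzval` and `rowind` contiguously, which is the access pattern the slide calls "efficient for columnwise algorithms"; the storage count is visible in the arrays themselves (NNZ reals in `nzval`, NNZ + N + 1 integers across `rowind` and `colptr`).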