Sparse Direct Methods on High Performance Computers

X. Sherry Li
[email protected]
http://crd.lbl.gov/~xiaoye

CS267/E233: Applications of Parallel Computing
March 14, 2007


Review of Gaussian Elimination (GE)

- Solving a system of linear equations Ax = b
- First step of GE:

      A = [ alpha  w^T ] = [ 1        0 ] [ alpha  w^T ]
          [ v      B   ]   [ v/alpha  I ] [ 0      C   ],   where C = B - v*w^T / alpha

- Repeat GE on C
- Results in the LU factorization A = LU
  - L lower triangular with unit diagonal, U upper triangular
- Then x is obtained by solving two triangular systems with L and U


Sparse GE

- Sparse matrices are ubiquitous
  - Example: A of dimension 10^5 with only 10-100 nonzeros per row
- Goal: store only the nonzeros and perform operations only on nonzeros
- Fill-in: an originally zero entry a_ij becomes nonzero in L or U
- Natural order: nonzeros = 233; Min. Degree order: nonzeros = 207


Compressed Column Storage (CCS)

- Also known as Harwell-Boeing format
- Store nonzeros columnwise contiguously
- 3 arrays: nzval (nonzero values), rowind (row index of each nonzero), colptr (start of each column in nzval)
- Storage: NNZ reals, NNZ + N + 1 integers
- Efficient for columnwise algorithms
- Ref: Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, R. Barrett et al.

Example (7 x 7 matrix with diagonal entries 1-7 and off-diagonal nonzeros a-l):

      nzval  = 1 c 2 d e 3 k a 4 h b f 5 i l 6 g j 7
      rowind = 1 3 2 3 4 3 7 1 4 6 2 4 5 6 7 6 5 6 7
      colptr = 1 3 6 8 11 16 17 20


Numerical Stability: Need for Pivoting

- One step of GE:

      A = [ alpha  w^T ] = [ 1        0 ] [ alpha  w^T ]
          [ v      B   ]   [ v/alpha  I ] [ 0      C   ],   where C = B - v*w^T / alpha

- If alpha is small, some entries of B may be lost in the update forming C
- Pivoting: swap the current diagonal entry with a larger entry from the rest of the matrix
- Goal: prevent the entries of C from getting too large


Dense versus Sparse GE

- Dense GE: Pr A Pc = LU
  - Pr and Pc are permutations chosen to maintain stability
  - Partial pivoting suffices in most cases: Pr A = LU
- Sparse GE: Pr A Pc = LU
  - Pr and Pc are chosen to maintain stability and preserve sparsity
  - Dynamic pivoting causes dynamic structural change
  - Alternatives: threshold pivoting, static pivoting, ...
Algorithmic Issues in Sparse GE

- Minimize the number of fill-ins, maximize parallelism
  - Sparsity structure of L & U depends on that of A, which can be changed by row/column permutations (vertex re-labeling of the underlying graph)
  - Ordering (combinatorial algorithms; NP-complete to find the optimum [Yannakakis '83]; use heuristics)
- Predict the fill-in positions in L & U
  - Symbolic factorization (combinatorial algorithms)
- Perform the factorization and triangular solutions
  - Numerical algorithms (floating-point operations only on nonzeros)
  - How and when to pivot?
  - Usually dominate the total runtime


Ordering: Minimum Degree (1/3)

- Local greedy: minimize an upper bound on fill-in
- [Figure: eliminating vertex 1 joins its neighbors i, j, k, l into a clique, creating the corresponding fill-in entries in rows/columns i, j, k, l of the matrix]


Minimum Degree Ordering (2/3)

- Greedy approach: do the best locally
  - Best for modest-size problems
  - Hard to parallelize
- At each step:
  - Eliminate the vertex with the smallest degree
  - Update the degrees of its neighbors
- A straightforward implementation is slow and requires too much memory
  - Newly added edges outnumber the eliminated vertices


Minimum Degree Ordering (3/3)

- Use the quotient graph as a compact representation [George/Liu '78]
  - The collection of cliques resulting from the eliminated vertices affects the degree of an uneliminated vertex
  - Represent each connected component of the eliminated subgraph by a single "supervertex"
- Storage required to implement the quotient-graph model is bounded by the size of A
- Large body of literature on implementation variants: Tinney/Walker '67, George/Liu '79, Liu '85, Amestoy/Davis/Duff '94, Ashcraft '95, Duff/Reid '95, et al.
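The per-step loop above can be sketched in a few lines of pure Python. This is an illustrative toy (the name `min_degree_order` is made up), implementing the plain greedy rule on an adjacency-set graph without the quotient-graph optimization, so it stores the fill edges explicitly:

```python
def min_degree_order(adj):
    """Greedy minimum-degree ordering of an undirected graph.

    adj: dict mapping vertex -> set of neighbors.
    At each step the vertex of smallest current degree is eliminated
    and its remaining neighbors are joined into a clique, modeling the
    fill-in that the elimination creates. Ties are broken by label.
    """
    g = {v: set(nbrs) for v, nbrs in adj.items()}  # work on a copy
    order = []
    while g:
        v = min(g, key=lambda u: (len(g[u]), u))   # smallest degree
        nbrs = g.pop(v)
        for u in nbrs:
            g[u].discard(v)
            g[u] |= nbrs - {u}                     # add fill edges
        order.append(v)
    return order

# Star graph: center 0 with leaves 1..4. Eliminating the center first
# would clique all the leaves; minimum degree eliminates leaves first.
star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
print(min_degree_order(star))
```

Storing every fill edge is exactly the memory blow-up the slide warns about; the quotient-graph representation avoids it by keeping eliminated cliques implicit.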
Ordering: Nested Dissection (1/3)

- Model problem: discretized system Ax = b from certain PDEs, e.g., a 5-point stencil on an n x n grid, N = n^2
- Factorization flops: O(n^3) = O(N^(3/2))
- Theorem: the ND ordering gives optimal complexity in exact arithmetic [George '73, Hoffman/Martin/Ross]


ND Ordering (2/3)

- Generalized nested dissection [Lipton/Rose/Tarjan '79]
  - Global graph partitioning: top-down, divide-and-conquer
  - Best for the largest problems
  - Parallel code available: e.g., ParMETIS
- [Figure: the first-level separator splits the grid in two; recurse on each part]
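The divide-and-conquer recursion can be sketched directly on the model problem's grid. This toy (the function name `nested_dissection` is made up for illustration) splits the current subgrid along its longer dimension, recursively orders the two halves, and numbers the separator line last, which is what puts separators at the root of the elimination tree:

```python
def nested_dissection(rows, cols):
    """Recursive ND ordering of an n x n grid graph (5-point stencil).

    rows, cols: lists of grid indices of the current subgrid.
    Returns the elimination order as a list of (row, col) cells,
    with each separator ordered after the two parts it separates.
    """
    if not rows or not cols:
        return []
    if len(rows) <= 2 and len(cols) <= 2:          # small base case
        return [(i, j) for i in rows for j in cols]
    if len(cols) >= len(rows):                     # vertical separator
        mid = cols[len(cols) // 2]
        part1 = nested_dissection(rows, [c for c in cols if c < mid])
        part2 = nested_dissection(rows, [c for c in cols if c > mid])
        sep = [(i, mid) for i in rows]
    else:                                          # horizontal separator
        mid = rows[len(rows) // 2]
        part1 = nested_dissection([r for r in rows if r < mid], cols)
        part2 = nested_dissection([r for r in rows if r > mid], cols)
        sep = [(mid, j) for j in cols]
    return part1 + part2 + sep                     # separator last

order = nested_dissection(list(range(5)), list(range(5)))
print(order[-5:])   # the middle column (the top-level separator) comes last
```

Because the two halves share no edges once the separator is removed, they can also be factored independently, which is the source of the parallelism exploited by codes like ParMETIS-driven solvers.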