Sparse Direct Methods on High Performance Computers

X. Sherry Li, [email protected]
http://crd.lbl.gov/~xiaoye
CS267/EngC233: Applications of Parallel Computing
April 6, 2009

Outline
- Sparse linear solvers
- Available sparse codes
- Review of Gaussian Elimination (GE)
- Sparse GE
- Envelope (Profile) solver
- RCM ordering
- Is Profile Solver Good Enough?
- A General Data Structure: Compressed Column Storage (CCS)
- General Sparse Solver
- Numerical Stability: Need for Pivoting
- Dense versus Sparse GE
- Algorithmic Issues in Sparse GE
- Ordering
- Ordering: Minimum Degree (1/3), (2/3), (3/3)
- Nested Dissection Ordering (1/3), (2/3), (3/3)
- Ordering for LU (unsymmetric)
- High Performance Issues: Reduce Cost of Memory Access & Communication
- Speedup Over Un-blocked Code
- Source of parallelism: Elimination Tree
- Source of parallelism: Separator Tree
- Source of parallelism: global partition and distribution
- Major stages of sparse LU
- SuperLU_MT [Li/Demmel/Gilbert]
- SuperLU_DIST [Li/Demmel/Grigori]
- Multicore platforms
- Single-core, single-threaded BLAS
- Benchmark matrices
- Clovertown
- VictoriaFalls – multicore + multithread
- Larger matrices
- Strong scaling: IBM Power5 (1.9 GHz)
- Weak scaling
- Analysis of scalability and isoefficiency
- Summary
- Open problems
- Extra Slides
  - Static Pivoting via Weighted Bipartite Matching
  - Numerical Accuracy: GESP versus GEPP
  - Parallel Symbolic Factorization [Grigori/Demmel/Li ’06]
  - Application 1: Quantum Mechanics
  - Quantum Mechanics (cont.)
  - SuperLU_DIST as Preconditioner
  - One Block Timings on IBM SP
  - Application 2: Accelerator Cavity Design
  - Accelerator (cont.)
  - Slide 51
  - DDS47, Linear Elements
  - Largest Eigen Problem Solved So Far
  - Superfast Factorization: Exploit Low-rank Property
  - Results for the Model Problem
  - Research Issues

Sparse linear solvers

Solving a system of linear equations Ax = b:
- Iterative methods
  - A is not changed (read-only)
  - Key kernel: sparse matrix-vector multiply
  - Easier to optimize and parallelize
  - Low algorithmic complexity, but may not converge for hard problems
- Direct methods
  - A is modified (factorized)
  - Harder to optimize and parallelize
  - Numerically robust, but higher algorithmic complexity
- Often, a direct method is used to precondition an iterative method

Available sparse codes

- Survey of the different types of factorization codes:
  http://crd.lbl.gov/~xiaoye/SuperLU/SparseDirectSurvey.pdf
  - LL^T (s.p.d.), LDL^T (symmetric indefinite), LU (nonsymmetric), QR (least squares)
  - Sequential, shared-memory (multicore), distributed-memory, out-of-core
- Distributed-memory codes are usually MPI-based:
  - SuperLU_DIST [Li/Demmel/Grigori], accessible from PETSc and Trilinos
  - MUMPS, PaStiX, WSMP, ...

Review of Gaussian Elimination (GE)

- First step of GE:

  $A = \begin{pmatrix} \alpha & w^T \\ v & B \end{pmatrix}
     = \begin{pmatrix} 1 & 0 \\ v/\alpha & I \end{pmatrix}
       \begin{pmatrix} \alpha & w^T \\ 0 & C \end{pmatrix}$,
  where $C = B - \frac{1}{\alpha} v w^T$

- Repeat GE on C
- The result is an LU factorization, A = LU
  - L lower triangular with unit diagonal, U upper triangular
- x is then obtained by solving two triangular systems with L and U

Sparse GE

- Sparse matrices are ubiquitous
  - Example: A of dimension 10^5 with only 10 to 100 nonzeros per row
- Every nonzero costs flops and memory
- Scalar algorithm: 3 nested loops
  - Can re-arrange the loops to get different variants: left-looking, right-looking, ...
    (a runnable dense version of this loop is sketched just below)

    for i = 1 to n
        column_scale( A(:,i) )
        for k = i+1 to n s.t. A(i,k) != 0
            for j = i+1 to n s.t. A(j,i) != 0
                A(j,k) = A(j,k) - A(j,i) * A(i,k)

- Typical fill ratio: 10x for 2D problems, 30-50x for 3D problems
  (Figure: nonzero structure of a sparse matrix and its L and U factors, showing fill-in.)
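The triple loop above is easy to state as runnable (dense) code. Here is a minimal NumPy sketch of that update, assuming nonzero pivots and no pivoting; the name lu_right_looking is mine, not from the slides. A sparse code performs exactly these updates, but only on stored nonzeros plus the fill-in positions:

    import numpy as np

    def lu_right_looking(A):
        # In-place LU without pivoting, mirroring the slide's loop:
        # scale the pivot column, then rank-1 update of the trailing
        # submatrix. Returns L (strict lower part, unit diagonal
        # implied) and U (upper part) packed into one array.
        A = np.array(A, dtype=float)
        n = A.shape[0]
        for i in range(n):
            A[i+1:, i] /= A[i, i]                 # column_scale( A(:,i) )
            A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])
        return A

    # Tiny check: reassemble A from the packed factors
    A0 = np.array([[4., 1., 0.], [1., 3., 1.], [0., 1., 2.]])
    F = lu_right_looking(A0)
    L = np.tril(F, -1) + np.eye(3)
    U = np.triu(F)
    assert np.allclose(L @ U, A0)

Without pivoting, a small pivot alpha can destroy accuracy; that is the concern addressed in the "Numerical Stability: Need for Pivoting" slide below.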
Envelope (Profile) solver

- Define a bandwidth for each row or column
  - A little more sophisticated than a band solver
- Use skyline storage (SKS)
  - Lower triangle stored row by row, upper triangle stored column by column
  - In each row (column), the first nonzero defines a profile
  - All entries within the profile (some may be zeros) are stored
  - All fill-ins are confined to the profile
- A good ordering would be based on bandwidth reduction
  - E.g., (reverse) Cuthill-McKee

RCM ordering

- Breadth-first search, numbering by levels, then reverse the numbering

Is Profile Solver Good Enough?

- Example: 3 orderings (natural, RCM, MD)
- Envelope size = sum of bandwidths
- After LU, the envelope would be entirely filled
  (Figure: spy plots of the three orderings; Env = 31775, Env = 22320, Env = 61066; NNZ(L, MD) = 122598.)

A General Data Structure: Compressed Column Storage (CCS)

- Also known as the Harwell-Boeing format
- Store nonzeros columnwise, contiguously
- 3 arrays: nzval (values), rowind (row indices), colptr (column pointers)
- Storage: NNZ reals, NNZ + N + 1 integers
- Efficient for columnwise algorithms
- Reference: "Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods", R. Barrett et al.

Example (diagonal entries 1-7, off-diagonal entries a-l; a runnable version of this example is sketched below, after the last slide):

    [ 1  .  .  a  .  .  . ]
    [ .  2  .  .  b  .  . ]
    [ c  d  3  .  .  .  . ]
    [ .  e  .  4  f  .  . ]
    [ .  .  .  .  5  .  g ]
    [ .  .  .  h  i  6  j ]
    [ .  .  k  .  l  .  7 ]

    nzval  = 1 c 2 d e 3 k a 4 h b f 5 i l 6 g j 7
    rowind = 1 3 2 3 4 3 7 1 4 6 2 4 5 6 7 6 5 6 7
    colptr = 1 3 6 8 11 16 17 20

General Sparse Solver

- Use (blocked) CRS or CCS, and any ordering method
- Leave room for fill-ins! (symbolic factorization)
- Exploit "supernodal" (dense) structures in the factors
  - Can use Level 3 BLAS
  - Reduces inefficient indirect addressing (scatter/gather)
  - Reduces graph-traversal time by using a coarser graph

Numerical Stability: Need for Pivoting

- One step of GE (as above) computes $C = B - \frac{1}{\alpha} v w^T$
- If the pivot $\alpha$ is small, some entries in B may be lost in the update
- Pivoting: swap the current diagonal entry with a larger entry from the other part of the matrix
- Goal: prevent the entries of C from getting too large

Dense versus Sparse GE

- Dense GE: Pr A Pc = LU
  - Pr and Pc are permutations chosen to maintain stability
  - Partial pivoting suffices in most cases: Pr A = LU
- Sparse GE: Pr A Pc = LU
  - Pr and Pc are chosen to maintain stability and preserve sparsity
  - Dynamic pivoting causes dynamic structural change
  - Alternatives: threshold pivoting, static pivoting, ...

Algorithmic Issues in Sparse GE

- Minimize the number of fill-ins, maximize parallelism
- The sparsity structure of L & U depends on that of A
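To make the CCS example and the ordering discussion concrete, here is a minimal SciPy sketch: scipy.sparse.csc_matrix is the same CCS format with 0-based indices, and scipy.sparse.linalg.splu wraps the sequential SuperLU library discussed later in the deck. The numeric values assigned to the letter entries a-l are arbitrary placeholders of mine, chosen only to make the matrix diagonally dominant (hence nonsingular):

    import numpy as np
    from scipy.sparse import csc_matrix
    from scipy.sparse.linalg import splu

    # The 7x7 CCS example above, with each entry of rowind/colptr
    # shifted down by one for scipy's 0-based indexing. Letters a-l
    # get arbitrary small values (an assumption, not from the slides).
    v = {ch: 0.1 * (t + 1) for t, ch in enumerate('abcdefghijkl')}
    nzval  = [1, v['c'], 2, v['d'], v['e'], 3, v['k'], v['a'], 4, v['h'],
              v['b'], v['f'], 5, v['i'], v['l'], 6, v['g'], v['j'], 7]
    rowind = [0, 2, 1, 2, 3, 2, 6, 0, 3, 5, 1, 3, 4, 5, 6, 5, 4, 5, 6]
    colptr = [0, 2, 5, 7, 10, 15, 16, 19]
    A = csc_matrix((np.array(nzval), rowind, colptr), shape=(7, 7))
    assert A.nnz == 19          # storage: NNZ reals, NNZ + N + 1 integers

    # Factor with SuperLU; permc_spec selects the fill-reducing column
    # ordering, so nnz(L) + nnz(U) generally differs between orderings,
    # illustrating the last slide's point.
    for order in ('NATURAL', 'COLAMD'):
        lu = splu(A, permc_spec=order)
        print(order, 'nnz(L)+nnz(U) =', lu.nnz)

    x = splu(A).solve(np.ones(7))   # two triangular solves with L and U

For the bandwidth-reduction orderings above, SciPy also provides scipy.sparse.csgraph.reverse_cuthill_mckee, which implements the BFS-by-levels-then-reverse RCM scheme.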