Lecture 10: Cholesky Decomposition; Banded Matrices; Considerations for High Performance
AMath 352
Mon., Apr. 19

Matrices for which Pivoting is not Required

If A is symmetric and positive definite (SPD), then pivoting is not required. Instead of factoring such a matrix as LU, one may factor it as LL^T.

Example:
\[
\begin{pmatrix} 2 & -1 & 0 \\ -1 & 2 & -1 \\ 0 & -1 & 2 \end{pmatrix}
\xrightarrow{L_1}
\begin{pmatrix} 2 & -1 & 0 \\ 0 & 3/2 & -1 \\ 0 & -1 & 2 \end{pmatrix}
\xrightarrow{L_2}
\begin{pmatrix} 2 & -1 & 0 \\ 0 & 3/2 & -1 \\ 0 & 0 & 4/3 \end{pmatrix},
\]
\[
L_1 = \begin{pmatrix} 1 & 0 & 0 \\ 1/2 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \quad
L_2 = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 2/3 & 1 \end{pmatrix}.
\]
\[
A = LU, \quad
L = \begin{pmatrix} 1 & 0 & 0 \\ -1/2 & 1 & 0 \\ 0 & -2/3 & 1 \end{pmatrix}, \quad
U = \begin{pmatrix} 2 & -1 & 0 \\ 0 & 3/2 & -1 \\ 0 & 0 & 4/3 \end{pmatrix}.
\]

Cholesky Factorization

With A = LU as above, we can write
\[
U = \begin{pmatrix} 2 & 0 & 0 \\ 0 & 3/2 & 0 \\ 0 & 0 & 4/3 \end{pmatrix}
\begin{pmatrix} 1 & -1/2 & 0 \\ 0 & 1 & -2/3 \\ 0 & 0 & 1 \end{pmatrix}
= D L^T.
\]
Thus \(A = LDL^T = (LD^{1/2})(LD^{1/2})^T =: \tilde{L}\tilde{L}^T\), with
\[
\tilde{L} = \begin{pmatrix} \sqrt{2} & 0 & 0 \\ -\sqrt{2}/2 & \sqrt{3/2} & 0 \\ 0 & -\tfrac{2}{3}\sqrt{3/2} & 2/\sqrt{3} \end{pmatrix}.
\]
This is called the Cholesky factorization.

Banded Matrices

Many applications give rise to banded matrices. For a tridiagonal matrix, each stage of elimination costs about 3 ops:
\[
\begin{pmatrix} a_1 & b_1 & 0 & 0 \\ b_1 & a_2 & b_2 & 0 \\ 0 & b_2 & a_3 & b_3 \\ 0 & 0 & b_3 & a_4 \end{pmatrix}
\xrightarrow{3\text{ ops}}
\begin{pmatrix} a_1 & b_1 & 0 & 0 \\ 0 & \tilde{a}_2 & b_2 & 0 \\ 0 & b_2 & a_3 & b_3 \\ 0 & 0 & b_3 & a_4 \end{pmatrix}
\xrightarrow{3\text{ ops}}
\begin{pmatrix} a_1 & b_1 & 0 & 0 \\ 0 & \tilde{a}_2 & b_2 & 0 \\ 0 & 0 & \tilde{a}_3 & b_3 \\ 0 & 0 & b_3 & a_4 \end{pmatrix}
\xrightarrow{3\text{ ops}}
\begin{pmatrix} a_1 & b_1 & 0 & 0 \\ 0 & \tilde{a}_2 & b_2 & 0 \\ 0 & 0 & \tilde{a}_3 & b_3 \\ 0 & 0 & 0 & \tilde{a}_4 \end{pmatrix}.
\]
Total: 3n ops.

Banded Matrices, Cont.

The half bandwidth is m if \(a_{ij} = 0\) for \(|i - j| > m\). The half bandwidth of a tridiagonal matrix is 1.

Suppose the half bandwidth is m and we don't have to pivot. Use row one to eliminate in column one of rows 2 through m+1; the entries in column one of rows m+2 through n are already 0:
\[
\begin{pmatrix}
a_{11} & \cdots & a_{1,m+1} & & \\
a_{21} & \cdots & a_{2,m+1} & a_{2,m+2} & \\
\vdots & & & & \ddots \\
a_{m+1,1} & a_{m+1,2} & \cdots & & \\
 & \ddots & & & \ddots \\
 & & a_{n,n-m} & \cdots & a_{nn}
\end{pmatrix}.
\]

Banded Matrices, Cont.

At the first stage, we work on m rows and modify m+1 entries in each, so the work at the first stage is reduced from 2n(n-1) to 2m(m+1). Total work:
\[
\sum_{j=1}^{n-1} 2\min(m,\, n-j)\,\min(m+1,\, n-j+1) \approx 2m^2 n.
\]

Banded Matrices with Pivoting

Suppose we have a banded matrix but we do need to pivot?
\[
\begin{pmatrix}
a_1 & b_1 & 0 & 0 & 0 \\
c_1 & a_2 & b_2 & 0 & 0 \\
0 & c_2 & a_3 & b_3 & 0 \\
0 & 0 & c_3 & a_4 & b_4 \\
0 & 0 & 0 & c_4 & a_5
\end{pmatrix}
\xrightarrow{P_1}
\begin{pmatrix}
c_1 & a_2 & b_2 & 0 & 0 \\
a_1 & b_1 & 0 & 0 & 0 \\
0 & c_2 & a_3 & b_3 & 0 \\
0 & 0 & c_3 & a_4 & b_4 \\
0 & 0 & 0 & c_4 & a_5
\end{pmatrix}
\xrightarrow{L_1}
\begin{pmatrix}
c_1 & a_2 & b_2 & 0 & 0 \\
0 & * & * & 0 & 0 \\
0 & c_2 & a_3 & b_3 & 0 \\
0 & 0 & c_3 & a_4 & b_4 \\
0 & 0 & 0 & c_4 & a_5
\end{pmatrix}
\longrightarrow \cdots
\]
The bandwidth at most doubles.
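The slides' construction \(A = LDL^T = \tilde{L}\tilde{L}^T\) is equivalent to the standard column-by-column Cholesky algorithm. A minimal Python sketch of that algorithm, applied to the 3x3 example from the slides (the function name and test code are ours, for illustration only):

```python
import math

def cholesky(A):
    """Return lower-triangular L with A = L L^T, for SPD A.

    Minimal sketch of the standard algorithm; equivalent to forming
    L D L^T by elimination and taking L_tilde = L D^{1/2} as on the slides.
    """
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Diagonal entry: a_jj minus squares of entries already in row j.
        s = A[j][j] - sum(L[j][k] ** 2 for k in range(j))
        if s <= 0:
            raise ValueError("matrix is not positive definite")
        L[j][j] = math.sqrt(s)
        # Fill in column j below the diagonal.
        for i in range(j + 1, n):
            L[i][j] = (A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))) / L[j][j]
    return L

A = [[2.0, -1.0, 0.0],
     [-1.0, 2.0, -1.0],
     [0.0, -1.0, 2.0]]
L = cholesky(A)
# L[0][0] = sqrt(2), L[1][0] = -sqrt(2)/2, L[1][1] = sqrt(3/2), ...
# matching the L_tilde on the Cholesky slide.
```

Note that only \(\approx n^3/3\) multiplications are needed, half the cost of LU, since symmetry lets us compute only the lower triangle.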
Implementation Considerations for High Performance

- Memory references may be costly.
- Usually there is a memory hierarchy: larger storage such as main memory or disk is slower; smaller storage such as local cache is faster.
- When an arithmetic op is performed, operands are fetched from storage and loaded into registers, where the arithmetic is performed; the result is then stored back to memory.
- The more work that can be done with a word fetched from memory before storing the result, the better.

Consider GE code. To eliminate the (i,j)-entry at stage j: fetch the pivot row j and row i (i > j), operate on row i, and store the result. That is 3(n-j+1) memory accesses for 2(n-j+1) ops.

Basic Linear Algebra Subprograms (BLAS)

- BLAS1 (vector ops): \(\vec{y} \leftarrow a\vec{x} + \vec{y}\). Fetch \(\vec{x}\) and \(\vec{y}\) (2n memory accesses); 2n ops; store \(\vec{y}\) (n memory accesses). Ratio of ops to memory accesses: 2/3.
- BLAS2 (matrix-vector ops): \(\vec{y} \leftarrow A\vec{x} + \vec{y}\). Fetch \(\vec{x}\), \(\vec{y}\), and A (\(n^2 + 2n\) memory accesses); \(\sim 2n^2\) ops; store \(\vec{y}\) (n memory accesses). Ratio of ops to memory accesses: \(\sim 2\).
- BLAS3 (matrix-matrix ops): \(C \leftarrow AB + C\). Fetch A, B, and C (\(3n^2\) memory accesses); \(\sim 2n^3\) ops; store C (\(n^2\) memory accesses). Ratio of ops to memory accesses: \(\sim n/2\).

Block
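The op-to-memory-access ratios on the BLAS slide can be checked with a few lines of arithmetic. A small sketch (the function name is ours), showing why BLAS3 operations make far better use of each word fetched from memory:

```python
def blas_ratios(n):
    """Flop-to-memory-access ratios for the three BLAS levels,
    using the counts from the slides (n = vector/matrix dimension)."""
    blas1 = 2 * n / (3 * n)                # y <- a*x + y: 2n flops, 3n accesses
    blas2 = 2 * n**2 / (n**2 + 3 * n)      # y <- A*x + y: ~2n^2 flops, n^2 + 3n accesses
    blas3 = 2 * n**3 / (4 * n**2)          # C <- A*B + C: ~2n^3 flops, 4n^2 accesses
    return blas1, blas2, blas3

r1, r2, r3 = blas_ratios(1000)
# r1 is always 2/3; r2 approaches 2; r3 = n/2 grows with n,
# which is why high-performance codes are organized around BLAS3 kernels.
```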
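The O(n) elimination for tridiagonal systems counted on the banded-matrix slides can be sketched as a solver. This is an illustrative helper (the function name and argument convention are ours), assuming the symmetric tridiagonal form from the slides, with diagonal \(a_i\) and off-diagonal \(b_i\), and assuming no pivoting is needed (e.g. A is SPD):

```python
def solve_tridiag(a, b, rhs):
    """Solve A x = rhs for symmetric tridiagonal A, where
    A[i][i] = a[i] and A[i][i+1] = A[i+1][i] = b[i].

    Forward elimination costs ~3 ops per row, O(n) total,
    versus O(n^3) for dense Gaussian elimination.
    Sketch only: assumes no pivoting is required.
    """
    n = len(a)
    a, rhs = a[:], rhs[:]              # work on copies
    for i in range(1, n):              # eliminate subdiagonal entry b[i-1]
        m = b[i - 1] / a[i - 1]        # multiplier
        a[i] -= m * b[i - 1]           # diagonal becomes a~_i
        rhs[i] -= m * rhs[i - 1]
    x = [0.0] * n                      # back substitution
    x[n - 1] = rhs[n - 1] / a[n - 1]
    for i in range(n - 2, -1, -1):
        x[i] = (rhs[i] - b[i] * x[i + 1]) / a[i]
    return x

# The 4x4 example from the slides with a_i = 2, b_i = -1,
# and rhs chosen so that the exact solution is all ones:
x = solve_tridiag([2.0] * 4, [-1.0] * 3, [1.0, 0.0, 0.0, 1.0])
# x is approximately [1, 1, 1, 1]
```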