Rice CAAM 471 - Block LU Factorization


Block LU Factorization
Lecture 24
MA471 Fall 2003

Contents: Example Case; Problems; Divide and Conquer; Back to the Linear System; Constructing the Block LU Factorization (cont.); Summary First Stage; Now Factorize Lower SE Block; End Result; Matlab Version; Parallel Algorithm; Parallel Communication; Communication Summary; Upshot; Block Back Solve; Recall; Distributed Back Solve; Matlab Code; Back Solve; Barrier Between Back Solves; Example Code.

Example Case
1) Suppose we are faced with the solution of a linear system Ax = b.
2) Further suppose:
   1) A is large (dim(A) > 10,000)
   2) A is dense
   3) A is full
   4) We have a sequence of different b vectors.

Problems
• Suppose we are able to compute the matrix:
  - It costs N^2 doubles to store the matrix.
  - E.g. for N = 100,000 we require roughly 75 gigabytes of storage for the matrix alone (10^10 doubles at 8 bytes each).
  - 32-bit processors are limited to 4 gigabytes of memory.
  - Most desktops (even 64-bit) do not have that much memory.
  - What to do?

Divide and Conquer
(Figure: a 4-by-4 grid of processors P0 through P15.)
One approach is to assume we have a square number of processors. We then divide the matrix into blocks, storing one block per processor.

Back to the Linear System
• We are now faced with the LU factorization of a distributed matrix.
• This calls for a modified LU routine which acts on blocks of the matrix.
• We will demonstrate this algorithm for one level;
• i.e., we need to construct matrices L and U such that A = LU while storing only single blocks of A, L, and U on any processor.

Constructing the Block LU Factorization

  [ A00 A01 A02 ]   [ L00  0   0 ]   [ U00 U01 U02 ]
  [ A10 A11 A12 ] = [ L10  1   0 ] * [  0  ?11 ?12 ]
  [ A20 A21 A22 ]   [ L20  0   1 ]   [  0  ?21 ?22 ]

(Here 1 denotes an identity block and ? marks blocks still to be determined.) First we LU factorize A00 and look for the above block factorization. However, we need to figure out what each of the entries is:

A00 = L00*U00              (compute L00, U00 by LU factorization)
A01 = L00*U01        =>    U01 = L00\A01
A02 = L00*U02        =>    U02 = L00\A02
A10 = L10*U00        =>    L10 = A10/U00
A20 = L20*U00        =>    L20 = A20/U00
A11 = L10*U01 + ?11  =>    ?11 = A11 - L10*U01
...

cont
A12 = L10*U02 + ?12  =>    ?12 = A12 - L10*U02
A21 = L20*U01 + ?21  =>    ?21 = A21 - L20*U01
A22 = L20*U02 + ?22  =>    ?22 = A22 - L20*U02

In the general case:
Anm = Ln0*U0m + ?nm  =>    ?nm = Anm - Ln0*U0m

Summary First Stage

  [ A00 A01 A02 ]   [ L00  0   0 ]   [ U00 U01 U02 ]
  [ A10 A11 A12 ] = [ L10  1   0 ] * [  0  ?11 ?12 ]
  [ A20 A21 A22 ]   [ L20  0   1 ]   [  0  ?21 ?22 ]

First step:  LU factorize the uppermost diagonal block: A00 = L00*U00.
Second step: a) compute U0n = L00\A0n, n > 0
             b) compute Ln0 = An0/U00, n > 0
Third step:  compute ?nm = Anm - Ln0*U0m, (n, m > 0)

Now Factorize Lower SE Block

  [ ?11 ?12 ]   [ L11  0 ]   [ U11 U12 ]
  [ ?21 ?22 ] = [ L21  1 ] * [  0  ??22 ]

We repeat the previous algorithm, this time on the two-by-two southeast block.

End Result

  [ A00 A01 A02 ]   [ L00  0   0  ]   [ U00 U01 U02 ]
  [ A10 A11 A12 ] = [ L10 L11  0  ] * [  0  U11 U12 ]
  [ A20 A21 A22 ]   [ L20 L21 L22 ]   [  0   0  U22 ]

Matlab Version
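One possible serial Matlab sketch of the block factorization above (the course's own script is the blocklu.m linked under Example Code below; the block size nb, the cell-array storage, and the diagonally dominant test matrix here are assumptions made purely for illustration, and pivoting across blocks is not considered):

  nb = 100;                                   % block size (illustrative)
  A  = rand(3*nb) + 3*nb*eye(3*nb);           % diagonally dominant test matrix
  Ab = mat2cell(A, nb*[1 1 1], nb*[1 1 1]);   % Ab{n,m} holds one block of A
  L  = repmat({zeros(nb)}, 3, 3);
  U  = repmat({zeros(nb)}, 3, 3);

  for k = 1:3
    [L{k,k}, U{k,k}] = lu(Ab{k,k});           % first step: factorize diagonal block
    for m = k+1:3
      U{k,m} = L{k,k} \ Ab{k,m};              % second step a): U(k,m) = L(k,k)\A(k,m)
      L{m,k} = Ab{m,k} / U{k,k};              % second step b): L(m,k) = A(m,k)/U(k,k)
    end
    for n = k+1:3
      for m = k+1:3
        Ab{n,m} = Ab{n,m} - L{n,k}*U{k,m};    % third step: update trailing blocks
      end
    end
  end

  norm(cell2mat(L)*cell2mat(U) - A)           % should be at round-off level

Each pass of the k loop is the "first stage" recipe applied to the remaining southeast blocks, exactly as on the previous slides.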
Parallel Algorithm
Processor grid (one block per processor):
  P0 P1 P2
  P3 P4 P5
  P6 P7 P8
P0: A00 = L00*U00   (compute L00, U00 by LU factorization)
P1: U01 = L00\A01
P2: U02 = L00\A02
P3: L10 = A10/U00
P6: L20 = A20/U00
P4: A11 <- A11 - L10*U01
P5: A12 <- A12 - L10*U02
P7: A21 <- A21 - L20*U01
P8: A22 <- A22 - L20*U02
In the general case: Anm <- Anm - Ln0*U0m.

Parallel Communication
(Figure: the same 3-by-3 processor grid; after the first stage the processors hold
  L00,U00 | U01 | U02
  L10     | A11 | A12
  L20     | A21 | A22
where A11, A12, A21, A22 are now the updated blocks.)
The operations are those of the previous slide; each processor needs blocks computed elsewhere, so the first stage requires communication.

Communication Summary
P0: sends L00 to P1, P2 and sends U00 to P3, P6
P1: sends U01 to P4, P7
P2: sends U02 to P5, P8
P3: sends L10 to P4, P5
P6: sends L20 to P7, P8

Upshot
Notes:
1) I added an MPI_Barrier purely to separate the LU factorization and the backsolve.
2) In terms of efficiency we can see that quite a bit of time is spent in MPI_Wait compared to compute time.
3) The compute part of this code can be optimized much more, making the parallel efficiency even worse.

Communication pattern (first stage: a-e; second stage: f-h):
(a) P0: sends L00 to P1, P2 and sends U00 to P3, P6
(b) P1: sends U01 to P4, P7
(c) P2: sends U02 to P5, P8
(d) P3: sends L10 to P4, P5
(e) P6: sends L20 to P7, P8
(f) P4: sends L11 to P5 and sends U11 to P7
(g) P5: sends U12 to P8
(h) P7: sends L21 to P8

Block Back Solve
After factorization we are left with the task of using the distributed L and U to compute the backsolve. Block distribution of L and U over the processor grid:
  L00,U00 | U01     | U02
  L10     | L11,U11 | U12
  L20     | L21     | L22,U22

Recall
• Given an LU factorization of A, namely L and U such that A = LU,
• we can solve Ax = b by
• y = L\b
• x = U\y

Distributed Back Solve

  [ L00  0   0  ] [ y0 ]   [ b0 ]
  [ L10 L11  0  ] [ y1 ] = [ b1 ]
  [ L20 L21 L22 ] [ y2 ]   [ b2 ]

P0: solve L00*y0 = b0; send y0 to P3, P6
P3: send L10*y0 to P4
P4: solve L11*y1 = b1 - L10*y0; send y1 to P7
P6: send L20*y0 to P8
P7: send L21*y1 to P8
P8: solve L22*y2 = b2 - L20*y0 - L21*y1
Results: y0 on P0, y1 on P4, y2 on P8.

Matlab Code

Back Solve
After the factorization we computed a solution to Ax = b. This consists of two distributed block triangular systems to solve; a serial sketch of these two solves is given below, after the Example Code link.

Barrier Between Back Solves
This time I inserted an MPI_Barrier call between the backsolves. This highlights the serial nature of the backsolves.

Example Code
http://www.math.unm.edu/~timwar/MA471F03/blocklu.m
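For completeness, a serial Matlab sketch of the two block-triangular solves described above, continuing from the nb, A, L, U variables of the factorization sketch given under Matlab Version (again an illustration, not the linked blocklu.m):

  b  = rand(3*nb, 1);
  bb = mat2cell(b, nb*[1 1 1], 1);     % block right-hand side

  y = cell(3,1);                       % forward solve:  L*y = b
  for n = 1:3
    rhs = bb{n};
    for m = 1:n-1
      rhs = rhs - L{n,m}*y{m};         % subtract contributions of known y blocks
    end
    y{n} = L{n,n} \ rhs;
  end

  x = cell(3,1);                       % back solve:  U*x = y
  for n = 3:-1:1
    rhs = y{n};
    for m = n+1:3
      rhs = rhs - U{n,m}*x{m};
    end
    x{n} = U{n,n} \ rhs;
  end

  norm(A*cell2mat(x) - b)              % should be at round-off level

In the distributed version each product such as L{n,m}*y{m} is formed on the processor that owns that block and then sent on, which is exactly the message pattern listed on the Distributed Back Solve slide.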

