Berkeley COMPSCI C267 - Dense Linear Algebra: Parallel Matrix Multiplication

CS 267: Dense Linear Algebra: Parallel Matrix Multiplication
James Demmel
www.cs.berkeley.edu/~demmel/cs267_Spr06
02/09/2006, CS267 Lecture 8

Outline
- Recall BLAS = Basic Linear Algebra Subroutines
- Matrix-vector multiplication in parallel
- Matrix-matrix multiplication in parallel

Review of the BLAS
- Building blocks for all linear algebra
- Parallel versions call serial versions on each processor, so they must be fast
- Recall q = #flops / #mem refs: the larger q is, the faster the algorithm can run in the presence of a memory hierarchy
- axpy: y = α*x + y, where α is a scalar and x and y are vectors

      BLAS level   Example              #mem refs   #flops   q
      1            axpy, dot product    3n          2n       2/3
      2            matrix-vector mult   n^2         2n^2     2
      3            matrix-matrix mult   4n^2        2n^3     n/2

Different Parallel Data Layouts for Matrices
[Figure: diagrams of the processor assignments for each of the six layouts]
1) 1D column blocked layout
2) 1D column cyclic layout
3) 1D column block cyclic layout (block size b)
4) Row versions of the previous layouts
5) 2D row and column blocked layout
6) 2D row and column block cyclic layout (generalizes the others)

Parallel Matrix-Vector Product
- Compute y = y + A*x, where A is a dense matrix
- Layout: 1D row blocked
- A(i) refers to the n-by-n/p block row that processor i owns; x(i) and y(i) similarly refer to the segments of x and y owned by processor i
[Figure: y, A, and x partitioned by block row across processors P0 through P3]
- Algorithm:
      for each processor i
          broadcast x(i)
          compute y(i) = A(i)*x
- The algorithm uses the formula
      y(i) = y(i) + A(i)*x = y(i) + Σ_j A(i,j)*x(j)

Matrix-Vector Product y = y + A*x
- A column layout of the matrix eliminates the broadcast of x, but adds a reduction to update the destination y
- A 2D blocked layout uses a broadcast and a reduction, both on a subset of processors (sqrt(p) of them for a square processor grid)
[Figure: 1D column layout on P0 through P3, and a 4-by-4 2D blocked layout on P0 through P15]

Parallel Matrix Multiply
- Computing C = C + A*B
- Using the basic algorithm: 2*n^3 flops
- Variables are:
  - data layout
  - topology of machine
  - scheduling of communication
- Use of performance models for algorithm design:
      message time = "latency" + #words * time-per-word = α + n*β
- Efficiency (in any model):
      efficiency = serial time / (p * parallel time)
      perfect (linear) speedup <=> efficiency = 1

Matrix Multiply with 1D Column Layout
- Assume matrices are n x n and n is divisible by p (may be a reasonable assumption for analysis, not for code)
[Figure: matrix split into p = 8 block columns, owned by p0 through p7]
- A(i) refers to the n-by-n/p block column that processor i owns (similarly for B(i) and C(i))
- B(i,j) is the n/p-by-n/p sub-block of B(j) in rows i*n/p through (i+1)*n/p
- The algorithm uses the formula
      C(i) = C(i) + A*B(i) = C(i) + Σ_j A(j)*B(j,i)

Matrix Multiply: 1D Layout on a Bus or Ring
- The algorithm uses the formula C(i) = C(i) + A*B(i) = C(i) + Σ_j A(j)*B(j,i)
- First, consider a bus-connected machine without broadcast: only one pair of processors can communicate at a time (e.g., Ethernet)
- Second, consider a machine with processors on a ring: all processors may communicate with their nearest neighbors simultaneously

MatMul: 1D Layout on a Bus without Broadcast
Naïve algorithm:

      C(myproc) = C(myproc) + A(myproc)*B(myproc,myproc)
      for i = 0 to p-1
          for j = 0 to p-1 except i
              if (myproc == i) send A(i) to processor j
              if (myproc == j)
                  receive A(i) from processor i
                  C(myproc) = C(myproc) + A(i)*B(i,myproc)
              barrier

Cost of inner loop:
      computation:   2*n*(n/p)^2 = 2*n^3/p^2
      communication: α + β*n^2/p
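The slides give this algorithm only as pseudocode. A minimal C/MPI sketch of it might look like the following; everything beyond the pseudocode is an assumption, including column-major storage with p dividing n, a naive triple loop standing in for the serial kernel (a tuned BLAS dgemm in practice), and the hypothetical names naive_bus_matmul and local_matmul.

      /* Naive 1D-column-layout matmul on a bus: processor i takes a turn
         sending its block column A(i) to each other processor j, which
         accumulates A(i)*B(i,myproc) into its block column C(myproc).
         Assumptions (not fixed by the slides): column-major storage,
         p divides n; the kernel below stands in for BLAS dgemm. */
      #include <mpi.h>
      #include <stdlib.h>

      /* Serial kernel: C += A * Bsub, where A and C are n x nb and Bsub is
         an nb x nb sub-block viewed inside an array of leading dimension n. */
      static void local_matmul(int n, int nb, const double *A,
                               const double *Bsub, double *C)
      {
          for (int j = 0; j < nb; j++)
              for (int k = 0; k < nb; k++)
                  for (int i = 0; i < n; i++)
                      C[i + (size_t)j * n] += A[i + (size_t)k * n]
                                            * Bsub[k + (size_t)j * n];
      }

      /* Ablk, Bblk, Cblk: this processor's n x (n/p) block columns. */
      void naive_bus_matmul(int n, const double *Ablk, const double *Bblk,
                            double *Cblk, MPI_Comm comm)
      {
          int p, myproc;
          MPI_Comm_size(comm, &p);
          MPI_Comm_rank(comm, &myproc);
          int nb = n / p;
          double *Atmp = malloc((size_t)n * nb * sizeof(double));

          /* C(myproc) += A(myproc) * B(myproc,myproc) */
          local_matmul(n, nb, Ablk, Bblk + (size_t)myproc * nb, Cblk);

          for (int i = 0; i < p; i++)
              for (int j = 0; j < p; j++) {
                  if (j == i) continue;
                  if (myproc == i)      /* owner of A(i) sends it to j */
                      MPI_Send(Ablk, n * nb, MPI_DOUBLE, j, 0, comm);
                  if (myproc == j) {    /* j receives A(i), then computes */
                      MPI_Recv(Atmp, n * nb, MPI_DOUBLE, i, 0, comm,
                               MPI_STATUS_IGNORE);
                      /* C(myproc) += A(i) * B(i,myproc) */
                      local_matmul(n, nb, Atmp, Bblk + (size_t)i * nb, Cblk);
                  }
                  MPI_Barrier(comm);    /* matches the slide's barrier */
              }
          free(Atmp);
      }

The barrier after every (i, j) pair is exactly what serializes the algorithm, which the cost analysis on the next slide quantifies.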
Naïve MatMul (continued)
- Cost of inner loop:
      computation:   2*n*(n/p)^2 = 2*n^3/p^2
      communication: α + β*n^2/p (approximately)
- Only one pair of processors (i and j) is active on any iteration, and of those, only i is doing computation, so the algorithm is almost entirely serial
- Running time:
      (p*(p-1) + 1)*computation + p*(p-1)*communication
      ≈ 2*n^3 + p^2*α + p*n^2*β
- This is worse than the serial time, and it grows with p
- Why might you still want to do this?

MatMul for 1D Layout on a Processor Ring
- Pairs of processors can communicate simultaneously

      copy A(myproc) into Tmp
      C(myproc) = C(myproc) + Tmp*B(myproc, myproc)
      for j = 1 to p-1
          send Tmp to processor (myproc + 1) mod p
          receive Tmp from processor (myproc - 1) mod p
          C(myproc) = C(myproc) + Tmp*B((myproc - j) mod p, myproc)

- Same idea as for gravity in the simple Sharks and Fish algorithm
- May want double buffering in practice, for overlap of communication and computation
- The code above ignores deadlock details (the sketch at the end of this preview handles them with MPI_Sendrecv_replace)
- Time of inner loop = 2*(α + β*n^2/p) + 2*n*(n/p)^2

MatMul for 1D Layout on a Processor Ring (continued)
- Time of inner loop = 2*(α + β*n^2/p) + 2*n*(n/p)^2
- Total time = 2*n*(n/p)^2 + (p-1) * (time of inner loop)
             ≈ 2*n^3/p + 2*p*α + 2*β*n^2
- Optimal for a 1D layout on a ring or bus, even with broadcast:
  - Perfect speedup for the arithmetic
  - A(myproc) must move to each other processor; this costs at least (p-1) * (cost of sending n*(n/p) words)
- Parallel efficiency = 2*n^3 / (p * total time)
                      = 1 / (1 + α*p^2/(2*n^3) + β*p/(2*n))
                      = 1 / (1 + O(p/n))
- Efficiency grows to 1 as n/p increases (or as α and β shrink)

MatMul with 2D Layout
- Consider processors in a 2D grid (physical or logical)
- Processors can communicate with their 4 nearest neighbors
- Broadcast along rows and columns
[Figure: three 3-by-3 processor grids, p(0,0) through p(2,2), one each for C, A, and B]
- Assume the p processors form a square s x s grid

Cannon's Algorithm
      C(i,j) = C(i,j) + …
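To make the 1D-ring pseudocode above concrete, here is a companion C/MPI sketch, again under assumptions the slides don't fix: column-major storage, p dividing n, the same naive serial kernel in place of dgemm, and the hypothetical name ring_matmul. MPI_Sendrecv_replace performs the paired send/receive shift without the deadlock details the slides ignore; no double buffering is attempted.

      /* 1D-ring matmul: in every step, each processor multiplies the block
         column of A currently held in Tmp by the matching sub-block of its
         own block column of B, then shifts Tmp one processor rightward. */
      #include <mpi.h>
      #include <stdlib.h>
      #include <string.h>

      /* Serial kernel: C += A * Bsub (same conventions as the bus sketch). */
      static void local_matmul(int n, int nb, const double *A,
                               const double *Bsub, double *C)
      {
          for (int j = 0; j < nb; j++)
              for (int k = 0; k < nb; k++)
                  for (int i = 0; i < n; i++)
                      C[i + (size_t)j * n] += A[i + (size_t)k * n]
                                            * Bsub[k + (size_t)j * n];
      }

      void ring_matmul(int n, const double *Ablk, const double *Bblk,
                       double *Cblk, MPI_Comm comm)
      {
          int p, myproc;
          MPI_Comm_size(comm, &p);
          MPI_Comm_rank(comm, &myproc);
          int nb    = n / p;
          int right = (myproc + 1) % p;      /* destination of each shift */
          int left  = (myproc - 1 + p) % p;  /* source of each shift      */

          /* Copy A(myproc) into Tmp; C(myproc) += Tmp * B(myproc,myproc). */
          double *Tmp = malloc((size_t)n * nb * sizeof(double));
          memcpy(Tmp, Ablk, (size_t)n * nb * sizeof(double));
          local_matmul(n, nb, Tmp, Bblk + (size_t)myproc * nb, Cblk);

          for (int j = 1; j < p; j++) {
              /* Send Tmp right and receive the next block from the left in
                 one call, avoiding the send/receive ordering deadlock.    */
              MPI_Sendrecv_replace(Tmp, n * nb, MPI_DOUBLE, right, 0,
                                   left, 0, comm, MPI_STATUS_IGNORE);
              /* Tmp now holds A((myproc - j) mod p), so accumulate
                 C(myproc) += Tmp * B((myproc - j) mod p, myproc).         */
              int src = (myproc - j + p) % p;
              local_matmul(n, nb, Tmp, Bblk + (size_t)src * nb, Cblk);
          }
          free(Tmp);
      }

Per the analysis above, each of the p-1 iterations costs 2*n*(n/p)^2 flops plus one shift of n*(n/p) words, so the arithmetic gets perfect speedup while the total communication cost stays at about 2*p*α + 2*β*n^2.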

