CS 267: Dense Linear Algebra: Parallel Matrix Multiplication
James Demmel
www.cs.berkeley.edu/~demmel/cs267_Spr06
02/09/2006, CS267 Lecture 8

Outline
- Recall BLAS = Basic Linear Algebra Subroutines
- Matrix-vector multiplication in parallel
- Matrix-matrix multiplication in parallel

Review of the BLAS
- Building blocks for all linear algebra
- Parallel versions call serial versions on each processor, so they must be fast!
- Recall q = #flops / #memory refs: the larger q is, the faster the algorithm can go in the presence of a memory hierarchy
- "axpy": y = alpha*x + y, where alpha is a scalar and x and y are vectors

    BLAS level   Example               #mem refs   #flops   q
    1            "axpy", dot product   3n          2n       2/3
    2            matrix-vector mult    n^2         2n^2     2
    3            matrix-matrix mult    4n^2        2n^3     n/2

Different Parallel Data Layouts for Matrices
1) 1D column blocked layout: each processor owns one contiguous block of columns
2) 1D column cyclic layout: columns are dealt out to processors round-robin (0 1 2 3 0 1 2 3 ...)
3) 1D column block cyclic layout: blocks of b columns are dealt out round-robin
4) Row versions of the previous layouts
5) 2D row and column blocked layout: processors form a grid; each owns one contiguous submatrix
6) 2D row and column block cyclic layout: generalizes the others

Parallel Matrix-Vector Product
- Compute y = y + A*x, where A is a dense matrix
- Layout: 1D row blocked
- A(i) refers to the n/p by n block row that processor i owns; x(i) and y(i) similarly refer to the segments of x and y owned by processor i
  [Figure: A, x, and y partitioned by block rows across processors P0-P3]
- Algorithm:
    Foreach processor i
      Broadcast x(i)
      Compute y(i) = y(i) + A(i)*x
- The algorithm uses the formula
    y(i) = y(i) + A(i)*x = y(i) + sum over j of A(i,j)*x(j)

Matrix-Vector Product y = y + A*x
- A column layout of the matrix eliminates the broadcast of x
  - But adds a reduction to update the destination y
- A 2D blocked layout uses both a broadcast and a reduction, each on a subset of processors
  - sqrt(p) processors for a square processor grid
  [Figure: 2D blocked layout of A across processors P0-P15]
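The 1D row-blocked matrix-vector product above can be sketched as a serial simulation in Python. This is an illustrative sketch, not code from the slides: the broadcast of x is modeled by every "processor" reading the assembled full vector, and the function name and loop structure are my own.

```python
# Serial simulation of the 1D row-blocked parallel matrix-vector product
# y = y + A*x. Each "processor" i owns block row A(i) and segments x(i), y(i).
# Matrices are plain lists of lists; there is no real message passing here.

def matvec_1d_row_blocked(A, x, y, p):
    n = len(A)
    nb = n // p                      # rows per processor (assume p divides n)
    # "Broadcast x(i)": every processor ends up with the full vector x
    x_full = [x[j] for j in range(n)]
    for i in range(p):               # Foreach processor i
        # Compute y(i) = y(i) + A(i)*x on block row i
        for r in range(i * nb, (i + 1) * nb):
            y[r] += sum(A[r][c] * x_full[c] for c in range(n))
    return y
```

In a real distributed implementation each processor would hold only A(i) and x(i) and obtain the rest of x via a collective such as MPI_Allgather.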
Parallel Matrix Multiply
- Computing C = C + A*B
- Using the basic algorithm: 2*n^3 flops
- Variables are:
  - Data layout
  - Topology of machine
  - Scheduling communication
- Use of performance models for algorithm design
  - Message time = "latency" + #words * time-per-word = alpha + n*beta
- Efficiency (in any model):
  - serial time / (p * parallel time)
  - perfect (linear) speedup <=> efficiency = 1

Matrix Multiply with 1D Column Layout
- Assume matrices are n x n and n is divisible by p
  - May be a reasonable assumption for analysis, not for code
- A(i) refers to the n by n/p block column that processor i owns (similarly for B(i) and C(i))
- B(i,j) is the n/p by n/p sublock of B(i)
  - in rows j*n/p through (j+1)*n/p - 1
- Algorithm uses the formula
    C(i) = C(i) + A*B(i) = C(i) + sum over j of A(j)*B(j,i)

Matrix Multiply: 1D Layout on Bus or Ring
- Algorithm uses the formula
    C(i) = C(i) + A*B(i) = C(i) + sum over j of A(j)*B(j,i)
- First, consider a bus-connected machine without broadcast: only one pair of processors can communicate at a time (e.g., ethernet)
- Second, consider a machine with processors on a ring: all processors may communicate with nearest neighbors simultaneously

MatMul: 1D Layout on Bus without Broadcast
Naive algorithm:
    C(myproc) = C(myproc) + A(myproc)*B(myproc, myproc)
    for i = 0 to p-1
      for j = 0 to p-1 except i
        if (myproc == i) send A(i) to processor j
        if (myproc == j)
          receive A(i) from processor i
          C(myproc) = C(myproc) + A(i)*B(i, myproc)
      barrier
Cost of inner loop:
    computation: 2*n*(n/p)^2 = 2*n^3/p^2
    communication: alpha + beta*n^2/p

Naive MatMul (continued)
- Cost of inner loop (approximately):
    computation: 2*n*(n/p)^2 = 2*n^3/p^2
    communication: alpha + beta*n^2/p
- Only one pair of processors (i and j) is active on any iteration, and of those, only i is doing computation => the algorithm is almost entirely serial
- Running time:
    = (p*(p-1) + 1) * computation + p*(p-1) * communication
    ≈ 2*n^3 + p^2*alpha + p*n^2*beta
- This is worse than the serial time and grows with p
- Why might you still want to do this?
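The running-time estimate above can be turned into a small cost model. This is a sketch using the slide's approximation; the time-per-flop t_flop and the sample values of alpha and beta in the test are illustrative, not measured on any machine.

```python
# Cost model for the naive bus algorithm vs. serial matmul, in seconds:
#   T_serial = 2*n^3 * t_flop
#   T_naive  = (p*(p-1) + 1) * (2*n^3/p^2) * t_flop
#            + p*(p-1) * (alpha + beta*n^2/p)

def serial_time(n, t_flop):
    # 2*n^3 flops done by one processor
    return 2 * n**3 * t_flop

def naive_bus_time(n, p, t_flop, alpha, beta):
    # Iterations are serialized: one block multiply and one message each
    comp = (p * (p - 1) + 1) * (2 * n**3 / p**2) * t_flop
    comm = p * (p - 1) * (alpha + beta * n**2 / p)
    return comp + comm
```

With illustrative parameters such as n = 1024, t_flop = 1e-9, alpha = 1e-5, beta = 1e-8, the naive bus time at p = 16 already exceeds the serial time, and it keeps growing as p increases, matching the slide's conclusion.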
Matmul for 1D Layout on a Processor Ring
- Pairs of adjacent processors can communicate simultaneously

    Copy A(myproc) into Tmp
    C(myproc) = C(myproc) + Tmp*B(myproc, myproc)
    for j = 1 to p-1
      Send Tmp to processor myproc+1 mod p
      Receive Tmp from processor myproc-1 mod p
      C(myproc) = C(myproc) + Tmp*B((myproc - j) mod p, myproc)

- Same idea as for gravity in the simple sharks-and-fish algorithm
- May want double buffering in practice for overlap
- Deadlock details are ignored in the code above
- Time of inner loop = 2*(alpha + beta*n^2/p) + 2*n*(n/p)^2

Matmul for 1D Layout on a Processor Ring (continued)
- Time of inner loop = 2*(alpha + beta*n^2/p) + 2*n*(n/p)^2
- Total Time = 2*n*(n/p)^2 + (p-1) * Time of inner loop
             ≈ 2*n^3/p + 2*p*alpha + 2*beta*n^2
- (Nearly) optimal for a 1D layout on a ring or bus, even with broadcast:
  - Perfect speedup for arithmetic
  - A(myproc) must move to each other processor; this costs at least (p-1) * cost of sending n*(n/p) words
- Parallel Efficiency = 2*n^3 / (p * Total Time)
    = 1 / (1 + alpha*p^2/(2*n^3) + beta*p/(2*n))
    = 1 / (1 + O(p/n))
- Grows to 1 as n/p increases (or as alpha and beta shrink)

MatMul with 2D Layout
- Consider processors in a 2D grid (physical or logical)
- Processors can communicate with 4 nearest neighbors
  - Broadcast along rows and columns
  [Figure: three stages of an s x s processor grid, p(0,0) through p(2,2)]
- Assume the p processors form a square s x s grid

Cannon's Algorithm
- C(i,j) = C…
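The 1D ring algorithm analyzed above can be checked with a serial simulation. This is an illustrative sketch, not slide code: lists of lists stand in for distributed block columns, and the ring shift of Tmp is modeled by each processor indexing block column (myproc - j) mod p directly.

```python
# Serial simulation of matmul with 1D column layout on a ring:
# C(i) = C(i) + sum over j of A(j)*B(j,i). Block column A(myproc) circulates
# around the ring; after j shifts, processor i holds A((i - j) mod p).

def matmul_1d_ring(A, B, p):
    n = len(A)
    nb = n // p                                   # columns per processor
    C = [[0.0] * n for _ in range(n)]

    def add_block_product(owner, src):
        # C(owner) += A(src) * B(src, owner), in global index space
        for r in range(n):
            for c in range(owner * nb, (owner + 1) * nb):
                for k in range(src * nb, (src + 1) * nb):
                    C[r][c] += A[r][k] * B[k][c]

    for i in range(p):                            # local step: Tmp = A(myproc)
        add_block_product(i, i)
    for j in range(1, p):                         # j-th ring shift of Tmp
        for i in range(p):
            add_block_product(i, (i - j) % p)
    return C
```

Each outer iteration j corresponds to one ring shift of Tmp; after p-1 shifts every processor has multiplied against every block column of A, so the result equals the ordinary matrix product.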