Berkeley COMPSCI C267 - Dense Linear Algebra: Parallel Matrix Multiplication

CS 267
Dense Linear Algebra: Parallel Matrix Multiplication
James Demmel
www.cs.berkeley.edu/~demmel/cs267_Spr07
02/21/2007

Contents: Review of the BLAS • Different Parallel Data Layouts for Matrices • Parallel Matrix-Vector Product • Matrix-Vector Product y = y + A*x • Parallel Matrix Multiply • Matrix Multiply with 1D Column Layout • Matrix Multiply: 1D Layout on Bus or Ring • MatMul: 1D Layout on Bus without Broadcast • Naïve MatMul (continued) • Matmul for 1D Layout on a Processor Ring • MatMul with 2D Layout • Cannon's Algorithm • Cannon's Matrix Multiplication • Initial Step to Skew Matrices in Cannon • Shifting Steps in Cannon • Cost of Cannon's Algorithm • Pros and Cons of Cannon • SUMMA Algorithm • SUMMA • SUMMA Performance • Recursive Layouts • Summary of Parallel Matrix Multiplication • Extra slides: Gaussian Elimination, Gaussian Elimination via a Recursive Algorithm, Recursive Factorizations, Review: BLAS 3 (Blocked) GEPP, A Small Software Project, Work-Depth Model of Parallelism, Latency-Bandwidth Model, Skewing Steps in Cannon

Outline
•Recall BLAS = Basic Linear Algebra Subroutines
•Matrix-vector multiplication in parallel
•Matrix-matrix multiplication in parallel

Review of the BLAS
•Building blocks for all linear algebra
•Parallel versions call serial versions on each processor, so they must be fast!
•Recall q = # flops / # mem refs; the larger q is, the faster the algorithm can go in the presence of a memory hierarchy
•"axpy": y = α*x + y, where α is a scalar and x, y are vectors

  BLAS level   Example               # mem refs   # flops   q
  1            "axpy", dot product   3n           2n        2/3
  2            matrix-vector mult    n²           2n²       2
  3            matrix-matrix mult    4n²          2n³       n/2
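As a concrete instance of the level-1 row of this table, here is a minimal C sketch of axpy (the name and interface are simplified for illustration; the reference BLAS routine is daxpy and also takes stride arguments). Each iteration does 2 flops against 3 memory references, which is where q = 2n/3n = 2/3 comes from:

  /* axpy: y = alpha*x + y (BLAS level 1).
   * Per element: 2 flops (multiply, add) and 3 memory refs
   * (load x[i], load y[i], store y[i]), so q = 2n/3n = 2/3. */
  void axpy(int n, double alpha, const double *x, double *y) {
      for (int i = 0; i < n; i++)
          y[i] = alpha * x[i] + y[i];
  }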
Different Parallel Data Layouts for Matrices
•Why? Want parallelism within submatrices.
•[Figure: six example layouts on four processors: 1) 1D column blocked; 2) 1D column cyclic; 3) 1D column block cyclic with block size b (generalizes the others); 4) row versions of the previous layouts; 5) 2D row and column blocked; 6) 2D row and column block cyclic]

Parallel Matrix-Vector Product
•Compute y = y + A*x, where A is a dense matrix
•Layout: 1D row blocked
 •A(i) refers to the n/p-by-n block row that processor i owns
 •x(i) and y(i) similarly refer to the segments of x and y owned by i
•Algorithm:
  for each processor i
    Broadcast x(i)
    Compute y(i) = A(i)*x
•Algorithm uses the formula y(i) = y(i) + A(i)*x = y(i) + Σ_j A(i,j)*x(j)
•[Figure: the segments of x and y and the block rows of A distributed over processors P0-P3]
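A minimal MPI sketch of this 1D row-blocked algorithm, under the same assumption that p divides n; the function name and buffer layout are illustrative, not from the slides. The "Broadcast x(i)" step becomes an MPI_Allgather that assembles the full x on every rank before the local multiply:

  #include <mpi.h>

  /* y(i) = y(i) + A(i)*x, 1D row-blocked layout (sketch).
   * A_local: the n/p-by-n block row owned by this rank (row-major).
   * x_local, y_local: this rank's n/p-long segments of x and y.
   * x_full: scratch buffer of length n. Assumes p divides n. */
  void matvec_1d(int n, const double *A_local, const double *x_local,
                 double *y_local, double *x_full, MPI_Comm comm) {
      int p;
      MPI_Comm_size(comm, &p);
      int nloc = n / p;

      /* "Broadcast x(i)": every rank contributes its segment,
         every rank receives the assembled vector x. */
      MPI_Allgather(x_local, nloc, MPI_DOUBLE,
                    x_full, nloc, MPI_DOUBLE, comm);

      /* Local block row times full vector: y(i) += A(i)*x. */
      for (int i = 0; i < nloc; i++) {
          double s = 0.0;
          for (int j = 0; j < n; j++)
              s += A_local[(size_t)i * n + j] * x_full[j];
          y_local[i] += s;
      }
  }

The 2D-layout variant described on the next slide replaces this allgather with a broadcast of x(j) down each processor column and a reduction of the partial products across each processor row (e.g., using row and column communicators obtained from MPI_Comm_split).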
Matrix-Vector Product y = y + A*x
•A column layout of the matrix eliminates the broadcast of x, but adds a reduction to update the destination y
•A 2D blocked layout uses both: a broadcast (of x(j) in a processor column) and a reduction (of A(i,j)*x(j) in a processor row); sqrt(p) by sqrt(p) for a square processor grid
•It may be useful to have x(i) and y(i) on the same processor
•[Figure: a 4-by-4 grid of processors P0-P15, with the vector laid out either along the first processor column (P0, P4, P8, P12) or along the diagonal (P0, P5, P10, P15)]

Parallel Matrix Multiply
•Computing C = C + A*B
•Using basic algorithm: 2*n³ flops
•Variables are:
 •Data layout
 •Topology of machine
 •Scheduling communication
•Use of performance models for algorithm design
 •Message Time = "latency" + #words * time-per-word = α + n*β
 •α and β measured in flop times (i.e., time per flop = 1)
•Efficiency (in any model): serial time / (p * parallel time)
 •Perfect (linear) speedup ⇔ efficiency = 1

Matrix Multiply with 1D Column Layout
•Assume matrices are n x n and n is divisible by p (may be a reasonable assumption for analysis, not for code)
•A(i) refers to the n-by-n/p block column that processor i owns (similarly for B(i) and C(i))
•B(i,j) is the n/p-by-n/p subblock of B(i) in rows j*n/p through (j+1)*n/p - 1
•Algorithm uses the formula C(i) = C(i) + A*B(i) = C(i) + Σ_j A(j)*B(j,i)
•[Figure: the block columns of a matrix owned by processors p0 ... p7]

Matrix Multiply: 1D Layout on Bus or Ring
•Algorithm uses the formula C(i) = C(i) + A*B(i) = C(i) + Σ_j A(j)*B(j,i)
•First, consider a bus-connected machine without broadcast: only one pair of processors can communicate at a time (e.g., Ethernet)
•Second, consider a machine with processors on a ring: all processors may communicate with nearest neighbors simultaneously

MatMul: 1D Layout on Bus without Broadcast
Naïve algorithm:

  C(myproc) = C(myproc) + A(myproc)*B(myproc,myproc)
  for i = 0 to p-1                    ... for each block column A(i) of A
    for j = 0 to p-1 except i         ... for every processor not having A(i)
      if (myproc == i) send A(i) to processor j
      if (myproc == j)
        receive A(i) from processor i
        C(myproc) = C(myproc) + A(i)*B(i,myproc)
      barrier

Cost of inner loop:
  computation: 2*n*(n/p)² = 2*n³/p²
  communication: α + β*n²/p

Naïve MatMul (continued)
Cost of inner loop (approximately):
  computation: 2*n*(n/p)² = 2*n³/p²
  communication: α + β*n²/p
Only one pair of processors (i and j) is active on any iteration, and of those two, only i is doing computation => the algorithm is almost entirely serial.
Running time = (p*(p-1) + 1)*computation + p*(p-1)*communication
             ≈ 2*n³ + p²*α + p*n²*β
This is worse than the serial time, and it grows with p. Why might you still want to do this?

Matmul for 1D Layout on a Processor Ring
•Pairs of processors can communicate simultaneously

  Copy A(myproc) into Tmp
  C(myproc) = C(myproc) + Tmp*B(myproc, myproc)
  for j = 1 to p-1
    Send Tmp to processor myproc+1 mod p
    Receive Tmp from processor myproc-1 mod p
    C(myproc) = C(myproc) + Tmp*B(myproc-j mod p, myproc)
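A minimal MPI sketch of this ring algorithm, assuming the 1D column layout from the preceding slides: each rank owns n-by-n/p column-major block columns A_local, B_local, C_local, and p divides n. The names and the naive triple loop are illustrative; a real implementation would call a tuned dgemm for the local multiply. MPI_Sendrecv_replace performs the simultaneous nearest-neighbor shift of Tmp:

  #include <mpi.h>
  #include <string.h>

  /* C(me) += sum_j A(j)*B(j,me) on a ring (sketch).
   * A_local/B_local/C_local: this rank's n-by-(n/p) block columns,
   * column-major. tmp: scratch of n*(n/p) doubles. p divides n. */
  void matmul_ring(int n, const double *A_local, const double *B_local,
                   double *C_local, double *tmp, MPI_Comm comm) {
      int p, me;
      MPI_Comm_size(comm, &p);
      MPI_Comm_rank(comm, &me);
      int nb = n / p;                       /* block-column width */
      int next = (me + 1) % p, prev = (me + p - 1) % p;

      memcpy(tmp, A_local, sizeof(double) * (size_t)n * nb);

      for (int j = 0; j < p; j++) {
          /* tmp currently holds A(src), the block column that
             started on rank src = (me - j) mod p. */
          int src = (me - j + p) % p;
          /* C(me) += A(src)*B(src,me); B(src,me) is the nb-by-nb
             block of B_local starting at row src*nb. */
          for (int c = 0; c < nb; c++)        /* column of C(me)  */
              for (int k = 0; k < nb; k++) {  /* column of A(src) */
                  double b = B_local[(size_t)c * n + src * nb + k];
                  for (int r = 0; r < n; r++)
                      C_local[(size_t)c * n + r] += tmp[(size_t)k * n + r] * b;
              }
          /* Shift Tmp around the ring (skipped after the last step). */
          if (j < p - 1)
              MPI_Sendrecv_replace(tmp, n * nb, MPI_DOUBLE, next, 0,
                                   prev, 0, comm, MPI_STATUS_IGNORE);
      }
  }

Each rank now performs 2*n³/p flops, and the p-1 shifts each move n²/p words, so the running time is roughly 2*n³/p + (p-1)*α + (p-1)*β*n²/p: unlike the bus version, the computation is evenly divided among the processors.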

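To make the efficiency definition concrete, here is a small self-contained program that evaluates the two running-time estimates, the naive bus model from the slides and the ring estimate sketched after the previous code block, and prints efficiency = serial time / (p * parallel time) for each. The machine parameters are made up for illustration (α = 1000 flop-times per message, β = 10 flop-times per word, n = 1000):

  #include <stdio.h>

  /* Evaluate the alpha-beta running-time estimates from the slides,
   * in units of flop time. Machine numbers are illustrative only. */
  int main(void) {
      double n = 1000, alpha = 1000, beta = 10;
      for (int p = 2; p <= 64; p *= 2) {
          double serial = 2 * n * n * n;
          /* Naive bus: ~ 2n^3 + p^2*alpha + p*n^2*beta */
          double bus = 2*n*n*n + p*p*alpha + p*n*n*beta;
          /* Ring: ~ 2n^3/p + (p-1)*alpha + (p-1)*beta*n^2/p */
          double ring = 2*n*n*n/p + (p-1)*alpha + (p-1)*beta*n*n/p;
          printf("p=%2d  bus eff=%.3f  ring eff=%.3f\n",
                 p, serial / (p * bus), serial / (p * ring));
      }
      return 0;
  }

With these numbers the bus efficiency is never much better than 1/p, matching the observation that the naive algorithm is almost entirely serial, while the ring version stays near 1 until the β term starts to dominate at larger p.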
