Berkeley COMPSCI C267 - Dense Linear Algebra: Parallel Matrix Multiplication

CS 267 Dense Linear Algebra: Parallel Matrix Multiplication
James Demmel
www.cs.berkeley.edu/~demmel/cs267_Spr05
02/14/2005 — CS267 Lecture 8

Contents: Review of the BLAS · Different Parallel Data Layouts for Matrices · Parallel Matrix-Vector Product · Parallel Matrix Multiply · Matrix Multiply with 1D Column Layout · MatMul: 1D Layout on Bus or Ring · MatMul for 1D Layout on a Processor Ring · MatMul with 2D Layout · Cannon's Algorithm · SUMMA Algorithm · Recursive Layouts · Summary of Parallel Matrix Multiplication · Extra slides: Gaussian Elimination, Recursive Factorizations, Blocked GEPP, A Small Software Project, Work-Depth Model of Parallelism, Latency-Bandwidth Model

Outline
• Recall BLAS = Basic Linear Algebra Subroutines
• Matrix-vector multiplication in parallel
• Matrix-matrix multiplication in parallel

Review of the BLAS

BLAS level | Example             | # mem refs | # flops | q
1          | "axpy", dot product | 3n         | 2n      | 2/3
2          | matrix-vector mult  | n^2        | 2n^2    | 2
3          | matrix-matrix mult  | 4n^2       | 2n^3    | n/2

• Building blocks for all linear algebra
• Parallel versions call serial versions on each processor, so they must be fast!
• Recall q = # flops / # mem refs: the larger q is, the faster the algorithm can go in the presence of a memory hierarchy
• "axpy": y = α*x + y, where α is a scalar and x, y are vectors

Different Parallel Data Layouts for Matrices
[Figure: grids of processor indices illustrating each layout; omitted from this text preview]
1) 1D column blocked layout
2) 1D column cyclic layout
3) 1D column block cyclic layout (generalizes the other 1D layouts)
4) Row versions of the previous layouts
5) 2D row and column blocked layout
6) 2D row and column block cyclic layout

Parallel Matrix-Vector Product
• Compute y = y + A*x, where A is a dense matrix
• Layout: 1D row blocked
  • A(i) refers to the n by n/p block row that processor i owns
  • x(i) and y(i) similarly refer to the segments of x, y owned by processor i
• Algorithm: for each processor i
  • Broadcast x(i)
  • Compute y(i) = A(i)*x
• The algorithm uses the formula
  y(i) = y(i) + A(i)*x = y(i) + Σj A(i,j)*x(j)

Matrix-Vector Product y = y + A*x
• A column layout of the matrix eliminates the broadcast of x
• But it adds a reduction to update the destination y
• A 2D blocked layout uses both a broadcast and a reduction, each on a subset of processors
  • sqrt(p) processors per subset for a square processor grid

Parallel Matrix Multiply
• Computing C = C + A*B
• Using the basic algorithm: 2*n^3 flops
• Design variables:
  • Data layout
  • Topology of the machine
  • Scheduling of communication
• Use of performance models for algorithm design:
  Message Time = "latency" + #words * time-per-word = α + n*β
• Efficiency (in any model): serial time / (p * parallel time)
  • perfect (linear) speedup ⇔ efficiency = 1
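The 1D row-blocked matrix-vector product above can be sketched serially in plain Python, as a minimal simulation rather than real message passing: the p "processors" are loop iterations, and the broadcast of x is modeled by letting every block-row owner read the full vector. The function name is illustrative, not from the slides.

```python
# Simulate y = y + A*x with a 1D row-blocked layout on p virtual processors.
# "Processor" i owns block row A(i) (rows i*n/p .. (i+1)*n/p - 1) and the
# matching segments x(i), y(i); after the broadcast it sees all of x.

def matvec_1d_row_blocked(A, x, y, p):
    n = len(A)
    assert n % p == 0, "assume n divisible by p, as in the analysis"
    b = n // p  # rows per processor
    for i in range(p):                       # each virtual processor i
        for r in range(i * b, (i + 1) * b):  # rows of block row A(i)
            # y(i) = y(i) + A(i)*x, using the broadcast copy of x
            y[r] += sum(A[r][c] * x[c] for c in range(n))
    return y

# Tiny check against a direct computation
n, p = 4, 2
A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
x = [1, 0, 2, 1]
y = matvec_1d_row_blocked(A, x, [0] * n, p)
direct = [sum(A[r][c] * x[c] for c in range(n)) for r in range(n)]
assert y == direct
```

In a real distributed implementation the inner `sum` over all columns is why x must be broadcast: each block row needs every entry of x, while each processor only owns the segment x(i).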
Matrix Multiply with 1D Column Layout
• Assume the matrices are n x n and n is divisible by p (may be a reasonable assumption for analysis, not for code)
• A(i) refers to the n by n/p block column that processor i owns (similarly for B(i) and C(i))
• B(i,j) is the n/p by n/p subblock of B(i) in rows j*n/p through (j+1)*n/p - 1
• The algorithm uses the formula
  C(i) = C(i) + A*B(i) = C(i) + Σj A(j)*B(j,i)

Matrix Multiply: 1D Layout on Bus or Ring
• The algorithm uses the formula C(i) = C(i) + A*B(i) = C(i) + Σj A(j)*B(j,i)
• First consider a bus-connected machine without broadcast: only one pair of processors can communicate at a time (e.g. Ethernet)
• Second consider a machine with processors on a ring: all processors may communicate with their nearest neighbors simultaneously

MatMul: 1D Layout on Bus without Broadcast
Naïve algorithm:
    C(myproc) = C(myproc) + A(myproc)*B(myproc,myproc)
    for i = 0 to p-1
        for j = 0 to p-1 except i
            if (myproc == i) send A(i) to processor j
            if (myproc == j)
                receive A(i) from processor i
                C(myproc) = C(myproc) + A(i)*B(i,myproc)
        barrier
Cost of the inner loop:
    computation: 2*n*(n/p)^2 = 2*n^3/p^2
    communication: α + β*n^2/p (approximately)

Naïve MatMul (continued)
• Only one pair of processors (i and j) is active on any iteration, and of those, only i is doing computation ⇒ the algorithm is almost entirely serial
• Running time = (p*(p-1) + 1)*computation + p*(p-1)*communication
  ≈ 2*n^3 + p^2*α + p*n^2*β
• This is worse than the serial time, and grows with p

MatMul for 1D Layout on a Processor Ring
• Pairs of processors can communicate simultaneously
    Copy A(myproc) into Tmp
    C(myproc) = C(myproc) + Tmp*B(myproc, myproc)
    for j = 1 to p-1
        Send Tmp to processor (myproc+1) mod p
        Receive Tmp from processor (myproc-1) mod p
        C(myproc) = C(myproc) + Tmp*B((myproc-j) mod p, myproc)
• Same idea as for gravity in the simple sharks-and-fish algorithm
• May want double buffering in practice, for overlap
• Deadlock details are ignored in the code above
• Time of inner loop = 2*(α + β*n^2/p) + 2*n*(n/p)^2

MatMul for 1D Layout on a Processor Ring (continued)
• Time of inner loop = 2*(α + β*n^2/p) + 2*n*(n/p)^2
• Total Time = 2*n*(n/p)^2 + (p-1) * (time of inner loop)
  ≈ 2*n^3/p + 2*p*α + 2*β*n^2
• This is optimal for a 1D layout on a ring or bus, even with broadcast:
  • Perfect speedup for the arithmetic
  • A(myproc) must move to each other processor, which costs at least (p-1) * (cost of sending n*(n/p) words)
• Parallel Efficiency = 2*n^3 / (p * Total Time)
  = 1 / (1 + α*p^2/(2*n^3) + β*p/(2*n)) = 1 / (1 + O(p/n))
• Grows to 1 as n/p increases
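The ring algorithm can likewise be sketched serially in plain Python, as a simulation under the slides' assumptions (n divisible by p, 1D column layout): each of p virtual processors owns block columns A(i), B(i), C(i), and the send/receive pair is modeled by rotating every processor's Tmp block one position around the ring per step. The helper names are illustrative, not from the slides.

```python
# Simulate C = C + A*B with a 1D column layout on p virtual processors.
# Processor i owns block columns A(i), B(i), C(i) (cols i*n/p .. (i+1)*n/p - 1).
# Each step, every processor's Tmp block (initially A(myproc)) shifts to
# processor (myproc+1) mod p, modeling the simultaneous ring send/receive.

def ring_matmul_1d(A, B, C, p):
    n = len(A)
    assert n % p == 0, "assume n divisible by p, as in the analysis"
    b = n // p

    def block_col(M, i):  # the n x b block column i of matrix M
        return [[M[r][c] for c in range(i * b, (i + 1) * b)] for r in range(n)]

    Tmp = [block_col(A, i) for i in range(p)]  # Tmp on processor i = A(i)
    for j in range(p):                         # j = 0 is the local step, before any shift
        if j > 0:                              # rotate: Tmp moves from i-1 to i around the ring
            Tmp = [Tmp[(i - 1) % p] for i in range(p)]
        for me in range(p):
            src = (me - j) % p                 # processor me now holds Tmp = A(src)
            # C(me) += A(src) * B(src, me), where B(src, me) is the b x b
            # subblock of B(me) in rows src*b .. (src+1)*b - 1
            for r in range(n):
                for c in range(b):
                    C[r][me * b + c] += sum(
                        Tmp[me][r][k] * B[src * b + k][me * b + c] for k in range(b)
                    )
    return C

# Tiny check against a direct triple loop
n, p = 4, 2
A = [[r * n + c + 1 for c in range(n)] for r in range(n)]
B = [[(r + c) % 3 for c in range(n)] for r in range(n)]
C = ring_matmul_1d(A, B, [[0] * n for _ in range(n)], p)
direct = [[sum(A[r][k] * B[k][c] for k in range(n)) for c in range(n)] for r in range(n)]
assert C == direct
```

The rotation list `Tmp = [Tmp[(i - 1) % p] for i in range(p)]` is the serial stand-in for the simultaneous send/receive: after j shifts, processor me holds A((me - j) mod p), matching the index in the update C(myproc) += Tmp*B((myproc - j) mod p, myproc). A real implementation would use nonblocking sends or double buffering to avoid the deadlock the slides set aside.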

