Berkeley COMPSCI C267 - Lecture 13: Parallel Matrix Multiply

CS 267 Applications of Parallel Processors
Lecture 13: Parallel Matrix Multiply
Kathy Yelick
http://www.cs.berkeley.edu/~dmartin/cs267

Contents: Outline; Model overview; Dense Linear Algebra in Electromagnetics; Parallel Matrix Multiply; 1D Layout; Matrix Multiply: 1D Layout on Bus; MatMul on 1D Bus without Broadcast; Naïve MatMul (continued); Better MatMul on a Bus; Performance of "Better" Algorithm; MatMul: 1D Layout and Broadcast; MatMul with 2D Layout; Cannon's Algorithm; Communication in Cannon; BLAS Review.

Outline
- Recap
- Sources of large dense linear systems
- BLAS for linear algebra
- Parallel matrix multiply

Model overview
• Work-depth
• PRAM
• Latency/bandwidth model
  – α is the one-time cost per message (latency)
  – β is the per-byte cost of communication
  – use this today
• LogP model
  – correction: gap should be greater than overhead
  – more on this with parallel sorting
• Topology-specific models

Dense Linear Algebra in Electromagnetics

Computational Electromagnetics
• developed during the 1980s, driven by defense applications
• determine the RCS (radar cross section) of an airplane
• reduce the signature of a plane (stealth technology)
• other applications: antenna design, medical equipment
• two fundamental numerical approaches: MOM, methods of moments (frequency domain), and finite differences (time domain)

Computational Electromagnetics
(image: NW Univ. Comp. Electromagnetics Laboratory, http://nueml.ece.nwu.edu/)
- discretize the surface into triangular facets using standard modeling tools
- the amplitudes of the currents on the surface are the unknowns
- the integral equation is discretized into a set of linear equations

Computational Electromagnetics (MOM)
After discretization, the integral equation has the form Z J = V, where Z is the impedance matrix, J is the unknown vector of amplitudes, and V is the excitation vector.
(see Cwik, Patterson, and Scott, "Electromagnetic Scattering on the Intel Touchstone Delta", IEEE Supercomputing '92, pp. 538-542)

Computational Electromagnetics (MOM)
The main steps in the solution process are:
A) computing the matrix elements
B) factoring the dense matrix
C) solving for one or more excitations (RHS)
D) computing the fields scattered from the object

Analysis of MOM for Parallel Implementation

Task          Work     Parallelism       Parallel Speed
Fill          O(n^2)   embarrassing      low
Factor        O(n^3)   moderately diff.  very high
Solve         O(n^2)   moderately diff.  high
Field Calc.   O(n)     embarrassing      high

For most scientific applications the biggest gain in performance comes from parallelism within each task.

Results for Parallel Implementation on Delta

Task          Time (hours)
Fill          9.20
Factor        8.25
Solve         2.17
Field Calc.   0.12

The problem solved was a matrix of size 48,672, the world record at the time (1991).
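The four phases and their costs are easy to see in code. Below is a minimal serial sketch in C of a MOM-style dense solve Z J = V. It is not the Delta code: the matrix kernel and the "field" summation are toy formulas invented for illustration, chosen only so that the O(n^2) fill, O(n^3) factor, O(n^2) solve, and O(n) field calculation each appear as a distinct loop nest and the program runs.

  /* Toy MOM-style dense solve, Z J = V -- schematic only.
   * The entries of Z and the field formula are invented;
   * the point is the loop structure and cost of each phase. */
  #include <stdio.h>
  #include <stdlib.h>

  int main(void) {
      int n = 500;                                    /* number of surface unknowns */
      double *Z = malloc((size_t)n * n * sizeof *Z);  /* impedance matrix */
      double *J = malloc((size_t)n * sizeof *J);      /* unknown amplitudes */
      double *V = malloc((size_t)n * sizeof *V);      /* excitation vector */

      /* A) Fill: compute every matrix element, O(n^2) work,
         embarrassingly parallel since each entry is independent.
         Toy kernel, made diagonally dominant so that elimination
         without pivoting below is safe. */
      for (int i = 0; i < n; i++) {
          V[i] = 1.0;
          for (int j = 0; j < n; j++)
              Z[i * n + j] = (i == j) ? n : 1.0 / (1.0 + abs(i - j));
      }

      /* B) Factor: Gaussian elimination to upper triangular form, O(n^3).
         The update of the right-hand side is folded in for brevity. */
      for (int k = 0; k < n; k++)
          for (int i = k + 1; i < n; i++) {
              double m = Z[i * n + k] / Z[k * n + k];
              for (int j = k; j < n; j++)
                  Z[i * n + j] -= m * Z[k * n + j];
              V[i] -= m * V[k];
          }

      /* C) Solve: back substitution, O(n^2) per right-hand side. */
      for (int i = n - 1; i >= 0; i--) {
          double s = V[i];
          for (int j = i + 1; j < n; j++)
              s -= Z[i * n + j] * J[j];
          J[i] = s / Z[i * n + i];
      }

      /* D) Field calculation: one pass over the amplitudes, O(n). */
      double field = 0.0;
      for (int i = 0; i < n; i++)
          field += J[i];
      printf("toy scattered field: %g\n", field);

      free(Z); free(J); free(V);
      return 0;
  }

Note from the Delta table that the O(n^2) fill actually took longer than the O(n^3) factor, presumably because each real matrix element requires evaluating an expensive interaction integral between facets: asymptotic operation counts alone do not decide the running time.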
Current Records for Solving Dense Systems

Year    System Size   Machine
1950's  O(100)
1991    55,296        CM-2
1992    75,264        Intel
1993    75,264        Intel
1994    76,800        CM-5
1995    128,600       Paragon XP
1996    215,000       ASCI Red (Tflop)

(source: Alan Edelman, http://www-math.mit.edu/~edelman/records.html)

Sources for large dense linear systems
- Not many basic factorizations outside CEM.
- Large dense eigenproblems are used in chemistry.
- Alternatives are often debated.
The choices of algorithms in existing codes are not the result of careful planning and design:
- they reflect the state of the art at the time, and
- may be purely coincidental.

Solving Large Dense Linear Systems
(see Demmel, http://HTTP.CS.Berkeley.EDU/~demmel/cs267/lecture12/lecture12.html)
Gaussian elimination to solve Ax = b, where A is a dense matrix:
• Add multiples of each row to subsequent rows in order to create zeros below the diagonal.
• End up with an upper triangular matrix U.
• Solve a linear system with U by substitution, starting with the last variable.
Solving these systems uses basic vector and matrix operations called the BLAS.

Parallel Matrix Multiply
• Computing C = C + A*B
• Using the basic algorithm: 2*n^3 flops
• Variables are:
  – data layout
  – topology of the machine
  – scheduling of communication
• Use of performance models for algorithm design

1D Layout
• Assume the matrices are n x n and n is divisible by p.
• A(i) refers to the n by n/p block column that processor i owns (similarly for B(i) and C(i)).
• B(i,j) is the n/p by n/p subblock of B(i)
  – in rows j*n/p through (j+1)*n/p - 1.
[figure: an n x n matrix split into block columns owned by processors p0 through p7]

Matrix Multiply: 1D Layout on Bus
• The algorithm uses the formula
  C(i) = C(i) + A*B(i) = C(i) + Σ_{j=0}^{p-1} A(j)*B(j,i)
• First consider a bus-connected machine without broadcast: only one pair of processors can communicate at a time.
• Second, consider a bus-connected machine with broadcast: one processor may send to many in a single step.

MatMul on 1D Bus without Broadcast

Naïve algorithm:

  C(myproc) = C(myproc) + A(myproc)*B(myproc,myproc)
  for i = 0 to p-1
    for j = 0 to p-1 except i
      if (myproc == i) send A(i) to processor j   // message passing
      if (myproc == j)
        receive A(i) from processor i
        C(myproc) = C(myproc) + A(i)*B(i,myproc)
      barrier

Cost of the inner loop:
  computation: 2*n*(n/p)^2 = 2*n^3/p^2
  communication: α + β*n^2/p, approximately (α*p^2 + β*p*n^2 in total over all rounds)

Naïve MatMul (continued)

Only one pair of processors (i and j) is active on any iteration, and of those, only j is doing computation, so the algorithm is almost entirely serial.

Running time:
  (p*(p-1) + 1)*computation + p*(p-1)*communication
  ~= 2*n^3 + p^2*α + p*n^2*β

This is worse than the serial time, and it grows with p.
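The pseudocode above maps almost line-for-line onto MPI. Here is a minimal sketch of the naive 1D algorithm; it is an illustration rather than code from the lecture. The helper name update(), the choice n = 512, the row-major layout of the block columns, and the zero-initialized matrices are all assumptions made for the sketch, and n is assumed divisible by p.

  /* Naive matmul with 1D layout: message passing without broadcast.
   * Each processor owns block columns A(me), B(me), C(me) of size
   * n x (n/p), stored row-major. Sketch only: A and B start as
   * zeros for brevity; fill them with real data in a real run. */
  #include <mpi.h>
  #include <stdlib.h>

  /* C(me) += Ablk * B(i,me): Ablk is an n x w block column and B(i,me)
     is the w x w subblock of our B starting at row i*w, where w = n/p.
     This is the 2*n*(n/p)^2 flop update from the cost analysis. */
  static void update(double *C, const double *Ablk, const double *B,
                     int i, int n, int w) {
      for (int r = 0; r < n; r++)
          for (int k = 0; k < w; k++)
              for (int c = 0; c < w; c++)
                  C[r * w + c] += Ablk[r * w + k] * B[(i * w + k) * w + c];
  }

  int main(int argc, char **argv) {
      MPI_Init(&argc, &argv);
      int me, p, n = 512;                  /* n assumed divisible by p */
      MPI_Comm_rank(MPI_COMM_WORLD, &me);
      MPI_Comm_size(MPI_COMM_WORLD, &p);
      int w = n / p;                       /* width of each block column */

      double *A  = calloc((size_t)n * w, sizeof *A);
      double *B  = calloc((size_t)n * w, sizeof *B);
      double *C  = calloc((size_t)n * w, sizeof *C);
      double *Ai = malloc((size_t)n * w * sizeof *Ai);  /* buffer for A(i) */

      /* C(myproc) = C(myproc) + A(myproc)*B(myproc,myproc) */
      update(C, A, B, me, n, w);

      for (int i = 0; i < p; i++)
          for (int j = 0; j < p; j++) {
              if (j == i) continue;
              if (me == i)                 /* send A(i) to processor j */
                  MPI_Send(A, n * w, MPI_DOUBLE, j, 0, MPI_COMM_WORLD);
              if (me == j) {               /* receive A(i), then compute */
                  MPI_Recv(Ai, n * w, MPI_DOUBLE, i, 0, MPI_COMM_WORLD,
                           MPI_STATUS_IGNORE);
                  update(C, Ai, B, i, n, w);
              }
              /* barrier models the bus: one pair communicates per round */
              MPI_Barrier(MPI_COMM_WORLD);
          }

      free(A); free(B); free(C); free(Ai);
      MPI_Finalize();
      return 0;
  }

The barriers make the serialization explicit: each of the p*(p-1) rounds moves one message of n^2/p words and performs one block update, which is exactly where the 2*n^3 + p^2*α + p*n^2*β estimate comes from.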

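To see how a performance model drives algorithm design, the estimate can simply be evaluated. A minimal sketch follows; the flop time and the α and β values are invented for illustration, not measurements of any machine from the lecture.

  /* Evaluate the naive-algorithm model
   *   T(p) ~= 2*n^3*t_flop + p^2*alpha + p*n^2*beta
   * against the serial time 2*n^3*t_flop.
   * All machine parameters below are made-up illustrative values. */
  #include <stdio.h>

  int main(void) {
      double t_flop = 1e-8;   /* seconds per flop (assumed) */
      double alpha  = 1e-4;   /* per-message latency, seconds (assumed) */
      double beta   = 1e-7;   /* per-word transfer time, seconds (assumed) */
      double n = 1000.0;
      double t_serial = 2.0 * n * n * n * t_flop;

      printf("%4s %14s %14s\n", "p", "T_naive (s)", "T_serial (s)");
      for (int p = 2; p <= 64; p *= 2) {
          double t = t_serial + (double)p * p * alpha + p * n * n * beta;
          printf("%4d %14.3f %14.3f\n", p, t, t_serial);
      }
      return 0;
  }

With any positive α and β the naive time exceeds the serial time for every p and keeps growing as p does, which is the conclusion the analysis above reaches symbolically.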
