CS 267 Applications of Parallel Computers
Lecture 20: Dense Linear Algebra - II
James Demmel
http://www.cs.berkeley.edu/~demmel/cs267_Spr99/Lectures/Lect_20_2000.ppt

Review of last lecture and Outline

° Motivation for Dense Linear Algebra (Lectures 10-11)
  • Ax=b: Computational Electromagnetics
  • Ax = λx: Quantum Chemistry
° Review Gaussian Elimination (GE) for solving Ax=b
° Optimizing GE for caches on sequential machines
  • using matrix-matrix multiplication (BLAS 3)
  • Other BLAS 3 important too
° LAPACK library overview and performance
° Review GE
° Data layouts on parallel machines
° Parallel matrix-matrix multiplication
° Parallel Gaussian Elimination
° ScaLAPACK library overview
° Recent work, open problems

BLAS 2 version of Gaussian Elimination with Partial Pivoting (GEPP)

    for i = 1 to n-1
        find and record k where |A(k,i)| = max{i <= j <= n} |A(j,i)|
                                      … i.e. largest entry in rest of column i
        if |A(k,i)| = 0
            exit with a warning that A is singular, or nearly so
        elseif k != i
            swap rows i and k of A
        end if
        A(i+1:n,i) = A(i+1:n,i) / A(i,i)
                                      … each quotient lies in [-1,1] … BLAS 1
        A(i+1:n,i+1:n) = A(i+1:n,i+1:n) - A(i+1:n,i) * A(i,i+1:n)
                                      … BLAS 2, most work in this line
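The BLAS 2 loop above maps directly onto NumPy. Below is a minimal runnable sketch (mine, not from the lecture); the function name gepp_blas2 and the pivot array piv are illustrative choices, mirroring the unblocked algorithm rather than any library routine:

    # A minimal NumPy sketch of the BLAS 2 GEPP loop above (illustrative,
    # not LAPACK): L and U overwrite A in place, piv records the row swaps.
    import numpy as np

    def gepp_blas2(A):
        """Factor P*A = L*U in place; returns (A, pivot indices)."""
        A = np.asarray(A, dtype=float)
        n = A.shape[0]
        piv = np.zeros(n - 1, dtype=int)
        for i in range(n - 1):
            # find largest entry in rest of column i
            k = i + np.argmax(np.abs(A[i:, i]))
            piv[i] = k
            if A[k, i] == 0.0:
                raise ValueError("A is singular, or nearly so")
            if k != i:
                A[[i, k], :] = A[[k, i], :]      # swap rows i and k of A
            A[i+1:, i] /= A[i, i]                # BLAS 1: quotients in [-1, 1]
            A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])
                                                 # BLAS 2: rank-1 update
        return A, piv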
BLAS 3 (Blocked) GEPP, using Delayed Updates

    for ib = 1 to n-1 step b      … process matrix b columns at a time
        end = ib + b-1            … point to end of block of b columns
        apply BLAS 2 version of GEPP to get A(ib:n, ib:end) = P' * L' * U'
        … let LL denote the strict lower triangular part of A(ib:end, ib:end) + I
        A(ib:end, end+1:n) = LL⁻¹ * A(ib:end, end+1:n)
                                  … update next b rows of U
        A(end+1:n, end+1:n) = A(end+1:n, end+1:n)
                              - A(end+1:n, ib:end) * A(ib:end, end+1:n)
                                  … apply delayed updates with a single
                                  … matrix multiply with inner dimension b (BLAS 3)

Parallelizing Gaussian Elimination

° Recall parallelization steps from Lecture 3
  • Decomposition: identify enough parallel work, but not too much
  • Assignment: load balance work among threads
  • Orchestration: communication and synchronization
  • Mapping: which processors execute which threads
° Decomposition
  • In the BLAS 2 algorithm nearly every flop in the inner loop can be done in parallel, so with n² processors we need 3n parallel steps:
        for i = 1 to n-1
            A(i+1:n,i) = A(i+1:n,i) / A(i,i)     … BLAS 1 (scale a vector)
            A(i+1:n,i+1:n) = A(i+1:n,i+1:n) - A(i+1:n,i) * A(i,i+1:n)
                                                 … BLAS 2 (rank-1 update)
  • This is too fine-grained; prefer calls to local matmuls instead
  • Need to discuss parallel matrix multiplication
° Assignment
  • Which processors are responsible for which submatrices?

Different Data Layouts for Parallel GE (on 4 procs)

[Figure: candidate layouts, annotated as follows]
° 1D column blocked: bad load balance (P0 idle after first n/4 steps)
° 1D column cyclic: load balanced, but can't easily use BLAS 2 or BLAS 3
° 1D column block cyclic: can trade load balance and BLAS 2/3 performance by choosing b, but factorization of the block column is a bottleneck
° 2D row and column block cyclic: the winner! Complicated addressing; more later...

How to proceed:

° Consider basic parallel matrix multiplication algorithms on simple layouts
  • Performance modeling to choose the best one
    - Time(message) = latency + #words * time-per-word = α + n*β
° Briefly discuss block-cyclic layout
° PBLAS = Parallel BLAS

Parallel Matrix Multiply

° Computing C = C + A*B
° Using basic algorithm: 2*n³ flops
° Variables are:
  • Data layout
  • Topology of machine
  • Scheduling communication
° Use of performance models for algorithm design

1D Layout

° Assume matrices are n x n and n is divisible by p
° A(i) refers to the n by n/p block column that processor i owns (similarly for B(i) and C(i))
° B(j,i) is the n/p by n/p sublock of B(i)
  • in rows j*n/p through (j+1)*n/p
° Algorithm uses the formula
      C(i) = C(i) + A*B(i) = C(i) + Σ_j A(j)*B(j,i)
[Figure: block columns owned by p0, p1, ..., p7]

Matrix Multiply: 1D Layout on Bus or Ring

° Algorithm uses the formula
      C(i) = C(i) + A*B(i) = C(i) + Σ_j A(j)*B(j,i)
° First consider a bus-connected machine without broadcast: only one pair of processors can communicate at a time (ethernet)
° Second consider a machine with processors on a ring: all processors may communicate with nearest neighbors simultaneously
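As a sanity check on the block-column formula above, here is a short serial NumPy sketch (not from the slides; n, p, and the helper names Acol and Bblk are illustrative) that forms C one block column at a time and compares against a plain matrix multiply:

    import numpy as np

    n, p = 8, 4                          # n divisible by p, as assumed
    nb = n // p                          # width of one block column
    rng = np.random.default_rng(0)
    A = rng.standard_normal((n, n))
    B = rng.standard_normal((n, n))
    C = np.zeros((n, n))

    def Acol(j):                         # A(j): block column j of A
        return A[:, j*nb:(j+1)*nb]

    def Bblk(j, i):                      # B(j,i): rows j*nb..(j+1)*nb of B(i)
        return B[j*nb:(j+1)*nb, i*nb:(i+1)*nb]

    for i in range(p):                   # C(i) = C(i) + sum_j A(j)*B(j,i)
        for j in range(p):
            C[:, i*nb:(i+1)*nb] += Acol(j) @ Bblk(j, i)

    assert np.allclose(C, A @ B)         # the block formula reproduces A*B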
Naïve MatMul for 1D layout on Bus without Broadcast

Naïve algorithm:

    C(myproc) = C(myproc) + A(myproc)*B(myproc,myproc)
    for i = 0 to p-1
        for j = 0 to p-1 except i
            if (myproc == i) send A(i) to processor j
            if (myproc == j)
                receive A(i) from processor i
                C(myproc) = C(myproc) + A(i)*B(i,myproc)
            barrier

Cost of inner loop:
    computation:   2*n*(n/p)² = 2*n³/p²
    communication: α + β*n²/p

Naïve MatMul (continued)

Cost of inner loop:
    computation:   2*n*(n/p)² = 2*n³/p²
    communication: α + β*n²/p    … approximately

Only 1 pair of processors (i and j) are active on any iteration, and of those, only processor j is doing computation
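For concreteness, a hedged mpi4py sketch of the naive algorithm above, using only blocking point-to-point Send/Recv to mimic a bus without broadcast; the matrix contents, n, and all variable names are illustrative assumptions, not part of the lecture:

    # Run with e.g.: mpiexec -n 4 python naive_matmul.py
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    p, me = comm.Get_size(), comm.Get_rank()
    n = 4 * p                            # n divisible by p
    nb = n // p
    rng = np.random.default_rng(me)
    A_my = rng.standard_normal((n, nb))  # A(myproc): my block column of A
    B_my = rng.standard_normal((n, nb))  # B(myproc): my block column of B
    C_my = np.zeros((n, nb))             # C(myproc)

    def Bblk(i):                         # B(i,myproc): rows i*nb..(i+1)*nb of B(myproc)
        return B_my[i*nb:(i+1)*nb, :]

    C_my += A_my @ Bblk(me)              # local term: A(myproc)*B(myproc,myproc)
    for i in range(p):
        for j in range(p):
            if j == i:
                continue
            if me == i:
                comm.Send(A_my, dest=j)  # owner of A(i) sends it to j
            if me == j:
                A_i = np.empty((n, nb))
                comm.Recv(A_i, source=i)
                C_my += A_i @ Bblk(i)    # C(myproc) += A(i)*B(i,myproc)
            comm.Barrier()               # only one pair talks at a time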