CS 267 Applications of Parallel Computers
Lecture 20: Dense Linear Algebra - II
James Demmel
http://www.cs.berkeley.edu/~demmel/cs267_Spr99/Lectures/Lect_20_2000.ppt

Review of last lecture and Outline

° Motivation for Dense Linear Algebra (Lectures 10-11)
  • Ax=b: Computational Electromagnetics
  • Ax = λx: Quantum Chemistry
° Review Gaussian Elimination (GE) for solving Ax=b
° Optimizing GE for caches on sequential machines
  • using matrix-matrix multiplication (BLAS 3)
  • Other BLAS 3 important too
° LAPACK library overview and performance
° Review GE
° Data layouts on parallel machines
° Parallel matrix-matrix multiplication
° Parallel Gaussian Elimination
° ScaLAPACK library overview
° Recent work, open problems

BLAS 2 version of Gaussian Elimination with Partial Pivoting (GEPP)

    for i = 1 to n-1
        find and record k where |A(k,i)| = max{i <= j <= n} |A(j,i)|
                                      … i.e. largest entry in rest of column i
        if |A(k,i)| = 0
            exit with a warning that A is singular, or nearly so
        elseif k != i
            swap rows i and k of A
        end if
        A(i+1:n,i) = A(i+1:n,i) / A(i,i)
                                      … each quotient lies in [-1,1] … BLAS 1
        A(i+1:n,i+1:n) = A(i+1:n,i+1:n) - A(i+1:n,i) * A(i,i+1:n)
                                      … BLAS 2, most work in this line
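The BLAS 2 loop above maps directly onto NumPy. Below is a minimal runnable sketch (mine, not from the lecture); the function name gepp_blas2 and the pivot array piv are illustrative choices, mirroring the unblocked algorithm rather than any library routine:

    # A minimal NumPy sketch of the BLAS 2 GEPP loop above (illustrative,
    # not LAPACK): L and U overwrite A in place, piv records the row swaps.
    import numpy as np

    def gepp_blas2(A):
        """Factor P*A = L*U in place; returns (A, pivot indices)."""
        A = np.asarray(A, dtype=float)
        n = A.shape[0]
        piv = np.zeros(n - 1, dtype=int)
        for i in range(n - 1):
            # find largest entry in rest of column i
            k = i + np.argmax(np.abs(A[i:, i]))
            piv[i] = k
            if A[k, i] == 0.0:
                raise ValueError("A is singular, or nearly so")
            if k != i:
                A[[i, k], :] = A[[k, i], :]      # swap rows i and k of A
            A[i+1:, i] /= A[i, i]                # BLAS 1: quotients in [-1, 1]
            A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])
                                                 # BLAS 2: rank-1 update
        return A, piv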
BLAS 3 (Blocked) GEPP, using Delayed Updates

    for ib = 1 to n-1 step b      … process matrix b columns at a time
        end = ib + b-1            … point to end of block of b columns
        apply BLAS 2 version of GEPP to get A(ib:n, ib:end) = P' * L' * U'
        … let LL denote the strict lower triangular part of A(ib:end, ib:end) + I
        A(ib:end, end+1:n) = LL⁻¹ * A(ib:end, end+1:n)
                                  … update next b rows of U
        A(end+1:n, end+1:n) = A(end+1:n, end+1:n)
                              - A(end+1:n, ib:end) * A(ib:end, end+1:n)
                                  … apply delayed updates with a single
                                  … matrix multiply with inner dimension b (BLAS 3)

Parallelizing Gaussian Elimination

° Recall parallelization steps from Lecture 3
  • Decomposition: identify enough parallel work, but not too much
  • Assignment: load balance work among threads
  • Orchestration: communication and synchronization
  • Mapping: which processors execute which threads
° Decomposition
  • In the BLAS 2 algorithm nearly every flop in the inner loop can be done in parallel, so with n² processors we need 3n parallel steps:
        for i = 1 to n-1
            A(i+1:n,i) = A(i+1:n,i) / A(i,i)     … BLAS 1 (scale a vector)
            A(i+1:n,i+1:n) = A(i+1:n,i+1:n) - A(i+1:n,i) * A(i,i+1:n)
                                                 … BLAS 2 (rank-1 update)
  • This is too fine-grained; prefer calls to local matmuls instead
  • Need to discuss parallel matrix multiplication
° Assignment
  • Which processors are responsible for which submatrices?

Different Data Layouts for Parallel GE (on 4 procs)

[Figure: candidate layouts, annotated as follows]
° 1D column blocked: bad load balance (P0 idle after first n/4 steps)
° 1D column cyclic: load balanced, but can't easily use BLAS 2 or BLAS 3
° 1D column block cyclic: can trade load balance and BLAS 2/3 performance by choosing b, but factorization of the block column is a bottleneck
° 2D row and column block cyclic: the winner! Complicated addressing; more later...

How to proceed:

° Consider basic parallel matrix multiplication algorithms on simple layouts
  • Performance modeling to choose the best one
    - Time(message) = latency + #words * time-per-word = α + n*β
° Briefly discuss block-cyclic layout
° PBLAS = Parallel BLAS

Parallel Matrix Multiply

° Computing C = C + A*B
° Using basic algorithm: 2*n³ flops
° Variables are:
  • Data layout
  • Topology of machine
  • Scheduling communication
° Use of performance models for algorithm design

1D Layout

° Assume matrices are n x n and n is divisible by p
° A(i) refers to the n by n/p block column that processor i owns (similarly for B(i) and C(i))
° B(j,i) is the n/p by n/p sublock of B(i)
  • in rows j*n/p through (j+1)*n/p
° Algorithm uses the formula
      C(i) = C(i) + A*B(i) = C(i) + Σ_j A(j)*B(j,i)
[Figure: block columns owned by p0, p1, ..., p7]

Matrix Multiply: 1D Layout on Bus or Ring

° Algorithm uses the formula
      C(i) = C(i) + A*B(i) = C(i) + Σ_j A(j)*B(j,i)
° First consider a bus-connected machine without broadcast: only one pair of processors can communicate at a time (ethernet)
° Second consider a machine with processors on a ring: all processors may communicate with nearest neighbors simultaneously
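As a sanity check on the block-column formula above, here is a short serial NumPy sketch (not from the slides; n, p, and the helper names Acol and Bblk are illustrative) that forms C one block column at a time and compares against a plain matrix multiply:

    import numpy as np

    n, p = 8, 4                          # n divisible by p, as assumed
    nb = n // p                          # width of one block column
    rng = np.random.default_rng(0)
    A = rng.standard_normal((n, n))
    B = rng.standard_normal((n, n))
    C = np.zeros((n, n))

    def Acol(j):                         # A(j): block column j of A
        return A[:, j*nb:(j+1)*nb]

    def Bblk(j, i):                      # B(j,i): rows j*nb..(j+1)*nb of B(i)
        return B[j*nb:(j+1)*nb, i*nb:(i+1)*nb]

    for i in range(p):                   # C(i) = C(i) + sum_j A(j)*B(j,i)
        for j in range(p):
            C[:, i*nb:(i+1)*nb] += Acol(j) @ Bblk(j, i)

    assert np.allclose(C, A @ B)         # the block formula reproduces A*B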
Naïve MatMul for 1D layout on Bus without Broadcast

Naïve algorithm:

    C(myproc) = C(myproc) + A(myproc)*B(myproc,myproc)
    for i = 0 to p-1
        for j = 0 to p-1 except i
            if (myproc == i) send A(i) to processor j
            if (myproc == j)
                receive A(i) from processor i
                C(myproc) = C(myproc) + A(i)*B(i,myproc)
            barrier

Cost of inner loop:
    computation:   2*n*(n/p)² = 2*n³/p²
    communication: α + β*n²/p

Naïve MatMul (continued)

Cost of inner loop:
    computation:   2*n*(n/p)² = 2*n³/p²
    communication: α + β*n²/p    … approximately

Only 1 pair of processors (i and j) are active on any iteration, and of those, only processor j is doing computation
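For concreteness, a hedged mpi4py sketch of the naive algorithm above, using only blocking point-to-point Send/Recv to mimic a bus without broadcast; the matrix contents, n, and all variable names are illustrative assumptions, not part of the lecture:

    # Run with e.g.: mpiexec -n 4 python naive_matmul.py
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    p, me = comm.Get_size(), comm.Get_rank()
    n = 4 * p                            # n divisible by p
    nb = n // p
    rng = np.random.default_rng(me)
    A_my = rng.standard_normal((n, nb))  # A(myproc): my block column of A
    B_my = rng.standard_normal((n, nb))  # B(myproc): my block column of B
    C_my = np.zeros((n, nb))             # C(myproc)

    def Bblk(i):                         # B(i,myproc): rows i*nb..(i+1)*nb of B(myproc)
        return B_my[i*nb:(i+1)*nb, :]

    C_my += A_my @ Bblk(me)              # local term: A(myproc)*B(myproc,myproc)
    for i in range(p):
        for j in range(p):
            if j == i:
                continue
            if me == i:
                comm.Send(A_my, dest=j)  # owner of A(i) sends it to j
            if me == j:
                A_i = np.empty((n, nb))
                comm.Recv(A_i, source=i)
                C_my += A_i @ Bblk(i)    # C(myproc) += A(i)*B(i,myproc)
            comm.Barrier()               # only one pair talks at a time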