CS 267: Dense Linear Algebra: Parallel Matrix Multiplication
James Demmel
www.cs.berkeley.edu/~demmel/cs267_Spr06
02/09/2006, CS267 Lecture 8

Outline
- Recall BLAS = Basic Linear Algebra Subroutines
- Matrix-vector multiplication in parallel
- Matrix-matrix multiplication in parallel

Review of the BLAS
- Building blocks for all linear algebra
- Parallel versions call serial versions on each processor, so they must be fast!
- Recall q = #flops / #memory refs: the larger q is, the faster the algorithm can go in the presence of a memory hierarchy
- "axpy": y = alpha*x + y, where alpha is a scalar and x and y are vectors

    BLAS level   Example               #mem refs   #flops   q
    1            "axpy", dot product   3n          2n       2/3
    2            matrix-vector mult    n^2         2n^2     2
    3            matrix-matrix mult    4n^2        2n^3     n/2

Different Parallel Data Layouts for Matrices
1) 1D column blocked layout: each processor owns one contiguous block of columns
2) 1D column cyclic layout: columns are dealt out to processors round-robin (0 1 2 3 0 1 2 3 ...)
3) 1D column block cyclic layout: blocks of b columns are dealt out round-robin
4) Row versions of the previous layouts
5) 2D row and column blocked layout: processors form a grid; each owns one contiguous submatrix
6) 2D row and column block cyclic layout: generalizes the others

Parallel Matrix-Vector Product
- Compute y = y + A*x, where A is a dense matrix
- Layout: 1D row blocked
- A(i) refers to the n/p by n block row that processor i owns; x(i) and y(i) similarly refer to the segments of x and y owned by processor i
  [Figure: A, x, and y partitioned by block rows across processors P0-P3]
- Algorithm:
    Foreach processor i
      Broadcast x(i)
      Compute y(i) = y(i) + A(i)*x
- The algorithm uses the formula
    y(i) = y(i) + A(i)*x = y(i) + sum over j of A(i,j)*x(j)

Matrix-Vector Product y = y + A*x
- A column layout of the matrix eliminates the broadcast of x
  - But adds a reduction to update the destination y
- A 2D blocked layout uses both a broadcast and a reduction, each on a subset of processors
  - sqrt(p) processors for a square processor grid
  [Figure: 2D blocked layout of A across processors P0-P15]
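The 1D row-blocked matrix-vector product above can be sketched as a serial simulation in Python. This is an illustrative sketch, not code from the slides: the broadcast of x is modeled by every "processor" reading the assembled full vector, and the function name and loop structure are my own.

```python
# Serial simulation of the 1D row-blocked parallel matrix-vector product
# y = y + A*x. Each "processor" i owns block row A(i) and segments x(i), y(i).
# Matrices are plain lists of lists; there is no real message passing here.

def matvec_1d_row_blocked(A, x, y, p):
    n = len(A)
    nb = n // p                      # rows per processor (assume p divides n)
    # "Broadcast x(i)": every processor ends up with the full vector x
    x_full = [x[j] for j in range(n)]
    for i in range(p):               # Foreach processor i
        # Compute y(i) = y(i) + A(i)*x on block row i
        for r in range(i * nb, (i + 1) * nb):
            y[r] += sum(A[r][c] * x_full[c] for c in range(n))
    return y
```

In a real distributed implementation each processor would hold only A(i) and x(i) and obtain the rest of x via a collective such as MPI_Allgather.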
Parallel Matrix Multiply
- Computing C = C + A*B
- Using the basic algorithm: 2*n^3 flops
- Variables are:
  - Data layout
  - Topology of machine
  - Scheduling communication
- Use of performance models for algorithm design
  - Message time = "latency" + #words * time-per-word = alpha + n*beta
- Efficiency (in any model):
  - serial time / (p * parallel time)
  - perfect (linear) speedup <=> efficiency = 1

Matrix Multiply with 1D Column Layout
- Assume matrices are n x n and n is divisible by p
  - May be a reasonable assumption for analysis, not for code
- A(i) refers to the n by n/p block column that processor i owns (similarly for B(i) and C(i))
- B(i,j) is the n/p by n/p sublock of B(i)
  - in rows j*n/p through (j+1)*n/p - 1
- Algorithm uses the formula
    C(i) = C(i) + A*B(i) = C(i) + sum over j of A(j)*B(j,i)

Matrix Multiply: 1D Layout on Bus or Ring
- Algorithm uses the formula
    C(i) = C(i) + A*B(i) = C(i) + sum over j of A(j)*B(j,i)
- First, consider a bus-connected machine without broadcast: only one pair of processors can communicate at a time (e.g., ethernet)
- Second, consider a machine with processors on a ring: all processors may communicate with nearest neighbors simultaneously

MatMul: 1D Layout on Bus without Broadcast
Naive algorithm:
    C(myproc) = C(myproc) + A(myproc)*B(myproc, myproc)
    for i = 0 to p-1
      for j = 0 to p-1 except i
        if (myproc == i) send A(i) to processor j
        if (myproc == j)
          receive A(i) from processor i
          C(myproc) = C(myproc) + A(i)*B(i, myproc)
      barrier
Cost of inner loop:
    computation: 2*n*(n/p)^2 = 2*n^3/p^2
    communication: alpha + beta*n^2/p

Naive MatMul (continued)
- Cost of inner loop (approximately):
    computation: 2*n*(n/p)^2 = 2*n^3/p^2
    communication: alpha + beta*n^2/p
- Only one pair of processors (i and j) is active on any iteration, and of those, only i is doing computation => the algorithm is almost entirely serial
- Running time:
    = (p*(p-1) + 1) * computation + p*(p-1) * communication
    ≈ 2*n^3 + p^2*alpha + p*n^2*beta
- This is worse than the serial time and grows with p
- Why might you still want to do this?
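The running-time estimate above can be turned into a small cost model. This is a sketch using the slide's approximation; the time-per-flop t_flop and the sample values of alpha and beta in the test are illustrative, not measured on any machine.

```python
# Cost model for the naive bus algorithm vs. serial matmul, in seconds:
#   T_serial = 2*n^3 * t_flop
#   T_naive  = (p*(p-1) + 1) * (2*n^3/p^2) * t_flop
#            + p*(p-1) * (alpha + beta*n^2/p)

def serial_time(n, t_flop):
    # 2*n^3 flops done by one processor
    return 2 * n**3 * t_flop

def naive_bus_time(n, p, t_flop, alpha, beta):
    # Iterations are serialized: one block multiply and one message each
    comp = (p * (p - 1) + 1) * (2 * n**3 / p**2) * t_flop
    comm = p * (p - 1) * (alpha + beta * n**2 / p)
    return comp + comm
```

With illustrative parameters such as n = 1024, t_flop = 1e-9, alpha = 1e-5, beta = 1e-8, the naive bus time at p = 16 already exceeds the serial time, and it keeps growing as p increases, matching the slide's conclusion.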
Matmul for 1D Layout on a Processor Ring
- Pairs of adjacent processors can communicate simultaneously

    Copy A(myproc) into Tmp
    C(myproc) = C(myproc) + Tmp*B(myproc, myproc)
    for j = 1 to p-1
      Send Tmp to processor myproc+1 mod p
      Receive Tmp from processor myproc-1 mod p
      C(myproc) = C(myproc) + Tmp*B((myproc - j) mod p, myproc)

- Same idea as for gravity in the simple sharks-and-fish algorithm
- May want double buffering in practice for overlap
- Deadlock details are ignored in the code above
- Time of inner loop = 2*(alpha + beta*n^2/p) + 2*n*(n/p)^2

Matmul for 1D Layout on a Processor Ring (continued)
- Time of inner loop = 2*(alpha + beta*n^2/p) + 2*n*(n/p)^2
- Total Time = 2*n*(n/p)^2 + (p-1) * Time of inner loop
             ≈ 2*n^3/p + 2*p*alpha + 2*beta*n^2
- (Nearly) optimal for a 1D layout on a ring or bus, even with broadcast:
  - Perfect speedup for arithmetic
  - A(myproc) must move to each other processor; this costs at least (p-1) * cost of sending n*(n/p) words
- Parallel Efficiency = 2*n^3 / (p * Total Time)
    = 1 / (1 + alpha*p^2/(2*n^3) + beta*p/(2*n))
    = 1 / (1 + O(p/n))
- Grows to 1 as n/p increases (or as alpha and beta shrink)

MatMul with 2D Layout
- Consider processors in a 2D grid (physical or logical)
- Processors can communicate with 4 nearest neighbors
  - Broadcast along rows and columns
  [Figure: three stages of an s x s processor grid, p(0,0) through p(2,2)]
- Assume the p processors form a square s x s grid

Cannon's Algorithm
- C(i,j) = C…
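The 1D ring algorithm analyzed above can be checked with a serial simulation. This is an illustrative sketch, not slide code: lists of lists stand in for distributed block columns, and the ring shift of Tmp is modeled by each processor indexing block column (myproc - j) mod p directly.

```python
# Serial simulation of matmul with 1D column layout on a ring:
# C(i) = C(i) + sum over j of A(j)*B(j,i). Block column A(myproc) circulates
# around the ring; after j shifts, processor i holds A((i - j) mod p).

def matmul_1d_ring(A, B, p):
    n = len(A)
    nb = n // p                                   # columns per processor
    C = [[0.0] * n for _ in range(n)]

    def add_block_product(owner, src):
        # C(owner) += A(src) * B(src, owner), in global index space
        for r in range(n):
            for c in range(owner * nb, (owner + 1) * nb):
                for k in range(src * nb, (src + 1) * nb):
                    C[r][c] += A[r][k] * B[k][c]

    for i in range(p):                            # local step: Tmp = A(myproc)
        add_block_product(i, i)
    for j in range(1, p):                         # j-th ring shift of Tmp
        for i in range(p):
            add_block_product(i, (i - j) % p)
    return C
```

Each outer iteration j corresponds to one ring shift of Tmp; after p-1 shifts every processor has multiplied against every block column of A, so the result equals the ordinary matrix product.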