Unformatted text preview:

CS 240A : Matrix multiplicationMatrix-Matrix MultiplicationCommunication volume modelParallel Matrix MultiplyMatrix Multiply with 1D Column LayoutMatmul for 1D layout on a Processor RingMatmul for 1D layout on a Processor RingMatMul with 2D LayoutCannon’s Algorithm: 2-D merry-go-roundCannon’s Matrix MultiplicationInitial Step to Skew Matrices in CannonSkewing Steps in CannonCommunication Volume of Cannon’s AlgorithmDrawbacks to CannonSlide 15“Naïve” 3-Loop Matrix Multiply3-Loop Matrix Multiply [Alpern et al., 1992]Avoiding data movement: Reuse and locality3-Loop Matrix Multiply [Alpern et al., 1992]Simplified model of hierarchical memory“Naïve” Matrix Multiply“Naïve” Matrix MultiplyBlocked Matrix MultiplyBlocked Matrix MultiplyMulti-Level Blocked Matrix MultiplyBLAS: Basic Linear Algebra SubroutinesBLAS speeds on an IBM RS6000/590Slide 28Slide 29Latency Bandwidth ModelMatrix Multiply with 1D Column LayoutMatmul for 1D layout on a Processor RingMatmul for 1D layout on a Processor RingMatMul with 2D LayoutCannon’s Algorithm: 2-D merry-go-roundCannon’s Matrix MultiplicationInitial Step to Skew Matrices in CannonSkewing Steps in CannonCost of Cannon’s AlgorithmSlide 40SUMMA AlgorithmSUMMASUMMASUMMA performanceSUMMA performanceCS 240A : Matrix multiplication•Matrix multiplication I : parallel issues•Matrix multiplication II: cache issuesThanks to Jim Demmel and Kathy Yelick (UCB) for some of these slidesMatrix-Matrix Multiplication{implements C = C + A*B}for i = 1 to n for j = 1 to nfor k = 1 to n C(i,j) = C(i,j) + A(i,k) * B(k,j)= + *C(i,j) C(i,j)A(i,:)B(:,j)Algorithm has 2*n3 = O(n3) Flops and operates on 3*n2 words of memoryCommunication volume model•Network of p processors•Each with local memory•Message-passing •Communication volume (v)•Total size (words) of all messages passed during computation•Broadcasting one word costs volume p (actually, p-1)•No explicit accounting for communication time•Thus, can’t really model parallel efficiency or speedup; for that, we’d use the latency-bandwidth modelParallel Matrix Multiply•Compute C = C + A*B•Basic sequential algorithm:•C(i,j) += A(i,1)*B(1,j) + A(i,2)*B(1,j) +…+ A(i,n)*B(n,j)•work = t1 = 2n3 floating point operations (“flops”)•Variables are:•Data layout •Structure of communication •Schedule of communicationMatrix Multiply with 1D Column Layout•Assume matrices are n x n and n is divisible by p•A(i) is the n-by-n/p block column that processor i owns (similarly B(i) and C(i))•B(i,j) is the n/p-by-n/p sublock of B(i) •in rows j*n/p through (j+1)*n/p•Formula: C(i) = C(i) + A*B(i) = C(i) + Sj=0:p-1 A(j) * B(j,i)p0 p1 p2 p3 p5 p4 p6 p7 (A reasonable assumption for analysis, not for code)Matmul for 1D layout on a Processor Ring•Proc k communicates only with procs k-1 and k+1•Different pairs of processors can communicate simultaneously•Round-Robin “Merry-Go-Round” algorithmCopy A(myproc) into MGR (MGR = “Merry-Go-Round”)C(myproc) = C(myproc) + MGR*B(myproc , myproc)for j = 1 to p-1 send MGR to processor myproc+1 mod p (but see deadlock below) receive MGR from processor myproc-1 mod p (but see below) C(myproc) = C(myproc) + MGR * B( myproc-j mod p , myproc) • Avoiding deadlock: • even procs send then recv, odd procs recv then send• or, use nonblocking sends• Comm volume of one inner loop iteration = n2Matmul for 1D layout on a Processor Ring•One iteration: v = n2•All p-1 iterations: v = (p-1) * n2 ~ pn2•Optimal for 1D data layout:• Perfect speedup for arithmetic• A(myproc) must meet each C(myproc)•“Nice” communication pattern – can probably overlap independent communications in the ring.•In latency/bandwidth model (see extra slides), parallel efficiency e = 1 - O(p/n)MatMul with 2D Layout•Consider processors in 2D grid (physical or logical)•Processors can communicate with 4 nearest neighbors•Alternative pattern: broadcast along rows and columns •Assume p is square s x s grid p(0,0) p(0,1) p(0,2) p(1,0) p(1,1) p(1,2) p(2,0) p(2,1) p(2,2) p(0,0) p(0,1) p(0,2) p(1,0) p(1,1) p(1,2) p(2,0) p(2,1) p(2,2) p(0,0) p(0,1) p(0,2) p(1,0) p(1,1) p(1,2) p(2,0) p(2,1) p(2,2)= *Cannon’s Algorithm: 2-D merry-go-round… C(i,j) = C(i,j) + S A(i,k)*B(k,j)… assume s = sqrt(p) is an integer forall i=0 to s-1 … “skew” A left-circular-shift row i of A by i … so that A(i,j) overwritten by A(i,(j+i)mod s) forall i=0 to s-1 … “skew” B up-circular-shift B column i of B by i … so that B(i,j) overwritten by B((i+j)mod s), j) for k=0 to s-1 … sequential forall i=0 to s-1 and j=0 to s-1 … all processors in parallel C(i,j) = C(i,j) + A(i,j)*B(i,j) left-circular-shift each row of A by 1 up-circular-shift each row of B by 1 kC(1,2) = A(1,0) * B(0,2) + A(1,1) * B(1,2) + A(1,2) * B(2,2)Cannon’s Matrix MultiplicationInitial Step to Skew Matrices in Cannon•Initial blocked input•After skewing before initial block multipliesA(0,1) A(0,2)A(1,0)A(2,0)A(1,1) A(1,2)A(2,1)A(2,2)A(0,0)B(0,1) B(0,2)B(1,0)B(2,0)B(1,1) B(1,2)B(2,1) B(2,2)B(0,0)A(0,1) A(0,2)A(1,0)A(2,0)A(1,1) A(1,2)A(2,1) A(2,2)A(0,0)B(0,1)B(0,2)B(1,0)B(2,0)B(1,1)B(1,2)B(2,1)B(2,2)B(0,0)Skewing Steps in Cannon•First step•Second•ThirdA(0,1) A(0,2)A(1,0)A(2,0)A(1,1) A(1,2)A(2,1)A(2,2)A(0,0)B(0,1)B(0,2)B(1,0)B(2,0)B(1,1)B(1,2)B(2,1)B(2,2)B(0,0)A(0,1) A(0,2)A(1,0)A(2,0)A(1,2)A(2,1)B(0,1)B(0,2)B(1,0)B(2,0)B(1,1)B(1,2)B(2,1)B(2,2)B(0,0)A(0,1)A(0,2)A(1,0)A(2,0)A(1,1) A(1,2)A(2,1) A(2,2)A(0,0) B(0,1)B(0,2)B(1,0)B(2,0)B(1,1)B(1,2)B(2,1)B(2,2)B(0,0)A(1,1)A(2,2)A(0,0)Communication Volume of Cannon’s Algorithm forall i=0 to s-1 … recall s = sqrt(p) left-circular-shift row i of A by i … v = n2 / s for each i forall i=0 to s-1 up-circular-shift B column i of B by i … v = n2 / s for each i for k=0 to s-1 forall i=0 to s-1 and j=0 to s-1 C(i,j) = C(i,j) + A(i,j)*B(i,j) left-circular-shift each row of A by 1 … v = n2 for each k up-circular-shift each row of B by 1 … v = n2 for each k° Total comm v = 2*n2 + 2* s*n2 ~ 2* sqrt(p)*n2° Again, “nice” communication pattern° In


View Full Document

UCSB CS 240A - Matrix multiplication

Download Matrix multiplication
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Matrix multiplication and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Matrix multiplication 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?