CS 240A : Matrix multiplication
•Matrix multiplication I: parallel issues
•Matrix multiplication II: cache issues
Thanks to Jim Demmel and Kathy Yelick (UCB) for some of these slides

Matrix-Matrix Multiplication
{implements C = C + A*B}
for i = 1 to n
  for j = 1 to n
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j)
•The inner loop computes C(i,j) = C(i,j) + A(i,:) * B(:,j)
•The algorithm performs 2*n^3 = O(n^3) flops and operates on 3*n^2 words of memory

Communication volume model
•Network of p processors, each with local memory, communicating by message passing
•Communication volume (v): total size (in words) of all messages passed during the computation
•Broadcasting one word costs volume p (actually, p-1)
•No explicit accounting for communication time
•Thus, can't really model parallel efficiency or speedup; for that, we'd
use the latency-bandwidth model.

Parallel Matrix Multiply
•Compute C = C + A*B
•Basic sequential algorithm:
  •C(i,j) += A(i,1)*B(1,j) + A(i,2)*B(2,j) + … + A(i,n)*B(n,j)
•work = t1 = 2n^3 floating point operations ("flops")
•Variables are:
  •Data layout
  •Structure of communication
  •Schedule of communication

Matrix Multiply with 1D Column Layout
•Assume matrices are n x n and n is divisible by p (a reasonable assumption for analysis, not for code)
•A(i) is the n-by-n/p block column that processor i owns (similarly B(i) and C(i))
•B(i,j) is the n/p-by-n/p sub-block of B(i)
  •in rows j*n/p through (j+1)*n/p - 1
•Formula: C(i) = C(i) + A*B(i) = C(i) + Σj=0:p-1 A(j) * B(j,i)
p0 p1 p2 p3 p4 p5 p6 p7

Matmul for 1D layout on a Processor Ring
•Proc k communicates only with procs k-1 and k+1
•Different pairs of processors can communicate simultaneously
•Round-robin "merry-go-round" algorithm:
  Copy A(myproc) into MGR                    (MGR = "Merry-Go-Round")
  C(myproc) = C(myproc) + MGR * B(myproc, myproc)
  for j = 1 to p-1
    send MGR to processor (myproc+1) mod p       (but see deadlock below)
    receive MGR from processor (myproc-1) mod p  (but see below)
    C(myproc) = C(myproc) + MGR * B((myproc-j) mod p, myproc)
•Avoiding deadlock:
  •even procs send then recv, odd procs recv then send
  •or, use nonblocking sends
•Comm volume of one inner-loop iteration = n^2

Matmul for 1D layout on a Processor Ring
•One iteration: v = n^2
•All p-1 iterations: v = (p-1) * n^2 ~ p*n^2
•Optimal for 1D data layout:
  •Perfect speedup for arithmetic
  •A(myproc) must meet each C(myproc)
•"Nice" communication pattern – can probably overlap independent communications in the ring
•In the latency/bandwidth model (see extra slides), parallel efficiency e = 1 - O(p/n)

MatMul with 2D Layout
•Consider processors in a 2D grid (physical or logical)
•Processors can communicate with 4 nearest neighbors
•Alternative pattern: broadcast along rows and columns
•Assume p is a perfect square, arranged as an s x s grid
p(0,0) p(0,1) p(0,2)
p(1,0) p(1,1) p(1,2)
p(2,0)
p(2,1) p(2,2)

Cannon's Algorithm: 2-D merry-go-round
… C(i,j) = C(i,j) + Σk A(i,k)*B(k,j)
… assume s = sqrt(p) is an integer
forall i=0 to s-1                      … "skew" A
  left-circular-shift row i of A by i
  … so that A(i,j) is overwritten by A(i, (j+i) mod s)
forall j=0 to s-1                      … "skew" B
  up-circular-shift column j of B by j
  … so that B(i,j) is overwritten by B((i+j) mod s, j)
for k=0 to s-1                         … sequential
  forall i=0 to s-1 and j=0 to s-1     … all processors in parallel
    C(i,j) = C(i,j) + A(i,j)*B(i,j)
  left-circular-shift each row of A by 1
  up-circular-shift each column of B by 1

Cannon's Matrix Multiplication
•Each processor accumulates one block of C; for example, with s = 3:
  C(1,2) = A(1,0) * B(0,2) + A(1,1) * B(1,2) + A(1,2) * B(2,2)

Initial Step to Skew Matrices in Cannon
•Initial blocked input:
  A(0,0) A(0,1) A(0,2)      B(0,0) B(0,1) B(0,2)
  A(1,0) A(1,1) A(1,2)      B(1,0) B(1,1) B(1,2)
  A(2,0) A(2,1) A(2,2)      B(2,0) B(2,1) B(2,2)
•After skewing, before the initial block multiplies:
  A(0,0) A(0,1) A(0,2)      B(0,0) B(1,1) B(2,2)
  A(1,1) A(1,2) A(1,0)      B(1,0) B(2,1) B(0,2)
  A(2,2) A(2,0) A(2,1)      B(2,0) B(0,1) B(1,2)

Skewing Steps in Cannon
•First, second, and third steps: between block multiplies, each row of A
 left-circular-shifts by 1 and each column of B up-circular-shifts by 1,
 so that processor (i,j) eventually sees A(i,k) and B(k,j) for every k

Communication Volume of Cannon's Algorithm
forall i=0 to s-1                          … recall s = sqrt(p)
  left-circular-shift row i of A by i      … v = n^2/s for each i
forall j=0 to s-1
  up-circular-shift column j of B by j     … v = n^2/s for each j
for k=0 to s-1
  forall i=0 to s-1 and j=0 to s-1
    C(i,j) = C(i,j) + A(i,j)*B(i,j)
  left-circular-shift each row of A by 1   … v = n^2 for each k
  up-circular-shift each column of B by 1  … v = n^2 for each k
•Total comm v = 2*n^2 + 2*s*n^2 ~ 2*sqrt(p)*n^2
•Again, a "nice" communication pattern
•In
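The skew-then-shift schedule above can be simulated sequentially to check both the result and the volume count. The following is a minimal sketch (my own illustration, not the lecture's code) that runs Cannon's algorithm on an s x s grid with one matrix entry per processor (block size 1, so n = s), counting communication volume in words:

```python
def cannon(A, B):
    """Multiply s x s matrices (lists of lists) with Cannon's 2-D merry-go-round.

    One entry per processor, so the matrix dimension n equals the grid size s.
    Returns (C, vol) where vol is the communication volume in words.
    """
    s = len(A)
    n = s
    A = [row[:] for row in A]          # work on copies
    B = [row[:] for row in B]
    C = [[0] * s for _ in range(s)]
    vol = 0
    # Skew A: left-circular-shift row i by i, so A(i,j) <- A(i, (j+i) mod s).
    for i in range(s):
        A[i] = A[i][i:] + A[i][:i]
        vol += n * n // s              # each row shift counted as n^2/s words
    # Skew B: up-circular-shift column j by j, so B(i,j) <- B((i+j) mod s, j).
    for j in range(s):
        col = [B[i][j] for i in range(s)]
        col = col[j:] + col[:j]
        for i in range(s):
            B[i][j] = col[i]
        vol += n * n // s
    for _ in range(s):
        for i in range(s):             # all processors multiply in parallel
            for j in range(s):
                C[i][j] += A[i][j] * B[i][j]
        for i in range(s):             # shift every row of A left by 1
            A[i] = A[i][1:] + A[i][:1]
        B = B[1:] + B[:1]              # shift every column of B up by 1
        vol += 2 * n * n               # all of A and all of B move each step
    return C, vol
```

With s = 3 this reproduces the slide's total, vol = 2*n^2 + 2*s*n^2, and C matches the ordinary triple-loop product.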
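The 1-D ring ("merry-go-round") algorithm from the earlier slides can be sketched the same way. This is a sequential simulation of my own (the name `ring_matmul` and the one-column-per-processor setup are assumptions for illustration), with n = p so each processor owns exactly one column of A, B, and C; the rotation of MGR plays the role of the send/receive around the ring:

```python
def ring_matmul(A, B):
    """Compute C = A*B for n x n lists of lists via the 1-D ring algorithm.

    One block column per processor (n = p). Returns (C, vol) where vol is
    the communication volume in words.
    """
    n = p = len(A)
    C = [[0] * n for _ in range(n)]
    # MGR[i] = the column of A currently held by processor i (initially its own).
    MGR = [[A[r][i] for r in range(n)] for i in range(p)]
    vol = 0
    for j in range(p):
        for i in range(p):             # all processors compute in parallel
            k = (i - j) % p            # after j shifts, proc i holds column k of A
            for r in range(n):
                C[r][i] += MGR[i][r] * B[k][i]   # B(k,i) is local to proc i
        if j < p - 1:                  # pass MGR one step around the ring
            MGR = MGR[-1:] + MGR[:-1]  # proc i receives from proc i-1
            vol += n * n               # every processor sends n^2/p words
    return C, vol
```

The volume comes out to (p-1) * n^2, matching the slide's count for the p-1 ring iterations.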