COSC 6374 Parallel Computation Dense Matrix Operations Edgar Gabriel Spring 2009 Edgar Gabriel Terminology Dense Matrix all elements of the matrix contain relevant values Typically stored as 2 D array e g double a 16 32 Sparse matrix most elements of the matrix are zero Optimized storage techniques Band matrices store only the relevant diagonales of the matrix Highly irregular sparse matrices store the coordinates of every non zero element together with the content Boeing Harwell format exploit certain regularities e g nearly constant number of entries per row or column Jagged Diagonal storage format see Boing Harwell format COSC 6374 Parallel Computation Edgar Gabriel 1 Replication vs Communication Large data items typically distributed across multiple processes What is large Small data items can replicated on all processes or communicated whenever required Costs for communication Costs for replication network latency memory consumption repeated computation operations COSC 6374 Parallel Computation Edgar Gabriel Matrix operations B c A Multiplying a Matrix A with a constant c Constant c is definitely small and is thus replicated on all processes E g compiled in the code Read from a configuration file Operation does not require any communication to be performed Trivially parallel Operation can be performed independent of the way the matrix has been distributed across the processes COSC 6374 Parallel Computation Edgar Gabriel 2 Matrix Operations B AT Transpose a Matrix Often not necessary since the operations e g Matrixvector multiply can be easily reformulated for MatrixTranspose vector multiply operations and avoid the data transpose Operations requiring the transpose multi dimensional FFT Assumption Matrices A B are square Element A x y should by on the same process as element B x y requires communication across the processes COSC 6374 Parallel Computation Edgar Gabriel B AT One element per process Initial data distribution one element of the Matrix A per process 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Process with coordinates x y needs to send its data item to the process with the coordinates y x and receive its data item from y x COSC 6374 Parallel Computation Edgar Gabriel 3 B AT One element per process Assumptions newcomm has been created using MPI Cart create double A B are the element of the matrices owned by each process A is already set int coords 2 my coordinates in the 2 D topology int rem coords 2 coordinates of my counterpart MPI Request req 2 MPI Status stats 2 Determine my own rank in newcomm MPI Comm rank newcomm rank Determine my own coordinates in newcomm MPI Cart coords newcomm rank ndims coords Determine the coordinates of my counterpart rem coords 0 coords 1 COSC 6374 Parallel Computation rem coords 1 coords 0 Edgar Gabriel B AT One element per process Determine the rank of my counterpart using his coordinates MPI Cart rank newcomm rem coords rem rank Initiate MPI Isend Initiate MPI Irecv non blocking communication to send A A 1 MPI DOUBLE rem coords 0 newcomm req 0 non blocking communication to receive B B 1 MPI DOUBLE rem coords 0 newcomm req 1 Wait on both non blocking operations to finish MPI Waitall 2 req stats Notes using non blocking communication avoids to have to schedule messages to avoid deadlock processes on the main diagonal send a message to themselves COSC 6374 Parallel Computation Edgar Gabriel 4 B AT Column wise data distribution One column per process rank 0 1 2 3 4 5 6 7 8 rank 0 1 2 3 4 5 6 7 8 A B Element A i needs to be sent to process i Element B i will be received from process i COSC 6374 Parallel Computation Edgar Gabriel B AT Column wise data distribution MPI Request reqs MPI Status stats int rank size double A N B N Determine the number of processes working on the problem and my rank in the communicator MPI Comm size comm size MPI Comm rank comm rank Allocate the required number of Requests and Statuses Since the code is supposed to work for arbitrary numbers of processors you can not use static arrays for reqs and stats reqs MPI Request malloc 2 size sizeof MPI Request stats MPI Status malloc 2 size sizeof MPI Status COSC 6374 Parallel Computation Edgar Gabriel 5 B AT Column wise data distribution Start now all non blocking communication operations for i 0 i size i MPI Isend A i 1 MPI DOUBLE i 0 comm reqs 2 i MPI Irecv B i 1 MPI DOUBLE i 0 comm reqs 2 i 1 Wait for all non blocking operations to finish MPI Waitall 2 size reqs stats Notes identical approach and code for row wise data distribution as long as the local portions of both A and B are stored as one dimensional arrays number of messages N2 np2 COSC 6374 Parallel Computation Edgar Gabriel B AT Block column wise data distribution Each process holds Nlocal columns of each matrix with np 1 local i 0 assuming N can be divided evenly onto np processes rank 0 1 A 2 rank 0 1 2 B COSC 6374 Parallel Computation Edgar Gabriel 6 B AT Block column wise data distribution Element A i j has to become element B j i assuming i j are global indexes Variable declarations on each process double A N Nlocal double B N Nlocal A i j is located on the process with the rank r j Nlocal has the local indexes A i1 j1 with i1 i and j1 j Nlocal B j i is located on the process with the rank s i Nlocal has the local indexes B j2 i2 with j2 j and COSC 6374 Parallel Computation i2 i Nlocal Edgar Gabriel B AT Block column wise data distribution code fragment for the communication for j1 0 j1 Nlocal j1 for i 0 i N i dest i Nlocal MPI Isend A i j1 1 MPI DOUBLE dest 0 comm reqs for j 0 j N j for i2 0 i2 Nlocal i2 src j Nlocal MPI Irecv B j i2 1 MPI DOUBLE src 0 comm reqs COSC 6374 Parallel Computation Edgar Gabriel 7 B AT Block column wise data distribution The algorithm on the previous slide is good because it doesn t require any additional temporary storage The algorithm on the previous slide is bad because it sends N2 messages with N np costs of each message is proportional to the network latency for short messages Matrix A has to be traversed in a non contiguous manner C stores multi dimensional arrays in row major order accessing A 0 0 than A 1 0 means that we jump in the main memory and have a large number of cache misses COSC 6374 Parallel Computation Edgar Gabriel Memory layout of multi dimensional arrays E g 2 D matrix Memory layout in C Memory layout in Fortran COSC 6374 Parallel Computation Edgar Gabriel 8 B AT Block column wise data distribution Alternative algorithm each process sends in reality
View Full Document
Unlocking...