COSC 6374 Parallel Computation Dense Matrix Operations Edgar Gabriel Fall 2010 Edgar Gabriel Terminology Dense Matrix all elements of the matrix contain relevant values Typically stored as 2 D array e g double a 16 32 Sparse matrix most elements of the matrix are zero Optimized storage techniques Band matrices store only the relevant diagonals of the matrix Highly irregular sparse matrices store the coordinates of every non zero element together with the content Boeing Harwell format exploit certain regularities e g nearly constant number of entries per row or column Jagged Diagonal storage format see Boeing Harwell format Parallel Computation Edgar Gabriel 1 Replication vs Communication Large data items typically distributed across multiple processes What is large Small data items can replicated on all processes or communicated whenever required Costs for communication Costs for replication network latency memory consumption repeated computation operations Parallel Computation Edgar Gabriel Matrix operations B c A Multiplying a Matrix A with a constant c Constant c is definitely small and is thus replicated on all processes E g compiled in the code Read from a configuration file Operation does not require any communication to be performed Trivially parallel Operation can be performed independent of the way the matrix has been distributed across the processes Parallel Computation Edgar Gabriel 2 Matrix Operations B AT Transpose a Matrix Often not necessary since the operations e g Matrixvector multiply can be easily reformulated for MatrixTranspose vector multiply operations and avoid the data transpose Operations requiring the transpose multi dimensional FFT Assumption Matrices A B are square Element A x y should be on the same process as element B x y requires communication across the processes Parallel Computation Edgar Gabriel B AT One element per process Initial data distribution one element of the Matrix A per process 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Process with coordinates x y needs to send its data item to the process with the coordinates y x and receive its data item from y x Parallel Computation Edgar Gabriel 3 B AT One element per process Assumptions newcomm has been created using MPI Cart create double A B are the element of the matrices owned by each process A is already set int coords 2 my coordinates in the 2 D topology int rem coords 2 coordinates of my counterpart MPI Request req 2 MPI Status stats 2 Determine my own rank in newcomm MPI Comm rank newcomm rank Determine my own coordinates in newcomm MPI Cart coords newcomm rank ndims coords Determine the coordinates of my counterpart rem coords 0 coords 1 Parallel Computation rem coords 1 coords 0 Edgar Gabriel B AT One element per process Determine the rank of my counterpart using his coordinates MPI Cart rank newcomm rem coords rem rank Initiate MPI Isend Initiate MPI Irecv non blocking communication to send A A 1 MPI DOUBLE rem rank 0 newcomm req 0 non blocking communication to receive B B 1 MPI DOUBLE rem rank 0 newcomm req 1 Wait on both non blocking operations to finish MPI Waitall 2 req stats Notes using non blocking communication avoids to have to schedule messages to avoid deadlock processes on the main diagonal send a message to themselves Parallel Computation Edgar Gabriel 4 B AT Column wise data distribution One column per process rank 0 1 2 3 4 5 6 7 8 rank 0 1 2 3 4 5 6 7 8 A B Element A i needs to be sent to process i Element B i will be received from process i Parallel Computation Edgar Gabriel B AT Column wise data distribution MPI Request reqs MPI Status stats int rank size double A N B N Determine the number of processes working on the problem and my rank in the communicator MPI Comm size comm size MPI Comm rank comm rank Allocate the required number of Requests and Statuses Since the code is supposed to work for arbitrary numbers of processors you can not use static arrays for reqs and stats reqs MPI Request malloc 2 size sizeof MPI Request stats MPI Status malloc 2 size sizeof MPI Status Parallel Computation Edgar Gabriel 5 B AT Column wise data distribution Start now all non blocking communication operations for i 0 i size i MPI Isend A i 1 MPI DOUBLE i 0 comm reqs 2 i MPI Irecv B i 1 MPI DOUBLE i 0 comm reqs 2 i 1 Wait for all non blocking operations to finish MPI Waitall 2 size reqs stats Notes identical approach and code for row wise data distribution as long as the local portions of both A and B are stored as one dimensional arrays number of messages N2 np2 Parallel Computation Edgar Gabriel B AT Block column wise data distribution Each process holds Nlocal columns of each matrix with np 1 N N local i 0 assuming N can be divided evenly onto np processes rank 0 A 1 2 rank 0 1 2 B Parallel Computation Edgar Gabriel 6 B AT Block column wise data distribution Element A i j has to become element B j i assuming i j are global indexes Variable declarations on each process double A N Nlocal double B N Nlocal A i j is located on the process with the rank r j Nlocal has the local indexes A i1 j1 with i1 i and j1 j Nlocal B j i is located on the process with the rank s i Nlocal has the local indexes B j2 i2 with j2 j and Parallel Computation i2 i Nlocal Edgar Gabriel B AT Block column wise data distribution code fragment for the communication for j1 0 j1 Nlocal j1 for i 0 i N i dest i Nlocal MPI Isend A i j1 1 MPI DOUBLE dest 0 comm reqs for j 0 j N j for i2 0 i2 Nlocal i2 src j Nlocal MPI Irecv B j i2 1 MPI DOUBLE src 0 comm reqs Parallel Computation Edgar Gabriel 7 B AT Block column wise data distribution The algorithm on the previous slide is good because it doesn t require any additional temporary storage The algorithm on the previous slide is bad because it sends N2 messages with N np costs of each message is proportional to the network latency for short messages Matrix A has to be traversed in a non contiguous manner C stores multi dimensional arrays in row major order accessing A 0 0 than A 1 0 means that we jump in the main memory and have a large number of cache misses Parallel Computation Edgar Gabriel Memory layout of multi dimensional arrays E g 2 D matrix Memory layout in C Memory layout in Fortran Parallel Computation Edgar Gabriel 8 B AT Block column wise data distribution Alternative algorithm each process sends in reality Nlocal Nlocal elements to every other process send an entire block of Nlocal Nlocal elements block has to be transposed either at the sender or at the
View Full Document
Unlocking...