Unformatted text preview:

Copyright The McGraw Hill Companies Inc Permission required for reproduction or display Parallel Programming in C with MPI and OpenMP Michael J Quinn Copyright The McGraw Hill Companies Inc Permission required for reproduction or display Chapter 11 Matrix Multiplication Copyright The McGraw Hill Companies Inc Permission required for reproduction or display Outline Sequential algorithms Iterative row oriented Recursive block oriented Parallel algorithms Rowwise block striped decomposition Cannon s algorithm Copyright The McGraw Hill Companies Inc Permission required for reproduction or display Iterative Row oriented Algorithm Series of inner product dot product operations Copyright The McGraw Hill Companies Inc Permission required for reproduction or display Performance as n Increases Copyright The McGraw Hill Companies Inc Permission required for reproduction or display Reason Matrix B Gets Too Big for Cache Computing a row of C requires accessing every element of B Copyright The McGraw Hill Companies Inc Permission required for reproduction or display Block Matrix Multiplication Replace scalar multiplication with matrix multiplication Replace scalar addition with matrix addition Copyright The McGraw Hill Companies Inc Permission required for reproduction or display Recurse Until B Small Enough Copyright The McGraw Hill Companies Inc Permission required for reproduction or display Comparing Sequential Performance Copyright The McGraw Hill Companies Inc Permission required for reproduction or display First Parallel Algorithm Partitioning Divide matrices into rows Each primitive task has corresponding rows of three matrices Communication Each task must eventually see every row of B Organize tasks into a ring Copyright The McGraw Hill Companies Inc Permission required for reproduction or display First Parallel Algorithm cont Agglomeration and mapping Fixed number of tasks each requiring same amount of computation Regular communication among tasks Strategy Assign each process a contiguous group of rows Copyright The McGraw Hill Companies Inc Permission required for reproduction or display Communication of B A B C A B A A B A C A C A B C A Copyright The McGraw Hill Companies Inc Permission required for reproduction or display Communication of B A B C A B A A B A C A C A B C A Copyright The McGraw Hill Companies Inc Permission required for reproduction or display Communication of B A B C A B A A B A C A C A B C A Copyright The McGraw Hill Companies Inc Permission required for reproduction or display Communication of B A B C A B A A B A C A C A B C A Copyright The McGraw Hill Companies Inc Permission required for reproduction or display Complexity Analysis Algorithm has p iterations During each iteration a process multiplies n p n p block of A by n p n block of B n3 p2 Total computation time n3 p Each process ends up passing p 1 n2 p n2 elements of B Copyright The McGraw Hill Companies Inc Permission required for reproduction or display Isoefficiency Analysis Sequential algorithm n3 Parallel overhead pn2 Isoefficiency relation n3 Cpn2 n Cp 2 2 2 M Cp p C p p C p This system does not have good scalability Copyright The McGraw Hill Companies Inc Permission required for reproduction or display Weakness of Algorithm 1 Blocks of B being manipulated have p times more columns than rows Each process must access every element of matrix B Ratio of computations per communication is poor only 2n p Copyright The McGraw Hill Companies Inc Permission required for reproduction or display Parallel Algorithm 2 Cannon s Algorithm Associate a primitive task with each matrix element Agglomerate tasks responsible for a square or nearly square block of C Computation to communication ratio rises to n p Copyright The McGraw Hill Companies Inc Permission required for reproduction or display Elements of A and B Needed to Compute a Process s Portion of C Algorithm 1 Cannon s Algorithm Copyright The McGraw Hill Companies Inc Permission required for reproduction or display Blocks Must Be Aligned Before After Copyright The McGraw Hill Companies Inc Permission required for reproduction or display Blocks Need to Be Aligned Each triangle represents a matrix block B00 A00 B10 A10 Only same color triangles should be multiplied B20 A20 B30 A30 B01 A01 B11 A11 B21 A21 B31 A31 B02 A02 B12 A12 B22 A22 B32 A32 B03 A03 B13 A13 B23 A23 B33 A33 Copyright The McGraw Hill Companies Inc Permission required for reproduction or display Rearrange Blocks B00 A00 B10 A11 B20 A22 B30 A33 B11 A01 B21 A12 B31 A23 B01 A30 B22 A02 B33 A03 B32 B03 A13 A10 B02 A20 B12 A31 B13 A21 B23 A32 Block Aij cycles left i positions Block Bij cycles up j positions Copyright The McGraw Hill Companies Inc Permission required for reproduction or display Consider Process P1 2 B22 A11 A12 B32 A13 A10 B02 B12 Step 1 Copyright The McGraw Hill Companies Inc Permission required for reproduction or display Consider Process P1 2 B32 A12 A13 B02 A10 A11 B12 B22 Step 2 Copyright The McGraw Hill Companies Inc Permission required for reproduction or display Consider Process P1 2 B02 A13 A10 B12 A11 A12 B22 B32 Step 3 Copyright The McGraw Hill Companies Inc Permission required for reproduction or display Consider Process P1 2 B12 A10 A11 B22 A12 A13 B32 B02 Step 4 Copyright The McGraw Hill Companies Inc Permission required for reproduction or display Complexity Analysis Algorithm has p iterations During each iteration process multiplies two n p n p matrices n3 p 3 2 Computational complexity n3 p During each iteration process sends and receives two blocks of size n p n p Communication complexity n2 p Copyright The McGraw Hill Companies Inc Permission required for reproduction or display Isoefficiency Analysis Sequential algorithm n3 Parallel overhead pn2 Isoefficiency relation n3 C pn2 n C p 2 M C p p C p p C This system is highly scalable 2 Copyright The McGraw Hill Companies Inc Permission required for reproduction or display Summary Considered two sequential algorithms Iterative row oriented algorithm Recursive block oriented algorithm Second has better cache hit rate as n increases Developed two parallel algorithms First based on rowwise block striped decomposition Second based on checkerboard block decomposition Second algorithm is scalable while first is not


View Full Document
Loading Unlocking...
Login

Join to view Parallel Programming in C with MPI and OpenMP and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Parallel Programming in C with MPI and OpenMP and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?