# Graph Expansion and Communication Costs of Fast Matrix Multiplication

**View the full content.**View Full Document

0 0 16 views

**Unformatted text preview:**

23rd ACM Symposium on Parallelism in Algorithms and Architectures San Jose California June 4 6 2011 Graph Expansion and Communication Costs of Fast Matrix Multiplication Oded Schwartz1 Joint work with Grey Ballard1 James Demmel1 Olga Holtz1 2 1 UC Berkeley 2 TU Berlin 1 Results by Grey Ballard UCB Aydin Buluc LBL Erin Carson UCB James Demmel UCB Jack Dongarra UTK Ioana Dumitriu U Washington Andrew Gearhart UCB Laura Grigori INRIA Ming Gu UCB Math Mark Hoemmen Sandia NL Olga Holtz UCB Math TU Berlin Nicholas Knight UCB Julien Langou U Colorado Denver Eran Rom IBM Tel Aviv Edgar Solomonik UCB Many others 2 Motivation Two kinds of costs Arithmetic FLOPs Communication moving data between levels of a memory hierarchy sequential case over a network connecting processors parallel case Communication avoiding algorithms Save time save energy Sequential Hierarchy CPU Cache M1 M2 Parallel CPU RAM CPU RAM CPU RAM CPU RAM M3 RAM Mk 3 Motivation expected bottleneck Annual hardware improvements Exponential growth with large gaps Graham Snir Patterson 04 Fuller Millett 10 CPU msec flop 59 DRAM Network Bandwidth msec word 23 26 Latency msec message 5 15 Sequential Hierarchy CPU Cache M1 M2 Parallel CPU RAM CPU RAM CPU RAM CPU RAM M3 RAM Mk 4 Outline Algorithms with flavor of 3 nested loops Lower bounds Sequential Hierarchy Parallel Ballard Demmel Holtz S 2009 Ballard Demmel Holtz S 2011a extending Hong Kung 81 Irony Toledo Tiskin 04 Algorithms Sequential Parallel Many contributions mostly new Strassen like algorithms Lower bounds Sequential Hierarchy Parallel This work Algorithms Sequential Parallel Ballard Demmel Holtz Rom S 2011 5 Lower bounds for algorithms with flavor of 3 nested loops Matrix Multiplication Hong Kung 81 Sequential n 3 M M Irony Toledo Tiskin 04 Sequential and parallel n 3 M M P 6 Lower bounds for algorithms with flavor of 3 nested loops Ballard Demmel Holtz S 2009 Ballard Demmel Holtz S 2011a Following Irony Toledo Tiskin 04 n 3 M M BLAS LU Cholesky LDLT and QR factorizations eigenvalues and singular values i e essentially all direct methods of linear algebra Dense or sparse matrices In sparse cases bandwidth is a function NNZ Bandwidth and latency Sequential hierarchical and parallel distributed and shared memory models Compositions of linear algebra operations Certain graph optimization problems 3 n M M P Demmel Pearson Poloni Van Loan 11 Tensor contraction 7 Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these bounds Mostly not Are there other algorithms that do Mostly yes 8 Motivation a few example speedups Measured and Predicted Demmel Ballard Hoemmen Grigori Langou Dongarra Anderson 10 Anderson Ballard Demmel Keutzer 10 Bhatele Demmel Solomonik 11 Measured Parallel TSQR Intel Clovertown Up to 8x speedup 8 core dual socket 10M x 10 Pentium III cluster Dolphin Interconnect MPICH Up to 6 7x speedup 16 procs 100K x 200 BlueGene L Up to 4x speedup 32 procs 1M x 50 Tesla C 2050 Fermi Up to 13x 110 592 x 100 Predicted Parallel 2 5D LU Exascale Up to 4 5x speedup 218 nodes 222x222 9 Beyond 3 nested loops How about the communication costs of algorithms that have a more complex structure 10 Recall Strassen s Fast Matrix Multiplication Strassen 69 Compute 2 x 2 matrix multiplication using only 7 multiplications instead of 8 Apply recursively block wise M1 A11 A22 B11 B22 M2 A21 A22 B11 M3 A11 B12 B22 M4 A22 B21 B11 M5 A11 A12 B22 M6 A21 A11 B11 B12 M7 A12 A22 B21 B22 C11 M1 M4 M5 M7 C12 M3 M5 C21 M2 M4 C22 M1 M2 M3 M6 n 2 n 2 C11 C21 C12 C22 A11 A12 B11 B12 A21 A22 B21 B22 flops n 7 flops n 2 O n2 flops n nlog 7 2 11 Strassen like algorithms Compute n0 x n0 matrix multiplication n n0 using only n0 multiplications instead of n03 Apply recursively block wise 0 2 81 Strassen 69 works fast in practice 2 79 Pan 78 flops n n0 0 flops n n0 2 78 Bini 79 flops n n 0 2 55 Sch nhage 81 2 50 Pan Romani Coppersmith Winograd 84 2 48 Strassen 87 2 38 Coppersmith Winograd 90 2 38 Cohn Kleinberg Szegedy Umans 05 Group theoretic approach 0 O n2 12 New lower bound for Strassen s fast matrix multiplication Main Result The communication bandwidth lower bound is For Strassen s Strassen like log2 77 n log M M n 0 M M log2 88 n log M M n log 2 7 M M P n 0 M M P n log 2 8 M M P 2 sequential parallel Recall for cubic 0 2 The parallel lower bounds applies to one copy of the data M n2 P c copies of the data M c n2 P 13 For sequential hierarchy Yes existing implementations do For parallel Yes we think so 14 Sequential and new parallel Strassen like algorithms Sequential and Hierarchy cases Attained by the natural recursive implementation Also LU QR Black box use of fast matrix multiplication Ballard Demmel Holtz S Rom 2011 New parallel Strassen like algorithm Attains the lower bound we think This work This is as good as it gets 15 Communication Lower Bounds Proving that your algorithm implementation is as good as it gets Approaches 1 Reduction Ballard Demmel Holtz S 2009 2 Geometric Embedding Irony Toledo Tiskin 04 Ballard Demmel Holtz S 2011a 3 Graph Analysis Hong Kung 81 This work 16 Expansion 3rd approach Ballard Demmel Holtz S 2011b in the spirit of Hong Kung 81 S V S Let G V E be a d regular graph Edge expansion h min E S S S S Small set edge expansion V 2 dS ht min S S t E S S dS 17 Expansion 3rd approach The Computation Directed Acyclic Graph Input Output Intermediate value Dependency S WS S subset of computation RS reads V V S S RS WS writes Communication cost is Graph expansion 18 What is the CDAG of Strassen s algorithm 19 The DAG of Strassen n 2 Dec1C M1 A11 A22 B11 B22 M2 A21 A22 B11 M3 A11 B12 B22 M4 A22 B21 B11 M5 A11 A12 B22 M6 A21 A11 B11 B12 M7 A12 A22 B21 B22 7 5 4 1 3 2 6 C11 M1 M4 M5 M7 C12 M3 M5 C21 M2 M4 C22 M1 M2 M3 M6 1 1 1 2 2 1 2 2 1 1 1 2 2 1 2 2 Enc1A 1 1 1 2 2 1 2 2 Enc1B 20 The DAG of Strassen n 4 Dec1C One recursive level 1 1 1 2 2 1 2 2 Each vertex splits into four Multiply blocks 7 5 4 1 3 2 6 Dec1C Enc1A Enc1 B Enc1A Enc1B 21 The DAG of Strassen further recursive steps n2 Dec1C Declg nC n Enclg nA lg n 1 1 1 2 2 1 2 2 Enclg n B n2 Recursive construction Given DeciC Construct Deci 1C 1 Duplicate 4 times 2 Connect with a cross layer of Dec1C 22 Expansion of a Segment Main technical challenges Dec1C Two types of …