Graph Expansion and Communication Costs of Fast Matrix Multiplication

Graph Expansion and Communication Costs of Fast Matrix Multiplication
23rd ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), San Jose, California, June 4-6, 2011
Oded Schwartz (UC Berkeley)
Joint work with: Grey Ballard (UC Berkeley), James Demmel (UC Berkeley), Olga Holtz (UC Berkeley and TU Berlin)

Results by:
•Grey Ballard, UCB
•Aydin Buluc, LBL
•Erin Carson, UCB
•James Demmel, UCB
•Jack Dongarra, UTK
•Ioana Dumitriu, U. Washington
•Andrew Gearhart, UCB
•Laura Grigori, INRIA
•Ming Gu, UCB Math
•Mark Hoemmen, Sandia NL
•Olga Holtz, UCB Math & TU Berlin
•Nicholas Knight, UCB
•Julien Langou, U. Colorado Denver
•Eran Rom, IBM Tel-Aviv
•Edgar Solomonik, UCB
•Many others…

Motivation
Two kinds of costs:
•Arithmetic (FLOPs)
•Communication: moving data between levels of a memory hierarchy (sequential case) or over a network connecting processors (parallel case)
Communication-avoiding algorithms: save time, save energy.
[Figure: parallel model - CPUs with local RAM connected by a network; sequential model - CPU, cache, RAM; memory hierarchy - M1, M2, M3, …, Mk]

Motivation: expected bottleneck
Annual hardware improvements show exponential growth with large gaps [Graham, Snir, Patterson 04], [Fuller, Millett 10]:
•CPU (msec/flop): 59%
•Bandwidth (msec/word): DRAM 23%, network 26%
•Latency (msec/message): DRAM 5%, network 15%

Outline
Algorithms with the "flavor" of 3 nested loops:
•Lower bounds: sequential, hierarchy, parallel - [Ballard, Demmel, Holtz, S. 2009], [Ballard, Demmel, Holtz, S. 2011a], extending [Hong & Kung 81], [Irony, Toledo, Tiskin 04]
•Algorithms: sequential, parallel
Strassen-like algorithms (many contributions, mostly new):
•Lower bounds: sequential, hierarchy, parallel - this work, [Ballard, Demmel, Holtz, Rom, S. 2011]
•Algorithms: sequential, parallel

Lower bounds for algorithms with the "flavor" of 3 nested loops
Matrix multiplication:
•[Hong & Kung 81] - sequential: BW = Ω((n/√M)^3 · M)
•[Irony, Toledo, Tiskin 04] - sequential and parallel: BW = Ω((n/√M)^3 · M/P)

[Ballard, Demmel, Holtz, S. 2009] and [Ballard, Demmel, Holtz, S. 2011a], following [Irony, Toledo, Tiskin 04], extend these bounds to:
•BLAS, LU, Cholesky, LDL^T, and QR factorizations, eigenvalues and singular values - i.e., essentially all direct methods of linear algebra
•Dense or sparse matrices (in sparse cases, bandwidth is a function of NNZ)
•Bandwidth and latency
•Sequential, hierarchical, and parallel models - distributed and shared memory
•Compositions of linear algebra operations
•Certain graph optimization problems [Demmel, Pearson, Poloni, Van Loan 11]
•Tensor contractions

Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these bounds? Mostly not.
Are there other algorithms that do? Mostly yes.

Motivation: a few example speedups, measured and predicted
Measured - parallel TSQR:
•Intel Clovertown: up to 8x speedup (8 cores, dual socket, 10M x 10)
•Pentium III cluster, Dolphin Interconnect, MPICH: up to 6.7x speedup (16 procs, 100K x 200)
•BlueGene/L: up to 4x speedup (32 procs, 1M x 50)
•Tesla C2050 / Fermi: up to 13x (110,592 x 100)
Predicted - parallel 2.5D LU:
•Exascale: up to 4.5x speedup (2^18 nodes, 2^22 x 2^22)
[Demmel, Ballard, Hoemmen, Grigori, Langou, Dongarra, Anderson 10], [Anderson, Ballard, Demmel, Keutzer 10], [Bhatele, Demmel, Solomonik 11]

Beyond 3 nested loops
What are the communication costs of algorithms that have a more complex structure?

Recall: Strassen's fast matrix multiplication [Strassen 69]
•Compute a 2 x 2 matrix multiplication using only 7 multiplications (instead of 8).
•Apply recursively (block-wise), splitting A, B, and C into n/2 x n/2 blocks:
M1 = (A11 + A22) · (B11 + B22)
M2 = (A21 + A22) · B11
M3 = A11 · (B12 - B22)
M4 = A22 · (B21 - B11)
M5 = (A11 + A12) · B22
M6 = (A21 - A11) · (B11 + B12)
M7 = (A12 - A22) · (B21 + B22)
C11 = M1 + M4 - M5 + M7
C12 = M3 + M5
C21 = M2 + M4
C22 = M1 - M2 + M3 + M6
flops(n) = 7·flops(n/2) + O(n^2), so flops(n) = Θ(n^(log2 7)).

Strassen-like algorithms
•Compute an n0 x n0 matrix multiplication using only n0^ω0 multiplications (instead of n0^3), where 2 < ω0 ≤ 3.
•Apply recursively (block-wise): flops(n) = n0^ω0 · flops(n/n0) + O(n^2), so flops(n) = Θ(n^ω0).
Known exponents ω0:
•2.81 [Strassen 69] - works fast in practice
•2.79 [Pan 78]
•2.78 [Bini 79]
•2.55 [Schönhage 81]
•2.50 [Pan Romani, Coppersmith Winograd 84]
•2.48 [Strassen 87]
•2.38 [Coppersmith Winograd 90]
•2.38 [Cohn Kleinberg Szegedy Umans 05] - group-theoretic approach

New lower bound for Strassen's fast matrix multiplication
Main result - communication bandwidth lower bounds:
•For Strassen's: BW = Ω((n/√M)^(log2 7) · M) (sequential), BW = Ω((n/√M)^(log2 7) · M/P) (parallel)
•Strassen-like: BW = Ω((n/√M)^ω0 · M) (sequential), BW = Ω((n/√M)^ω0 · M/P) (parallel)
•Recall, for cubic: BW = Ω((n/√M)^(log2 8) · M) (sequential), BW = Ω((n/√M)^(log2 8) · M/P) (parallel)
The parallel lower bounds apply to:
•one copy of the data: M = Θ(n^2/P)
•c copies of the data: M = Θ(c·n^2/P)

Are these bounds attained?
•For sequential? Hierarchy? Yes, existing implementations do!
•For parallel? Yes, we think so.

Sequential and new parallel Strassen-like algorithms
•Sequential and hierarchy cases: attained by the
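The 7-multiplication recursion on the Strassen slide can be sketched directly in code. The following is a minimal illustration using plain Python lists, assuming square matrices whose dimension is a power of two; practical implementations pad odd dimensions and switch to the classical algorithm below a cutoff size.

```python
def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def sub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def split(A):
    # Split an n x n matrix (n a power of two) into four n/2 x n/2 blocks.
    n = len(A) // 2
    return ([r[:n] for r in A[:n]], [r[n:] for r in A[:n]],
            [r[:n] for r in A[n:]], [r[n:] for r in A[n:]])

def strassen(A, B):
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)
    # The 7 recursive products from the slide (instead of 8):
    M1 = strassen(add(A11, A22), add(B11, B22))
    M2 = strassen(add(A21, A22), B11)
    M3 = strassen(A11, sub(B12, B22))
    M4 = strassen(A22, sub(B21, B11))
    M5 = strassen(add(A11, A12), B22)
    M6 = strassen(sub(A21, A11), add(B11, B12))
    M7 = strassen(sub(A12, A22), add(B21, B22))
    # Reassemble C from the M's, exactly as on the slide.
    C11 = add(sub(add(M1, M4), M5), M7)   # M1 + M4 - M5 + M7
    C12 = add(M3, M5)                     # M3 + M5
    C21 = add(M2, M4)                     # M2 + M4
    C22 = add(sub(add(M1, M3), M2), M6)   # M1 - M2 + M3 + M6
    top = [r1 + r2 for r1, r2 in zip(C11, C12)]
    bot = [r1 + r2 for r1, r2 in zip(C21, C22)]
    return top + bot
```

For example, `strassen([[1, 2], [3, 4]], [[5, 6], [7, 8]])` returns `[[19, 22], [43, 50]]`, matching the classical product.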
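A small sanity check (not from the slides, sizes chosen only for illustration): the recurrence flops(n) = 7·flops(n/2) + O(n^2) yields exactly 7^(log2 n) = n^(log2 7) scalar multiplications, and the bandwidth lower bound Ω((n/√M)^e · M) can be evaluated numerically to compare the classical exponent e = log2 8 = 3 against Strassen's e = log2 7 ≈ 2.81.

```python
import math

def strassen_mults(n):
    """Scalar multiplications used by Strassen's recursion (n a power of 2)."""
    return 1 if n == 1 else 7 * strassen_mults(n // 2)

def bw_bound(n, M, e):
    """The expression (n/sqrt(M))^e * M inside Omega(...), hidden constant omitted."""
    return (n / math.sqrt(M)) ** e * M

n, M = 2 ** 12, 2 ** 20   # hypothetical problem size and fast-memory size
print(strassen_mults(n))  # 7^12 = 13841287201, versus n^3 = 2^36 classically
# Ratio of the classical bound to the Strassen bound for this n and M:
print(bw_bound(n, M, 3) / bw_bound(n, M, math.log2(7)))
```

The gap between the two bounds grows as (n/√M)^(3 - log2 7), which is why the savings become substantial only when n is much larger than √M.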

