Contents
• Graph Expansion and Communication Costs of Fast Matrix Multiplication (title)
• Results by
• Motivation
• Motivation: expected bottleneck
• Lower bounds: for algorithms with the "flavor" of 3 nested loops
• Beyond 3-nested loops
• Recall: Strassen's Fast Matrix Multiplication
• Strassen-like algorithms
• New lower bound for Strassen's fast matrix multiplication
• Sequential and new parallel Strassen-like algorithms
• Communication Lower Bounds
• Expansion (3rd approach)
• The DAG of Strassen, n = 2
• The DAG of Strassen, n = 4
• The DAG of Strassen: further recursive steps
• Expansion of a Segment
• Open Problems
• Extra slides: Upper bounds - Supporting Data Structures; Applications; Is it a Good Expander?; Estimating the edge expansion - Combinatorially; Estimating the BW - by Spectral-Gap; The DAG of Strassen; Reduction (1st approach) [Ballard, Demmel, Holtz, S. 2009a]; Dense Linear Algebra: Sequential Model; Dense 2D parallel algorithms; New 2.5D parallel algorithms: Matrix multiplication and LU decomposition


Graph Expansion and Communication Costs of Fast Matrix Multiplication
Oded Schwartz¹
Joint work with: Grey Ballard¹, James Demmel¹, Olga Holtz¹,²
1. UC-Berkeley   2. TU-Berlin
23rd ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), San Jose, California, June 4-6, 2011


Results by:
• Grey Ballard, UCB
• Aydin Buluc, LBL
• Erin Carson, UCB
• James Demmel, UCB
• Jack Dongarra, UTK
• Ioana Dumitriu, U. Washington
• Andrew Gearhart, UCB
• Laura Grigori, INRIA
• Ming Gu, UCB Math
• Mark Hoemmen, Sandia NL
• Olga Holtz, UCB Math & TU Berlin
• Nicholas Knight, UCB
• Julien Langou, U. Colorado Denver
• Eran Rom, IBM Tel-Aviv
• Edgar Solomonik, UCB
Many others…


Motivation
Two kinds of costs:
• Arithmetic (FLOPs)
• Communication: moving data between
  • levels of a memory hierarchy (sequential case)
  • processors over a network connecting them (parallel case)
Communication-avoiding algorithms save time and save energy.
[Figures: sequential model (CPU - Cache - RAM), memory hierarchy (M1, M2, M3, …, Mk), parallel model (CPU/RAM nodes connected by a network)]


Motivation: expected bottleneck
Annual hardware improvements: exponential growth with large gaps [Graham, Snir, Patterson 04], [Fuller, Millett 10]
• CPU time per flop:          improves 59%/year
• Bandwidth (time per word):  DRAM 23%/year, network 26%/year
• Latency (time per message): DRAM  5%/year, network 15%/year
So communication costs improve far more slowly than arithmetic.


Outline
Algorithms with the "flavor" of 3 nested loops [Ballard, Demmel, Holtz, S. 2009], [Ballard, Demmel, Holtz, S. 2011a], extending [Hong & Kung 81], [Irony, Toledo, Tiskin 04]:
• Lower bounds: sequential, hierarchy, parallel
• Algorithms: sequential, parallel (many contributions, mostly new)
Strassen-like algorithms (this work, [Ballard, Demmel, Holtz, Rom, S. 2011]):
• Lower bounds: sequential, hierarchy, parallel
• Algorithms: sequential, parallel


Lower bounds: for algorithms with the "flavor" of 3 nested loops
Matrix multiplication:
• [Hong & Kung 81], sequential:                  BW = Ω(n³/√M)
• [Irony, Toledo, Tiskin 04], sequential and parallel: BW = Ω(n³/(P·√M))


Lower bounds: for algorithms with the "flavor" of 3 nested loops [Ballard, Demmel, Holtz, S. 2009], [Ballard, Demmel, Holtz, S. 2011a], following [Irony, Toledo, Tiskin 04]
The same bounds, BW = Ω(n³/√M) sequentially and Ω(n³/(P·√M)) in parallel, extend to:
• BLAS, LU, Cholesky, LDL^T, and QR factorizations, eigenvalues and singular values, i.e., essentially all direct methods of linear algebra
• Dense or sparse matrices (in sparse cases, the bandwidth bound is a function of NNZ)
• Bandwidth and latency
• Sequential, hierarchical, and parallel (distributed- and shared-memory) models
• Compositions of linear algebra operations
• Certain graph optimization problems [Demmel, Pearson, Poloni, Van Loan 11]
• Tensor contractions


Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these bounds? Mostly not.
Are there other algorithms that do? Mostly yes.


Motivation: a few example speedups, measured and predicted
Measured: parallel TSQR [Demmel, Ballard, Hoemmen, Grigori, Langou, Dongarra, Anderson 10], [Anderson, Ballard, Demmel, Keutzer 10]
• Intel Clovertown: up to 8x speedup (8 cores, dual socket, 10M x 10)
• Pentium III cluster, Dolphin Interconnect, MPICH: up to 6.7x speedup (16 procs, 100K x 200)
• BlueGene/L: up to 4x speedup (32 procs, 1M x 50)
• Tesla C2050 / Fermi: up to 13x (110,592 x 100)
Predicted: parallel 2.5D LU [Bhatele, Demmel, Solomonik 11]
• Exascale: up to 4.5x speedup (2^18 nodes, 2^22 x 2^22 matrix)


Beyond 3-nested loops
How about the communication costs of algorithms that have a more complex structure?


Recall: Strassen's Fast Matrix Multiplication [Strassen 69]
• Compute a 2 x 2 matrix multiplication using only 7 multiplications (instead of 8).
• Apply recursively (block-wise), splitting each n x n matrix into four n/2 x n/2 blocks:
M1 = (A11 + A22)(B11 + B22)
M2 = (A21 + A22) B11
M3 = A11 (B12 - B22)
M4 = A22 (B21 - B11)
M5 = (A11 + A12) B22
M6 = (A21 - A11)(B11 + B12)
M7 = (A12 - A22)(B21 + B22)

C11 = M1 + M4 - M5 + M7
C12 = M3 + M5
C21 = M2 + M4
C22 = M1 - M2 + M3 + M6

flops(n) = 7·flops(n/2) + O(n²), so flops(n) = Θ(n^{log₂ 7})


Strassen-like algorithms
• Compute an n₀ x n₀ matrix multiplication using only n₀^{ω₀} multiplications (instead of n₀³).
• Apply recursively (block-wise).
flops(n) = n₀^{ω₀}·flops(n/n₀) + O(n²), so flops(n) = Θ(n^{ω₀})

ω₀:
• 2.81 [Strassen 69] (works fast in practice)
• 2.79 [Pan 78]
• 2.78 [Bini 79]
• 2.55 [Schönhage 81]
• 2.50 [Pan, Romani; Coppersmith, Winograd 84]
• 2.48 [Strassen 87]
• 2.38 [Coppersmith, Winograd 90]
• 2.38 [Cohn, Kleinberg, Szegedy, Umans 05] (group-theoretic approach)


New lower bound for Strassen's fast matrix multiplication
Main result: the communication bandwidth lower bounds are

                  sequential                    parallel
For Strassen's:   BW = Ω((n/√M)^{log₂ 7}·M)    BW = Ω((n/√M)^{log₂ 7}·M/P)
Strassen-like:    BW = Ω((n/√M)^{ω₀}·M)        BW = Ω((n/√M)^{ω₀}·M/P)
Recall for cubic: BW = Ω((n/√M)^{log₂ 8}·M)    BW = Ω((n/√M)^{log₂ 8}·M/P)

The parallel lower bounds apply to
• one copy of the data: M = Θ(n²/P)
• c copies of the data: M = Θ(c·n²/P)

Are the new bounds attained?
For sequential? hierarchy? Yes, existing implementations do!
For parallel? Yes, we think so.


Sequential and new parallel Strassen-like algorithms
Sequential and hierarchy cases: attained by the
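As a concrete illustration of the 2 x 2 scheme above (a minimal sketch, not code from the talk; the function names are hypothetical), one level of Strassen's recursion can be written directly from the M1..M7 and C11..C22 formulas:

```python
def strassen_2x2(A, B):
    """One level of Strassen's scheme on 2x2 matrices (lists of lists),
    using 7 multiplications instead of 8. With scalar entries this is one
    level; applied block-wise to n/2 x n/2 submatrices it recurses."""
    (a11, a12), (a21, a22) = A
    (b11, b12), (b21, b22) = B
    m1 = (a11 + a22) * (b11 + b22)
    m2 = (a21 + a22) * b11
    m3 = a11 * (b12 - b22)
    m4 = a22 * (b21 - b11)
    m5 = (a11 + a12) * b22
    m6 = (a21 - a11) * (b11 + b12)
    m7 = (a12 - a22) * (b21 + b22)
    return [[m1 + m4 - m5 + m7, m3 + m5],
            [m2 + m4, m1 - m2 + m3 + m6]]

def classical_2x2(A, B):
    """Ordinary 2x2 multiplication with 8 multiplications, for comparison."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
assert strassen_2x2(A, B) == classical_2x2(A, B)  # both give [[19, 22], [43, 50]]
```

The saving of one multiplication per level is what drives the exponent from log₂ 8 = 3 down to log₂ 7 ≈ 2.81 under recursion.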
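The recurrence flops(n) = 7·flops(n/2) + O(n²) can be sanity-checked by counting only the scalar multiplications of the fully recursive scheme: for n = 2^k the count satisfies T(n) = 7·T(n/2), T(1) = 1, giving exactly 7^k = n^{log₂ 7}. A small illustrative script (the helper name is ours, not from the slides):

```python
import math

def strassen_mults(n):
    """Scalar multiplications used by fully recursive Strassen on an
    n x n matrix, n a power of 2: T(n) = 7*T(n/2), T(1) = 1.
    (Additions, the O(n^2) term, are deliberately not counted.)"""
    return 1 if n == 1 else 7 * strassen_mults(n // 2)

# For n = 2^k the count equals 7^k = n^(log2 7), matching Theta(n^{log2 7}).
for k in range(1, 6):
    n = 2 ** k
    assert strassen_mults(n) == 7 ** k == round(n ** math.log2(7))
```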
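All three sequential bandwidth bounds above share the form Ω((n/√M)^e · M) and differ only in the exponent e. A quick numeric comparison (illustrative parameter values of our choosing, not from the slides) confirms that the classical form with e = log₂ 8 reduces to the familiar n³/√M, and that Strassen's exponent log₂ 7 yields a strictly smaller bound whenever n > √M:

```python
import math

def bw_bound(n, M, e):
    """Sequential bandwidth lower-bound expression Omega((n/sqrt(M))**e * M),
    ignoring the hidden constant; n = matrix dimension, M = fast-memory size."""
    return (n / math.sqrt(M)) ** e * M

n, M = 2 ** 12, 2 ** 16            # hypothetical problem and fast-memory sizes
cubic = bw_bound(n, M, math.log2(8))
strassen = bw_bound(n, M, math.log2(7))

assert math.isclose(cubic, n ** 3 / math.sqrt(M))  # log2(8) = 3 recovers n^3/sqrt(M)
assert strassen < cubic  # lower exponent, lower bound, since n/sqrt(M) > 1
```

A lower bound for Strassen does not follow from the cubic one: a Strassen-like algorithm performs fewer flops, so the classical argument does not apply and the Ω((n/√M)^{ω₀}·M) bound needs its own proof, which is the point of the expansion analysis in this talk.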