Berkeley COMPSCI C267 - Dense Linear Algebra: History and Structure, Parallel Matrix Multiplication

Slide titles

CS 267 Dense Linear Algebra: History and Structure, Parallel Matrix Multiplication
Quick review of earlier lecture
Outline
Motifs
What is dense linear algebra?
A brief history of (Dense) Linear Algebra software (1/7)
A brief history of (Dense) Linear Algebra software (2/7)
A brief history of (Dense) Linear Algebra software (3/7)
A brief history of (Dense) Linear Algebra software (4/7)
A brief history of (Dense) Linear Algebra software (5/7)
Success Stories for Sca/LAPACK (6/7)
A brief future look at (Dense) Linear Algebra software (7/7)
Back to basics: Why avoiding communication is important (1/2)
Why avoiding communication is important (2/2)
Review: Naïve Sequential MatMul: C = C + A*B
Less Communication with Blocked Matrix Multiply
Blocked vs Cache-Oblivious Algorithms
Communication Lower Bounds: Prior Work on Matmul
New lower bound for all “direct” linear algebra
Can we attain these lower bounds?
What could go into the linear algebra motif(s)?
For all linear algebra problems: Ex: LAPACK Table of Contents
What does A\b do? What could it do? Ex: LAPACK Table of Contents
Organizing Linear Algebra – in books
Different Parallel Data Layouts for Matrices (not all!)
Parallel Matrix-Vector Product
Matrix-Vector Product y = y + A*x
Parallel Matrix Multiply
Matrix Multiply with 1D Column Layout
Matrix Multiply: 1D Layout on Bus or Ring
MatMul: 1D layout on Bus without Broadcast
Naïve MatMul (continued)
Matmul for 1D layout on a Processor Ring
Need to try 2D Matrix layout
Summary of Parallel Matrix Multiply
SUMMA Algorithm
SUMMA uses Outer Product form of MatMul
SUMMA – n x n matmul on P^(1/2) x P^(1/2) grid
SUMMA Costs
Can we do better?
2.5D Matrix Multiplication
2.5D Matmul on IBM BG/P, n=64K
2.5D Matmul on IBM BG/P, 16K nodes / 64K cores
Parallel Strassen
Strong scaling of Matmul on Hopper (n=94080)
Proof of Communication Lower Bound on C = A·B (1/5)
Proof of Communication Lower Bound on C = A·B (2/5)
Proof of Communication Lower Bound on C = A·B (3/5)
Proof of Communication Lower Bound on C = A·B (4/5)
Proof of Communication Lower Bound on C = A·B (5/5)
Extra Slides
Recursive Layouts
Gaussian Elimination
Gaussian Elimination via a Recursive Algorithm
Recursive Factorizations
Review: BLAS 3 (Blocked) GEPP
A small software project ...
Work-Depth Model of Parallelism
Latency Bandwidth Model
Skewing Steps in Cannon
Motivation (1)
Motivation (2)
Algorithms for 2D (3D) Poisson Equation (N = n^2 (n^3) vars)
Lessons and Questions (1)
Organizing Linear Algebra (1)
Organizing Linear Algebra (2)
For all data types: Ex: LAPACK Table of Contents
Organizing Linear Algebra (3)
Review of the BLAS
Summary of Parallel Matrix Multiplication so far

CS 267
Dense Linear Algebra: History and Structure, Parallel Matrix Multiplication
James Demmel
www.cs.berkeley.edu/~demmel/cs267_Spr12
02/21/2012 CS267 Lecture 11

Quick review of earlier lecture
• What do you call
  • A program written in PyGAS, a Global Address Space language based on Python…
  • That uses a Monte Carlo simulation algorithm to approximate π …
  • That has a race condition, so that it gives you a different funny answer every time you run it?
Monte - π - thon

Outline
• History and motivation
• Lower bound on communication
• Structure of the Dense Linear Algebra motif
• What does A\b do?
• Parallel Matrix-matrix multiplication
• Attaining the lower bound
• Proof of the lower bound
• Parallel Gaussian Elimination (next time)

Motifs
The Motifs (formerly “Dwarfs”) from “The Berkeley View” (Asanovic et al.)
Motifs form key computational patterns

What is dense linear algebra?
• Not just matmul!
• Linear Systems: Ax = b
• Least Squares: choose x to minimize ||Ax-b||_2
  • Overdetermined or underdetermined
  • Unconstrained, constrained, weighted
• Eigenvalues and vectors of Symmetric Matrices
  • Standard (Ax = λx), Generalized (Ax = λBx)
• Eigenvalues and vectors of Unsymmetric matrices
  • Eigenvalues, Schur form, eigenvectors, invariant subspaces
  • Standard, Generalized
• Singular Values and vectors (SVD)
  • Standard, Generalized
• Different matrix structures
  • Real, complex; Symmetric, Hermitian, positive definite; dense, triangular, banded …
• Level of detail
  • Simple Driver (“x=A\b”)
  • Expert Drivers with error bounds, extra-precision, other options
  • Lower level routines (“apply certain kind of orthogonal transformation”, matmul…)

A brief history of (Dense) Linear Algebra software (1/7)
• In the beginning was the do-loop…
  • Libraries like EISPACK (for eigenvalue problems)
• Then the BLAS (1) were invented (1973-1977)
  • Standard library of 15 operations (mostly) on vectors
    • “AXPY” ( y = α·x + y ), dot product, scale ( x = α·x ), etc
    • Up to 4 versions of each (S/D/C/Z), 46 routines, 3300 LOC
  • Goals
    • Common “pattern” to ease programming, readability
    • Robustness, via careful coding (avoiding over/underflow)
    • Portability + Efficiency via machine specific implementations
  • Why BLAS 1? They do O(n^1) ops on O(n^1) data
  • Used in libraries like LINPACK (for linear systems)
    • Source of the name “LINPACK Benchmark” (not the code!)

Current Records for Solving Dense Systems (11/2011)
• Linpack Benchmark
• Fastest machine overall (www.top500.org)
  • Fujitsu K-Computer (RIKEN Institute, Japan)
  • 10.5 Petaflops out of 11.3 Petaflops peak
  • n = 11.9M, 29.5 hours to run
  • 705K cores, 12.7 MW of power
• Historical data (www.netlib.org/performance)
  • Palm Pilot III
  • 1.69 Kiloflops
  • n = 100

A brief history of (Dense) Linear Algebra software (2/7)
• But the BLAS-1 weren’t enough
  • Consider AXPY ( y = α·x + y ): 2n flops on 3n read/writes
  • Computational intensity = (2n)/(3n) = 2/3
  • Too low to run near peak speed (read/write dominates)
  • Hard to vectorize (“SIMD’ize”) on supercomputers
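
The AXPY counting argument above can be checked with a short sketch. This is a plain-Python illustration, not the BLAS interface: the names `axpy` and `axpy_intensity` are hypothetical, and the memory count assumes that every element of x and y is read from memory once and every element of y is written back once, with no cache reuse.

```python
from fractions import Fraction


def axpy(alpha, x, y):
    """Update y in place with y = alpha*x + y and return it."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    for i in range(len(x)):
        # One multiply and one add; reads x[i] and y[i], writes y[i].
        y[i] = alpha * x[i] + y[i]
    return y


def axpy_intensity(n):
    """Flops per memory access for AXPY on vectors of length n (n > 0)."""
    if n <= 0:
        raise ValueError("n must be positive")
    flops = 2 * n    # one multiply and one add per element
    mem_ops = 3 * n  # assumed: read x[i], read y[i], write y[i]
    return Fraction(flops, mem_ops)


if __name__ == "__main__":
    print(axpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))
    print(axpy_intensity(1000))
```

Under these assumptions the ratio does not depend on n, so a longer vector gives no more arithmetic per memory access. That is the sense in which read/write traffic dominates for BLAS-1 operations.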

