DOC PREVIEW
Berkeley COMPSCI C267 - Dense Linear Algebra: History and Structure, Parallel Matrix Multiplication

This preview shows page 1-2-3-4-5-6-7-46-47-48-49-50-51-93-94-95-96-97-98-99 out of 99 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 99 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 99 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 99 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 99 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 99 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 99 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 99 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 99 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 99 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 99 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 99 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 99 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 99 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 99 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 99 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 99 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 99 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 99 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 99 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 99 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 99 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

CS 267 Dense Linear Algebra: History and Structure, Parallel Matrix MultiplicationOutlineSlide 3MotifsA brief history of (Dense) Linear Algebra software (1/6)Slide 6A brief history of (Dense) Linear Algebra software (2/6)A brief history of (Dense) Linear Algebra software (3/6)Slide 9A brief history of (Dense) Linear Algebra software (4/6)A brief history of (Dense) Linear Algebra software (5/6)Success Stories for Sca/LAPACK (6/6)Back to basics: Why avoiding communication is important (1/2)Why avoiding communication is important (2/2)Review: Naïve Sequential MatMul: C = C + A*BLess Communication with Blocked Matrix MultiplyBlocked vs Cache-Oblivious AlgorithmsCommunication Lower Bounds: Prior Work on MatmulNew lower bound for all “direct” linear algebraCan we attain these lower bounds?A brief future look at (Dense) Linear Algebra softwareSlide 22What could go into the linear algebra motif(s)?For all linear algebra problems: Ex: LAPACK Table of ContentsFor all matrix/problem structures: Ex: LAPACK Table of ContentsSlide 26Slide 27Slide 28Slide 29Slide 30Slide 31Slide 32Organizing Linear Algebra – in booksSlide 34Different Parallel Data Layouts for Matrices (not all!)Parallel Matrix-Vector ProductMatrix-Vector Product y = y + A*xParallel Matrix MultiplyMatrix Multiply with 1D Column LayoutMatrix Multiply: 1D Layout on Bus or RingMatMul: 1D layout on Bus without BroadcastNaïve MatMul (continued)Matmul for 1D layout on a Processor RingSlide 44MatMul with 2D LayoutCannon’s AlgorithmCannon’s Matrix MultiplicationInitial Step to Skew Matrices in CannonSkewing Steps in Cannon All blocks of A must multiply all like-colored blocks of BCost of Cannon’s AlgorithmCannon’s Algorithm is “optimal”Pros and Cons of CannonSUMMA AlgorithmSUMMASlide 55SUMMA performanceSlide 57Slide 58Summary of Parallel Matrix MultiplicationSlide 60Extra SlidesRecursive LayoutsGaussian EliminationGaussian Elimination via a Recursive AlgorithmRecursive FactorizationsSlide 66Review: BLAS 3 (Blocked) GEPPSlide 68Slide 69Slide 70Slide 71Slide 72Slide 73Slide 74Slide 75Slide 76Slide 77A small software project ...Work-Depth Model of ParallelismLatency Bandwidth ModelSlide 81Skewing Steps in CannonMotivation (1)Motivation (2)Algorithms for 2D (3D) Poisson Equation (N = n2 (n3) vars)Lessons and Questions (1)Organizing Linear Algebra (1)Organizing Linear Algebra (2)Slide 89Slide 90Slide 91Slide 92Slide 93Slide 94Slide 95Slide 96For all data types: Ex: LAPACK Table of ContentsOrganizing Linear Algebra (3)Review of the BLAS3/02/2010 CS267 Lecture 131CS 267Dense Linear Algebra:History and Structure,Parallel Matrix MultiplicationJames Demmelwww.cs.berkeley.edu/~demmel/cs267_Spr103/02/2010 CS267 Lecture 132Outline•History and motivation•Structure of the Dense Linear Algebra motif•Parallel Matrix-matrix multiplication•Parallel Gaussian Elimination (next time)3/02/2010 CS267 Lecture 133Outline•History and motivation•Structure of the Dense Linear Algebra motif•Parallel Matrix-matrix multiplication•Parallel Gaussian Elimination (next time)4MotifsThe Motifs (formerly “Dwarfs”) from “The Berkeley View” (Asanovic et al.)Motifs form key computational patternsA brief history of (Dense) Linear Algebra software (1/6)•Libraries like EISPACK (for eigenvalue problems)•Then the BLAS (1) were invented (1973-1977)•Standard library of 15 operations (mostly) on vectors•“AXPY” ( y = α·x + y ), dot product, scale (x = α·x ), etc•Up to 4 versions of each (S/D/C/Z), 46 routines, 3300 LOC•Goals•Common “pattern” to ease programming, readability, self-documentation•Robustness, via careful coding (avoiding over/underflow)•Portability + Efficiency via machine specific implementations•Why BLAS 1 ? They do O(n1) ops on O(n1) data•Used in libraries like LINPACK (for linear systems)•Source of the name “LINPACK Benchmark” (not the code!)3/02/2010 CS267 Lecture 135•In the beginning was the do-loop…3/02/2010 CS267 Lecture 136Current Records for Solving Dense Systems (11/2009) GigaflopsMachine n=100 n=1000 Any n Peak “Jaguar” (ORNL) 1.76M 2.3M (1.76 Petaflops, 224K cores, 7 MW, n=5.5M) www.top500.org, www.netlib.org/performancePalm Pilot III .00000169 (1.69 Kiloflops) NEC SX 8 (data from 2005) (8 proc, 2 GHz) 75.1 128 (1 proc, 2 GHz) 2.2 15.0 16…A brief history of (Dense) Linear Algebra software (2/6)•But the BLAS-1 weren’t enough•Consider AXPY ( y = α·x + y ): 2n flops on 3n read/writes •Computational intensity = (2n)/(3n) = 2/3•Too low to run near peak speed (read/write dominates)•Hard to vectorize (“SIMD’ize”) on supercomputers of the day (1980s)•So the BLAS-2 were invented (1984-1986)•Standard library of 25 operations (mostly) on matrix/vector pairs•“GEMV”: y = α·A·x + β·x, “GER”: A = A + α·x·yT, x = T-1·x•Up to 4 versions of each (S/D/C/Z), 66 routines, 18K LOC•Why BLAS 2 ? They do O(n2) ops on O(n2) data•So computational intensity still just ~(2n2)/(n2) = 2•OK for vector machines, but not for machine with caches3/02/2010 CS267 Lecture 137A brief history of (Dense) Linear Algebra software (3/6)•The next step: BLAS-3 (1987-1988)•Standard library of 9 operations (mostly) on matrix/matrix pairs•“GEMM”: C = α·A·B + β·C, C = α·A·AT + β·C, B = T-1·B•Up to 4 versions of each (S/D/C/Z), 30 routines, 10K LOC•Why BLAS 3 ? They do O(n3) ops on O(n2) data•So computational intensity (2n3)/(4n2) = n/2 – big at last!•Good for machines with caches, other mem. hierarchy levels•How much BLAS1/2/3 code so far (all at www.netlib.org/blas)•Source: 142 routines, 31K LOC, Testing: 28K LOC•Reference (unoptimized) implementation only •Ex: 3 nested loops for GEMM•Lots more optimized code (eg Homework 1)•Motivates “automatic tuning” of the BLAS (details later)3/02/2010 CS267 Lecture 13802/25/2009 CS267 Lecture 89A brief history of (Dense) Linear Algebra software (4/6)•LAPACK – “Linear Algebra PACKage” - uses BLAS-3 (1989 – now)•Ex: Obvious way to express Gaussian Elimination (GE) is adding multiples of one row to other rows – BLAS-1•How do we reorganize GE to use BLAS-3 ? (details later)•Contents of LAPACK (summary)•Algorithms we can turn into (nearly) 100% BLAS 3–Linear Systems: solve Ax=b


View Full Document

Berkeley COMPSCI C267 - Dense Linear Algebra: History and Structure, Parallel Matrix Multiplication

Documents in this Course
Lecture 4

Lecture 4

52 pages

Split-C

Split-C

5 pages

Lecture 5

Lecture 5

40 pages

Load more
Download Dense Linear Algebra: History and Structure, Parallel Matrix Multiplication
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Dense Linear Algebra: History and Structure, Parallel Matrix Multiplication and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Dense Linear Algebra: History and Structure, Parallel Matrix Multiplication 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?