DOC PREVIEW
Berkeley COMPSCI C267 - Dense Linear Algebra: Possible Class Projects

This preview shows page 1-2-3-4-5-6 out of 17 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 17 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 17 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 17 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 17 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 17 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 17 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 17 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

CS 267 Dense Linear Algebra: Possible Class ProjectsKinds of class projectsChallenges to Libraries (and parallel SW in general)Strassen’s Matmul on Multicore or GPUReview: Alternative recursive GE formulationRegister-file resident Linear Algebra on GPUsExtend Vasily’s GPU analysis, code to ATIMissing Drivers in Sca/LAPACKMore missing driversMissing matrix types in ScaLAPACKSlide 11Slide 12Parallel Eigenvalue Algorithms on GPUExperiment with PLASMA for MulticoreFork-Join vs. Dynamic Execution on MulticoreSlide 16Investigate role of “Dense Motif” in ParLab Apps03/04/2009 CS267 Lecture 12a 1CS 267 Dense Linear Algebra:Possible Class ProjectsJames Demmelwww.cs.berkeley.edu/~demmel/cs267_Spr0903/04/2009 CS267 Lecture 12a 2Kinds of class projects•Try tuning existing (widely used) codes in LAPACK, ScaLAPACK or possible future versions-Possible impact: help many people to run faster•Add missing functionality to these libraries-Possible impact: lots of users want it•Experiment with algorithms on new architectures-Possible impact: What do we need to do differently for performance on these platforms? Are there any bottlenecks or other problems in the architecture? Could they be fixed?•Experiment with new software approaches-Possible impact: Is it easier to write these algorithms while getting most of the performance? Should we produce future versions of the libraries this way?•Experiment with new algorithms-Possible impact: Find a better one!Challenges to Libraries (and parallel SW in general)•Minimizing communication costs-Cost of bandwidth and latency (to main memory or over a network) growing exponentially compared to arithmetic•Heterogeneous platforms-Different communication costs depending on destination•Same chip vs different socket vs different board …-CPU + GPU•Perform different operations at very different rates•Dynamic scheduling & load balancing-Can’t always assume each core/processor makes constant progress on your task-May be faster to grab next available task than use predesigned “perfectly balanced” schedule-OS may give, take away resources on the fly•Fault tolerance – how to recover when one proc fails03/02/2009 CS267 Lecture 11 3Strassen’s Matmul on Multicore or GPU•Why no Strassen in most libraries?-See “Baleful Effect of Benchmarks…” by Prof. Kahan•Likely to be faster for modest-to-large matrix sizes-Where is the crossover?•May want hybrid: switch to O(n3) algorithm for certain sizes (smaller)-Autotuning?•Lots of “blocking” opportunities as for standard matmul-What is least amount of data movement possible?•How well does it work for the rectangular matmuls in LU, QR and Cholesky?-Do we need to modify LU, QR or Cholesky to take advantage of Strassen (by using a variant that multiplies different size matrices)?03/04/2009 CS267 Lecture 12a 4Review: Alternative recursive GE formulation • Toledo (1997) -Describe without pivoting for simplicity-“Do left half of matrix, then right half”03/04/2009 CS267 Lecture 12a 5 function [L,U] = RLU (A) … assume A is m by n if (n=1) L = A/A(1,1), U = A(1,1) else [L1,U1] = RLU( A(1:m , 1:n/2)) … do left half of A … let L11 denote top n/2 rows of L1 A( 1:n/2 , n/2+1 : n ) = L11-1 * A( 1:n/2 , n/2+1 : n ) … update top n/2 rows of right half of A A( n/2+1: m, n/2+1:n ) = A( n/2+1: m, n/2+1:n ) - A( n/2+1: m, 1:n/2 ) * A( 1:n/2 , n/2+1 : n ) … update rest of right half of A [L2,U2] = RLU( A(n/2+1:m , n/2+1:n) ) … do right half of A return [ L1,[0;L2] ] and [U1, [ A(.,.) ; U2 ] ]A = L * URegister-file resident Linear Algebra on GPUs•Vasily’s results for LU, QR and Cholesky on GPU target single large matrices, too large to fit just in the “fast memory” (shared + registers) of the GPU•There is also demand for solving many smaller problems in parallel, eg A(i) * x(i) = b(i) for many different A(1),…,A(k) and b(1),…,b(k)•Project: Design linear algebra algorithms that operate on many different matrices in parallel, each small enough to fit in the 64 KB register set of each multiprocessor -single precision square matrix of dimension n=128•Question: Does possible need to branch differently on each multiprocessor (because of different pivot orders) matter? If so, is QR better than LU?•Question: Do we need BLAS3 code versions on such small matrices, or is BLAS2 enough?03/04/2009 CS267 Lecture 12a 6Extend Vasily’s GPU analysis, code to ATI•Vasily’s Best Student Paper Award from SC08 had two parts:-Analyzed bottlenecks, speedup possibilities in NVIDIA architecture-Applied lessons to reorganization of LU, QR, Cholesky•What about ATI GPU?-Both above aspects interesting-ATI GPU available in ParLab-What are pros, cons of ATI, NVIDIA architectures? Others?-Do we need to reorganize algorithms differently for each, or does one algorithm (perhaps with different block sizes, other parameters) work for both (which would be simpler)?•Other BLAS-like operations on GPU-Needed for finite-element analysis03/04/2009 CS267 Lecture 12a 7Missing Drivers in Sca/LAPACKLAPACK ScaLAPACKLinear EquationsLUCholeskyLDLTxGESVxPOSVxSYSVPxGESVPxPOSVmissingLeast Squares (LS)QRQR+pivotSVD/QRSVD/D&CSVD/MRRRQR + iterative refine.xGELSxGELSYxGELSSxGELSDmissing (oops)missingPxGELSmissing drivermissing drivermissing (intent)missing(oops)missingGeneralized LS LS + equality constr.Generalized LMAbove + Iterative ref.xGGLSExGGGLMmissingmissingmissingmissingMore missing driversLAPACK ScaLAPACKSymmetric EVD QR / Bisection+InvitD&CMRRRxSYEV / XxSYEVDxSYEVRPxSYEV / Xmissing (intent)missingNonsymmetric EVD Schur formVectors tooxGEES / XxGEEV /Xmissing drivermissing driverSVD QRD&CMRRRJacobixGESVDxGESDDmissing(oops)xGESVJPxGESVDmissing (intent)missing(oops)missingGeneralized Symmetric EVDQR / Bisection+InvitD&CMRRRxSYGV / XxSYGVDmissingPxSYGV / Xmissing (intent)missingGeneralized Nonsymmetric EVDSchur formVectors tooxGGES / XxGGEV / XmissingmissingGeneralized SVD KogbetliantzMRRRxGGSVDmissing(oops)missing (intent)missing(oops)Missing matrix types in ScaLAPACK•Symmetric, Hermitian, triangular-Band, Packed •Positive Definite-Packed•Orthogonal, Unitary-Packed0102030405060708090100seconds100020003000400050006000700080009000100001x602x303x204x155x126x10problem sizegrid shapeExecution time of PDGESV for various grid shape90-10080-9070-8060-7050-6040-5030-4020-3010-200-10Times obtained on: 60 processors, Dual AMD Opteron 1.4GHz Cluster


View Full Document

Berkeley COMPSCI C267 - Dense Linear Algebra: Possible Class Projects

Documents in this Course
Lecture 4

Lecture 4

52 pages

Split-C

Split-C

5 pages

Lecture 5

Lecture 5

40 pages

Load more
Download Dense Linear Algebra: Possible Class Projects
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Dense Linear Algebra: Possible Class Projects and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Dense Linear Algebra: Possible Class Projects 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?