CS 267 Dense Linear Algebra: Possible Class Projects

James Demmel
www.cs.berkeley.edu/~demmel/cs267_Spr09
03/04/2009, CS267 Lecture 12a

Kinds of class projects

• Try tuning existing (widely used) codes in LAPACK, ScaLAPACK, or possible future versions
  - Possible impact: help many people run faster
• Add missing functionality to these libraries
  - Possible impact: lots of users want it
• Experiment with algorithms on new architectures
  - Possible impact: What do we need to do differently for performance on these platforms? Are there bottlenecks or other problems in the architecture? Could they be fixed?
• Experiment with new software approaches
  - Possible impact: Is it easier to write these algorithms while getting most of the performance? Should we produce future versions of the libraries this way?
• Experiment with new algorithms
  - Possible impact: find a better one!

Challenges to Libraries (and parallel SW in general)

• Minimizing communication costs
  - The cost of bandwidth and latency (to main memory or over a network) is growing exponentially relative to the cost of arithmetic
• Heterogeneous platforms
  - Different communication costs depending on destination: same chip vs. different socket vs. different board ...
  - CPU + GPU: they perform different operations at very different rates
• Dynamic scheduling and load balancing
  - Can't always assume each core/processor makes constant progress on your task
  - It may be faster to grab the next available task than to follow a predesigned "perfectly balanced" schedule
  - The OS may give or take away resources on the fly
• Fault tolerance: how to recover when one processor fails

Strassen's Matmul on Multicore or GPU

• Why is Strassen not in most libraries?
  - See "Baleful Effect of Benchmarks..." by Prof. Kahan
• Likely to be faster for modest-to-large matrix sizes
  - Where is the crossover?
• May want a hybrid: switch to the O(n^3) algorithm below a certain (smaller) size
  - Autotuning?
• Lots of "blocking" opportunities, as for standard matmul
  - What is the least amount of data movement possible?
• How well does it work for the rectangular matmuls in LU, QR, and Cholesky?
  - Do we need to modify LU, QR, or Cholesky to take advantage of Strassen (by using a variant that multiplies different-size matrices)? See the sketch after this list.
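The following minimal NumPy sketch (not from the slides) illustrates the hybrid suggested above: Strassen recursion for large blocks, switching to the classical kernel below a crossover size. CROSSOVER is a hypothetical tuning parameter, exactly the kind of value an autotuner would search for.

    import numpy as np

    CROSSOVER = 128  # hypothetical tuning knob: below this size the classical kernel wins

    def strassen(A, B):
        # For simplicity: square matrices whose dimension is a power of 2.
        n = A.shape[0]
        if n <= CROSSOVER:
            return A @ B  # classical O(n^3) multiply (vendor BLAS in practice)
        h = n // 2
        A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
        B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
        # Strassen's 7 half-size products (instead of the classical 8)
        M1 = strassen(A11 + A22, B11 + B22)
        M2 = strassen(A21 + A22, B11)
        M3 = strassen(A11, B12 - B22)
        M4 = strassen(A22, B21 - B11)
        M5 = strassen(A11 + A12, B22)
        M6 = strassen(A21 - A11, B11 + B12)
        M7 = strassen(A12 - A22, B21 + B22)
        C = np.empty_like(A)
        C[:h, :h] = M1 + M4 - M5 + M7
        C[:h, h:] = M3 + M5
        C[h:, :h] = M2 + M4
        C[h:, h:] = M1 - M2 + M3 + M6
        return C

Timing strassen against a plain A @ B while varying CROSSOVER is one direct way to locate the crossover point asked about above.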
Review: Alternative recursive GE (Gaussian elimination) formulation

• Toledo (1997)
  - Described here without pivoting, for simplicity
  - "Do the left half of the matrix, then the right half"

    function [L,U] = RLU(A)                    ... assume A is m by n
      if (n = 1)
        L = A / A(1,1); U = A(1,1)
      else
        [L1,U1] = RLU( A(1:m, 1:n/2) )         ... do left half of A
                                               ... let L11 denote the top n/2 rows of L1
        A(1:n/2, n/2+1:n) = L11^(-1) * A(1:n/2, n/2+1:n)
                                               ... update top n/2 rows of right half of A
        A(n/2+1:m, n/2+1:n) = A(n/2+1:m, n/2+1:n)
                              - A(n/2+1:m, 1:n/2) * A(1:n/2, n/2+1:n)
                                               ... update the rest of the right half of A
        [L2,U2] = RLU( A(n/2+1:m, n/2+1:n) )   ... do right half of A
        return L = [ L1, [0; L2] ]
           and U = [ U1, A(1:n/2, n/2+1:n) ; 0, U2 ]
                                               ... so that A = L * U

Register-file resident Linear Algebra on GPUs

• Vasily's results for LU, QR, and Cholesky on the GPU target single large matrices, too large to fit in the "fast memory" (shared memory + registers) of the GPU
• There is also demand for solving many smaller problems in parallel, e.g., A(i) * x(i) = b(i) for many different A(1),...,A(k) and b(1),...,b(k)
• Project: design linear algebra algorithms that operate on many different matrices in parallel, each small enough to fit in the 64 KB register file of each multiprocessor
  - e.g., a single-precision square matrix of dimension n = 128 (128 * 128 * 4 bytes = 64 KB)
• Question: does the possible need to branch differently on each multiprocessor (because of different pivot orders) matter? If so, is QR better than LU?
• Question: do we need BLAS3 code versions for such small matrices, or is BLAS2 enough? (A batched-solve sketch follows this list.)
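To make the "many small systems" workload concrete, here is a minimal NumPy sketch (not from the slides) of the batched-solve pattern; the batch count k is illustrative, and n = 128 in single precision matches the 64 KB register-file budget above. NumPy's stacked solve only shows the shape of the interface; on a GPU, each system would map to one multiprocessor.

    import numpy as np

    k, n = 1000, 128  # k independent systems; each matrix is n*n*4 bytes = 64 KB
    rng = np.random.default_rng(0)
    A = rng.standard_normal((k, n, n)).astype(np.float32)
    b = rng.standard_normal((k, n)).astype(np.float32)

    # np.linalg.solve broadcasts over the leading (batch) dimension,
    # solving all k systems A[i] @ x[i] = b[i] in one call.
    x = np.linalg.solve(A, b)

    # Loose residual check for one system (single precision, random matrices).
    r = np.linalg.norm(A[0] @ x[0] - b[0]) / np.linalg.norm(b[0])
    assert r < 1e-2

A study of the BLAS2-vs-BLAS3 question above could start by replacing the library solve with hand-written unblocked (BLAS2-style) and blocked (BLAS3-style) factorizations over the same batch.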
Extend Vasily's GPU analysis and code to ATI

• Vasily's Best Student Paper Award from SC08 had two parts:
  - Analyzed bottlenecks and speedup possibilities in the NVIDIA architecture
  - Applied the lessons to a reorganization of LU, QR, and Cholesky
• What about ATI GPUs?
  - Both of the above aspects are interesting
  - An ATI GPU is available in the ParLab
  - What are the pros and cons of the ATI and NVIDIA architectures? Others?
  - Do we need to reorganize the algorithms differently for each, or does one algorithm (perhaps with different block sizes and other parameters) work for both (which would be simpler)?
• Other BLAS-like operations on the GPU
  - Needed for finite-element analysis

Missing Drivers in Sca/LAPACK

                                     LAPACK            ScaLAPACK
  Linear Equations
    LU                               xGESV             PxGESV
    Cholesky                         xPOSV             PxPOSV
    LDL^T                            xSYSV             missing
  Least Squares (LS)
    QR                               xGELS             PxGELS
    QR + pivot                       xGELSY            missing driver
    SVD/QR                           xGELSS            missing driver
    SVD/D&C                          xGELSD            missing (intent)
    SVD/MRRR                         missing (oops)    missing (oops)
    QR + iterative refinement        missing           missing
  Generalized LS
    LS + equality constraints        xGGLSE            missing
    Generalized LM                   xGGGLM            missing
    Above + iterative refinement     missing           missing

More missing drivers

                                     LAPACK            ScaLAPACK
  Symmetric EVD
    QR / Bisection + Invit           xSYEV / X         PxSYEV / X
    D&C                              xSYEVD            missing (intent)
    MRRR                             xSYEVR            missing
  Nonsymmetric EVD
    Schur form                       xGEES / X         missing driver
    Vectors too                      xGEEV / X         missing driver
  SVD
    QR                               xGESVD            PxGESVD
    D&C                              xGESDD            missing (intent)
    MRRR                             missing (oops)    missing (oops)
    Jacobi                           xGESVJ            missing
  Generalized Symmetric EVD
    QR / Bisection + Invit           xSYGV / X         PxSYGV / X
    D&C                              xSYGVD            missing (intent)
    MRRR                             missing           missing
  Generalized Nonsymmetric EVD
    Schur form                       xGGES / X         missing
    Vectors too                      xGGEV / X         missing
  Generalized SVD
    Kogbetliantz                     xGGSVD            missing (intent)
    MRRR                             missing (oops)    missing (oops)

Missing matrix types in ScaLAPACK

• Symmetric, Hermitian, triangular
  - Band, packed
• Positive definite
  - Packed
• Orthogonal, unitary
  - Packed

[Figure: execution time of PDGESV for various grid shapes (1x60, 2x30, 3x20, 4x15, 5x12, 6x10) and problem sizes (1000 to 10000); times range from 0 to 100 seconds. Times obtained on 60 processors of a dual AMD Opteron 1.4 GHz cluster.]
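The figure above sweeps the shape of the 60-process grid. As a small illustration (not from the slides), the Python helper below enumerates the candidate Pr x Pc grid shapes for a given process count; for p = 60 it reproduces exactly the six shapes in the figure.

    def grid_shapes(p):
        # All process-grid shapes (rows, cols) with rows <= cols using exactly p processes.
        return [(r, p // r) for r in range(1, int(p ** 0.5) + 1) if p % r == 0]

    print(grid_shapes(60))
    # [(1, 60), (2, 30), (3, 20), (4, 15), (5, 12), (6, 10)]

Which shape is fastest is routine- and machine-dependent, so a sweep like the one in the figure (or an autotuner looping over grid_shapes(p)) is worth running before committing to a grid.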