CS 267 Dense Linear Algebra: Possible Class Projects

James Demmel
www.cs.berkeley.edu/~demmel/cs267_Spr09
03/04/2009, CS267 Lecture 12a

Kinds of class projects

• Try tuning existing (widely used) codes in LAPACK, ScaLAPACK, or possible future versions
  - Possible impact: help many people run faster
• Add missing functionality to these libraries
  - Possible impact: lots of users want it
• Experiment with algorithms on new architectures
  - Possible impact: What do we need to do differently for performance on these platforms? Are there bottlenecks or other problems in the architecture? Could they be fixed?
• Experiment with new software approaches
  - Possible impact: Is it easier to write these algorithms while getting most of the performance? Should we produce future versions of the libraries this way?
• Experiment with new algorithms
  - Possible impact: find a better one!

Challenges to Libraries (and parallel SW in general)

• Minimizing communication costs
  - The cost of bandwidth and latency (to main memory or over a network) is growing exponentially relative to the cost of arithmetic
• Heterogeneous platforms
  - Different communication costs depending on destination: same chip vs. different socket vs. different board ...
  - CPU + GPU: they perform different operations at very different rates
• Dynamic scheduling and load balancing
  - Can't always assume each core/processor makes constant progress on your task
  - It may be faster to grab the next available task than to follow a predesigned "perfectly balanced" schedule
  - The OS may give or take away resources on the fly
• Fault tolerance: how to recover when one processor fails

Strassen's Matmul on Multicore or GPU

• Why is Strassen not in most libraries?
  - See "Baleful Effect of Benchmarks..." by Prof. Kahan
• Likely to be faster for modest-to-large matrix sizes
  - Where is the crossover?
• May want a hybrid: switch to the O(n^3) algorithm below a certain (smaller) size
  - Autotuning?
• Lots of "blocking" opportunities, as for standard matmul
  - What is the least amount of data movement possible?
• How well does it work for the rectangular matmuls in LU, QR, and Cholesky?
  - Do we need to modify LU, QR, or Cholesky to take advantage of Strassen (by using a variant that multiplies different-size matrices)? See the sketch after this list.
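The following minimal NumPy sketch (not from the slides) illustrates the hybrid suggested above: Strassen recursion for large blocks, switching to the classical kernel below a crossover size. CROSSOVER is a hypothetical tuning parameter, exactly the kind of value an autotuner would search for.

    import numpy as np

    CROSSOVER = 128  # hypothetical tuning knob: below this size the classical kernel wins

    def strassen(A, B):
        # For simplicity: square matrices whose dimension is a power of 2.
        n = A.shape[0]
        if n <= CROSSOVER:
            return A @ B  # classical O(n^3) multiply (vendor BLAS in practice)
        h = n // 2
        A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
        B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
        # Strassen's 7 half-size products (instead of the classical 8)
        M1 = strassen(A11 + A22, B11 + B22)
        M2 = strassen(A21 + A22, B11)
        M3 = strassen(A11, B12 - B22)
        M4 = strassen(A22, B21 - B11)
        M5 = strassen(A11 + A12, B22)
        M6 = strassen(A21 - A11, B11 + B12)
        M7 = strassen(A12 - A22, B21 + B22)
        C = np.empty_like(A)
        C[:h, :h] = M1 + M4 - M5 + M7
        C[:h, h:] = M3 + M5
        C[h:, :h] = M2 + M4
        C[h:, h:] = M1 - M2 + M3 + M6
        return C

Timing strassen against a plain A @ B while varying CROSSOVER is one direct way to locate the crossover point asked about above.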
Review: Alternative recursive GE (Gaussian elimination) formulation

• Toledo (1997)
  - Described here without pivoting, for simplicity
  - "Do the left half of the matrix, then the right half"

    function [L,U] = RLU(A)                    ... assume A is m by n
      if (n = 1)
        L = A / A(1,1); U = A(1,1)
      else
        [L1,U1] = RLU( A(1:m, 1:n/2) )         ... do left half of A
                                               ... let L11 denote the top n/2 rows of L1
        A(1:n/2, n/2+1:n) = L11^(-1) * A(1:n/2, n/2+1:n)
                                               ... update top n/2 rows of right half of A
        A(n/2+1:m, n/2+1:n) = A(n/2+1:m, n/2+1:n)
                              - A(n/2+1:m, 1:n/2) * A(1:n/2, n/2+1:n)
                                               ... update the rest of the right half of A
        [L2,U2] = RLU( A(n/2+1:m, n/2+1:n) )   ... do right half of A
        return L = [ L1, [0; L2] ]
           and U = [ U1, A(1:n/2, n/2+1:n) ; 0, U2 ]
                                               ... so that A = L * U

Register-file resident Linear Algebra on GPUs

• Vasily's results for LU, QR, and Cholesky on the GPU target single large matrices, too large to fit in the "fast memory" (shared memory + registers) of the GPU
• There is also demand for solving many smaller problems in parallel, e.g., A(i) * x(i) = b(i) for many different A(1),...,A(k) and b(1),...,b(k)
• Project: design linear algebra algorithms that operate on many different matrices in parallel, each small enough to fit in the 64 KB register file of each multiprocessor
  - e.g., a single-precision square matrix of dimension n = 128 (128 * 128 * 4 bytes = 64 KB)
• Question: does the possible need to branch differently on each multiprocessor (because of different pivot orders) matter? If so, is QR better than LU?
• Question: do we need BLAS3 code versions for such small matrices, or is BLAS2 enough? (A batched-solve sketch follows this list.)
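To make the "many small systems" workload concrete, here is a minimal NumPy sketch (not from the slides) of the batched-solve pattern; the batch count k is illustrative, and n = 128 in single precision matches the 64 KB register-file budget above. NumPy's stacked solve only shows the shape of the interface; on a GPU, each system would map to one multiprocessor.

    import numpy as np

    k, n = 1000, 128  # k independent systems; each matrix is n*n*4 bytes = 64 KB
    rng = np.random.default_rng(0)
    A = rng.standard_normal((k, n, n)).astype(np.float32)
    b = rng.standard_normal((k, n)).astype(np.float32)

    # np.linalg.solve broadcasts over the leading (batch) dimension,
    # solving all k systems A[i] @ x[i] = b[i] in one call.
    x = np.linalg.solve(A, b)

    # Loose residual check for one system (single precision, random matrices).
    r = np.linalg.norm(A[0] @ x[0] - b[0]) / np.linalg.norm(b[0])
    assert r < 1e-2

A study of the BLAS2-vs-BLAS3 question above could start by replacing the library solve with hand-written unblocked (BLAS2-style) and blocked (BLAS3-style) factorizations over the same batch.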
Extend Vasily's GPU analysis and code to ATI

• Vasily's Best Student Paper Award from SC08 had two parts:
  - Analyzed bottlenecks and speedup possibilities in the NVIDIA architecture
  - Applied the lessons to a reorganization of LU, QR, and Cholesky
• What about ATI GPUs?
  - Both of the above aspects are interesting
  - An ATI GPU is available in the ParLab
  - What are the pros and cons of the ATI and NVIDIA architectures? Others?
  - Do we need to reorganize the algorithms differently for each, or does one algorithm (perhaps with different block sizes and other parameters) work for both (which would be simpler)?
• Other BLAS-like operations on the GPU
  - Needed for finite-element analysis

Missing Drivers in Sca/LAPACK

                                     LAPACK            ScaLAPACK
  Linear Equations
    LU                               xGESV             PxGESV
    Cholesky                         xPOSV             PxPOSV
    LDL^T                            xSYSV             missing
  Least Squares (LS)
    QR                               xGELS             PxGELS
    QR + pivot                       xGELSY            missing driver
    SVD/QR                           xGELSS            missing driver
    SVD/D&C                          xGELSD            missing (intent)
    SVD/MRRR                         missing (oops)    missing (oops)
    QR + iterative refinement        missing           missing
  Generalized LS
    LS + equality constraints        xGGLSE            missing
    Generalized LM                   xGGGLM            missing
    Above + iterative refinement     missing           missing

More missing drivers

                                     LAPACK            ScaLAPACK
  Symmetric EVD
    QR / Bisection + Invit           xSYEV / X         PxSYEV / X
    D&C                              xSYEVD            missing (intent)
    MRRR                             xSYEVR            missing
  Nonsymmetric EVD
    Schur form                       xGEES / X         missing driver
    Vectors too                      xGEEV / X         missing driver
  SVD
    QR                               xGESVD            PxGESVD
    D&C                              xGESDD            missing (intent)
    MRRR                             missing (oops)    missing (oops)
    Jacobi                           xGESVJ            missing
  Generalized Symmetric EVD
    QR / Bisection + Invit           xSYGV / X         PxSYGV / X
    D&C                              xSYGVD            missing (intent)
    MRRR                             missing           missing
  Generalized Nonsymmetric EVD
    Schur form                       xGGES / X         missing
    Vectors too                      xGGEV / X         missing
  Generalized SVD
    Kogbetliantz                     xGGSVD            missing (intent)
    MRRR                             missing (oops)    missing (oops)

Missing matrix types in ScaLAPACK

• Symmetric, Hermitian, triangular
  - Band, packed
• Positive definite
  - Packed
• Orthogonal, unitary
  - Packed

[Figure: execution time of PDGESV for various grid shapes (1x60, 2x30, 3x20, 4x15, 5x12, 6x10) and problem sizes (1000 to 10000); times range from 0 to 100 seconds. Times obtained on 60 processors of a dual AMD Opteron 1.4 GHz cluster.]
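The figure above sweeps the shape of the 60-process grid. As a small illustration (not from the slides), the Python helper below enumerates the candidate Pr x Pc grid shapes for a given process count; for p = 60 it reproduces exactly the six shapes in the figure.

    def grid_shapes(p):
        # All process-grid shapes (rows, cols) with rows <= cols using exactly p processes.
        return [(r, p // r) for r in range(1, int(p ** 0.5) + 1) if p % r == 0]

    print(grid_shapes(60))
    # [(1, 60), (2, 30), (3, 20), (4, 15), (5, 12), (6, 10)]

Which shape is fastest is routine- and machine-dependent, so a sweep like the one in the figure (or an autotuner looping over grid_shapes(p)) is worth running before committing to a grid.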