
Avoiding Communication in Linear Algebra


Contents (slide titles):
• Motivation
• Outline
• Collaborators (so far)
• Why all our problems are solved for dense linear algebra – in theory
• Summary (1) – Avoiding Communication in Dense Linear Algebra
• Minimizing Comm. in Parallel QR
• TSQR in more detail
• Minimizing Communication in TSQR
• Performance of TSQR vs Sca/LAPACK
• QR for General Matrices
• Modeled Speedups of CAQR vs ScaLAPACK
• TSLU: LU factorization of a tall skinny matrix
• Growth factor for TSLU based factorization
• Making TSLU Stable
• Growth factor for better CALU approach
• Performance vs ScaLAPACK
• Speedup prediction for a Petascale machine – up to 81x faster
• Summary (2) – Avoiding Communication in Sparse Linear Algebra
• Performance Results
• Optimizing Communication Complexity of Sparse Solvers
• Minimizing Communication of GMRES
• Summary and Conclusions (1/2)
• Summary and Conclusions (2/2)

Avoiding Communication in Linear Algebra
Jim Demmel
UC Berkeley
bebop.cs.berkeley.edu

Motivation
• Running time of an algorithm is the sum of 3 terms:
  • # flops * time_per_flop
  • # words moved / bandwidth   (communication)
  • # messages * latency        (communication)
• Exponentially growing gaps between:
  • Time_per_flop << 1/Network BW << Network Latency
    – Improving 59%/year vs 26%/year vs 15%/year
  • Time_per_flop << 1/Memory BW << Memory Latency
    – Improving 59%/year vs 23%/year vs 5.5%/year
• Goal: reorganize linear algebra to avoid communication
  • Not just hiding communication (speedup ≤ 2x)
  • Arbitrary speedups possible

Outline
• Motivation
• Avoiding Communication in Dense Linear Algebra
• Avoiding Communication in Sparse Linear Algebra
• A poem in memory of Gene Golub (separate file)

Collaborators (so far)
• UC Berkeley
  – Kathy Yelick, Ming Gu
  – Mark Hoemmen, Marghoob Mohiyuddin, Kaushik Datta, George Petropoulos, Sam Williams, BeBOp group
  – Lenny Oliker, John Shalf
• CU Denver
  – Julien Langou
• INRIA
  – Laura Grigori, Hua Xiang
• Much related work
  – Complete references in technical reports

Why all our problems are solved for dense linear algebra – in theory
• (Talk by Ioana Dumitriu on Monday)
• Thm (D., Dumitriu, Holtz, Kleinberg) (Numer. Math. 2007)
  – Given any matmul running in O(n^ω) ops for some ω > 2, it can be made stable and still run in O(n^(ω+η)) ops, for any η > 0.
  – Current record: ω ≈ 2.38
• Thm (D., Dumitriu, Holtz) (Numer. Math. 2008)
  – Given any stable matmul running in O(n^(ω+η)) ops, it is possible to do backward stable dense linear algebra in O(n^(ω+η)) ops:
  • GEPP, QR
  • Rank-revealing QR (randomized)
  • (Generalized) Schur decomposition, SVD (randomized)
• Also reduces communication to O(n^(ω+η))
• But constants?

Summary (1) – Avoiding Communication in Dense Linear Algebra
• QR or LU decomposition of an m x n matrix, m >> n
  – Parallel implementation
    • Conventional: O(n log P) messages
    • "New": O(log P) messages – optimal
  – Serial implementation with fast memory of size F
    • Conventional: O(mn/F) moves of data from slow to fast memory
      – mn/F = how many times larger the matrix is than fast memory
    • "New": O(1) moves of data – optimal
  – Lots of speedup possible (measured and modeled)
    • Price: some redundant computation; stability?
• Extends to the square case, with optimality results
• Extends to other architectures (e.g. multicore)
• (Talk by Julien Langou Monday, on QR)

Minimizing Comm. in Parallel QR
[Figure: W split into row blocks W0..W3; local factors R00, R10, R20, R30 combined pairwise into R01, R11 and finally R02.]
• QR decomposition of an m x n matrix W, m >> n
• TSQR = "Tall Skinny QR"
• P processors, block row layout
• Usual parallel algorithm
  • Compute a Householder vector for each column
  • Number of messages ≈ n log P
• Communication-avoiding algorithm
  • Reduction operation, with QR as the operator
  • Number of messages ≈ log P

TSQR in more detail
  W = [W0; W1; W2; W3] = [Q00·R00; Q10·R10; Q20·R20; Q30·R30]
  [R00; R10] = Q01·R01,   [R20; R30] = Q11·R11
  [R01; R11] = Q02·R02
• Q is represented implicitly as a product (a tree of factors)

Minimizing Communication in TSQR
• Parallel: binary reduction tree (W0..W3 → R00..R30 → R01, R11 → R02)
• Sequential: flat tree (R00 → R01 → R02 → R03)
• Dual core: a hybrid of the two
• Choose the reduction tree dynamically
• Multicore / multisocket / multirack / multisite / out-of-core: ?

Performance of TSQR vs Sca/LAPACK
• Parallel
  – Pentium III cluster, Dolphin Interconnect, MPICH
    • Up to 6.7x speedup (16 procs, 100K x 200)
  – BlueGene/L
    • Up to 4x speedup (32 procs, 1M x 50)
  – Both use Elmroth–Gustavson locally – enabled by TSQR
• Sequential
  – OOC on a PowerPC laptop
    • As little as 2x slowdown vs (predicted) infinite DRAM
• See UC Berkeley EECS Tech Report 2008-74

QR for General Matrices
• CAQR – Communication-Avoiding QR for general A
  – Use TSQR for panel factorizations
  – Apply to the rest of the matrix
• Cost of CAQR vs ScaLAPACK's PDGEQRF
  – n x n matrix on a P^(1/2) x P^(1/2) processor grid, block size b
  – Flops: (4/3)n^3/P + (3/4)n^2·b·log P / P^(1/2) vs (4/3)n^3/P
  – Bandwidth: (3/4)n^2·log P / P^(1/2) vs same
  – Latency: 2.5·n·log P / b vs 1.5·n·log P
• Close to optimal (modulo log P factors)
  – Assume: O(n^2/P) memory/processor, O(n^3) algorithm
  – Choose b near n / P^(1/2) (its upper bound)
  – Bandwidth lower bound: Ω(n^2 / P^(1/2)) – just log(P) smaller
  – Latency lower bound: Ω(P^(1/2)) – just polylog(P) smaller
  – Extension of Irony/Toledo/Tiskin (2004)
• Implementation – Julien's summer project

Modeled Speedups of CAQR vs ScaLAPACK
• Petascale: up to 22.9x
• IBM Power 5: up to 9.7x
• "Grid": up to 11x
• Petascale machine with 8192 procs, each at 500 GFlops/s, a bandwidth of 4 …
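The TSQR reduction described in the slides (local QR of each row block, then QR of stacked R factors up a tree) can be sketched in a few lines. This is a minimal sequential sketch, assuming NumPy; unlike the algorithm on the slides, it returns only the final R and discards the tree of Q factors rather than storing them implicitly:

```python
import numpy as np

def tsqr_r(W, nblocks=4):
    """R factor of a tall skinny W via a binary reduction tree of local QRs."""
    # Leaf level: factor each row block W_i = Q_i0 * R_i0
    rs = [np.linalg.qr(b)[1] for b in np.array_split(W, nblocks, axis=0)]
    # Reduction: stack pairs of R factors and re-factor until one R remains
    while len(rs) > 1:
        rs = [np.linalg.qr(np.vstack(rs[i:i + 2]))[1]
              for i in range(0, len(rs), 2)]
    return rs[0]

rng = np.random.default_rng(0)
W = rng.standard_normal((1000, 50))  # m >> n, as in the talk
R_tree = tsqr_r(W)
R_ref = np.linalg.qr(W)[1]
# For full-rank W the R factor is unique up to the sign of each row,
# so the magnitudes must agree with a direct QR of W.
assert np.allclose(np.abs(R_tree), np.abs(R_ref))
```

In a parallel setting each leaf QR runs on one processor and each tree level costs one message per pair, which is where the O(log P) message count comes from.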
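The three-term running-time model from the Motivation slide can also be tried numerically. A toy sketch, with illustrative (not measured) machine constants as assumptions; the factor-of-n gap in message counts between conventional parallel QR (O(n log P)) and TSQR (O(log P)) shows up directly in the latency term:

```python
import math

# Three-term cost model from the talk:
#   T = #flops * time_per_flop + #words / bandwidth + #messages * latency
# The machine constants below are illustrative assumptions, not measured values.
def runtime(flops, words, messages,
            time_per_flop=1e-11,  # seconds per flop (assumed)
            bandwidth=1e9,        # words per second (assumed)
            latency=1e-5):        # seconds per message (assumed)
    return flops * time_per_flop + words / bandwidth + messages * latency

# Latency term alone, for a tall skinny QR with n = 200 columns on P = 16 procs:
n, P = 200, 16
conventional = runtime(0, 0, n * math.log2(P))  # ~n log P messages
tsqr = runtime(0, 0, math.log2(P))              # ~log P messages
ratio = conventional / tsqr                     # factor of n
```

With these assumptions the latency costs differ by exactly the factor n = 200, matching the message-count gap claimed on the slides.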

