Berkeley COMPSCI C267 - Parallel Matrix Multiplication - D677819

Home> Schools> University of California, Berkeley> Computer Science (COMPSCI) > COMPSCI C267> Parallel Matrix Multiplication

DOC PREVIEW

Berkeley COMPSCI C267 - Parallel Matrix Multiplication

School name University of California, Berkeley

Course Compsci C267- Applications of Parallel Computers

Pages 60

This preview shows page 1-2-3-4-28-29-30-31-57-58-59-60 out of 60 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 60 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 60 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 60 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 60 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 60 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 60 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 60 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 60 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 60 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 60 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 60 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 60 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 60 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

CS 267 Parallel Matrix MultiplicationParallel Numerical AlgorithmsParallel Vector OperationsBroadcast and reductionBroadcast AlgorithmsLower Bound on Parallel PerformanceScan (or Parallel prefix), A DigressionApplications of scansPowerPoint PresentationSlide 10Implementing ScansE.g., Using Scans for Array CompressionE.g., Fibonacci via Matrix Multiply PrefixSegmented OperationsThe “Myth” of log nEnd of DigressionParallel Matrix-Vector ProductMatrix-Vector ProductParallel Matrix MultiplyLatency Bandwidth ModelMatrix Multiply with 1D Column LayoutMatrix Multiply: 1D Layout on Bus or RingMatMul: 1D layout on Bus without BroadcastNaïve MatMul (continued)Matmul for 1D layout on a Processor RingSlide 26MatMul with 2D LayoutCannon’s AlgorithmCannon’s Matrix MultiplicationInitial Step to Skew Matrices in CannonSkewing Steps in CannonCost of Cannon’s AlgorithmDrawbacks to CannonSUMMA AlgorithmSUMMASlide 36SUMMA performanceSlide 38Slide 39Slide 40Recursive LayoutsSummary of Parallel Matrix MultiplicationExtra SlidesGaussian EliminationGaussian Elimination via a Recursive AlgorithmRecursive FactorizationsSlide 47Review: BLAS 3 (Blocked) GEPPSlide 49Slide 50Slide 51Slide 52Slide 53Slide 54Slide 55Slide 56Slide 57Slide 58A small software project ...Work-Depth Model of Parallelism01/14/19 CS267 Lecure 13 1CS 267Parallel Matrix MultiplicationKathy Yelickhttp://www.cs.berkeley.edu/~yelick/cs26701/14/19 CS267 Lecure 13 2Parallel Numerical Algorithms•Lecture schedule:•3/8: Dense Matrix Products•BLAS 1: Vector operations•BLAS 2: Matrix-Vector operations•BLAS 3: Matrix-Matrix operations•Use of Performance models in algorithm design•3/10: Dense Matrix Solvers•3/12: Matrix Multiply context results (HW1)•310 Soda at 1:30pm•3/15: Sparse Matrix Products •3/17: Sparse Direct Solvers01/14/19 CS267 Lecure 13 3Parallel Vector OperationsSome common vector operations for vectors x,y,z:•Vector add: z = x + y •Trivial to parallelize if vectors are aligned•AXPY: z = a*x+y (where a is scalar)•Broadcast a, followed by independent * and +•Dot product: s = j x[j] * y[j]•Independent * followed by + reduction01/14/19 CS267 Lecure 13 4Broadcast and reduction•Broadcast of 1 value to p processors in log p time•Reduction of p values to 1 in log p time•Takes advantage of associativity in +,*, min, max, etc.v8 1 3 1 0 4 -6 3 2Add-reductionBroadcast01/14/19 CS267 Lecure 13 5Broadcast Algorithms•Sequential or “centralized” algorithm•P0 sends value to P-1 other processors in sequence•O(P) algorithm•Note: variations in UPC/Titanium model based on whether P0 writes to all others, or others read from P0•Tree-based algorithm•May vary branching factor•O(log P) algorithm•If broadcasting large data blocks, may break into pieces and pipelinevBroadcastP0P4P6P0 P1 P2 P3 P4 P5 P6 P7P2P001/14/19 CS267 Lecure 13 6Lower Bound on Parallel Performance•To compute a function of n inputs x1,…xn•Given only binary operations on our machine.•In 1 time step, output depends on at most 2 inputs•In 2 time steps, output depends on at most 4 inputs•Adding a time step increases possible inputs by at most 2x•In k=log n time steps, output depends on at most n inputs A function of n inputs requires at least log n parallel steps.f(x1,x2,…xn)x1 x2 xnf(x1,x2,…xn)x1 x2 xn01/14/19 CS267 Lecure 13 7Scan (or Parallel prefix), A Digression•What if you want to compute partial sums•Definition: the parallel prefix operation take a binary associative operator , and an array of n elements [a0, a1, a2, … an-1] and produces the array [a0, (a0 a1), … (a0 a1 ... an-1)]•Example: add scan of [1, 2, 0, 4, 2, 1, 1, 3] is [1, 3, 3, 7, 9, 10, 11, 14]•Can be implemented in O(n) time by a serial algorithm•Obvious n-1 applications of operator will work01/14/19 CS267 Lecure 13 8Applications of scans•There are several applications of scans, some more obvious than others•lexically compare string of characters•add multi-precision numbers (represented as array of numbers)•evaluate polynomials•implement bucket sort and radix sort•solve tridiagonal systems•to dynamically allocate processors•to search for regular expression (e.g., grep)01/14/19 CS267 Lecure 13 9 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 3 7 11 15 19 23 27 31 (Recursively Prefix) 3 10 21 36 55 78 105 136 1 3 6 10 15 21 28 36 45 55 66 78 91 105 120 136Prefix Sum in parallelAlgorithm: 1. Pairwise sum 2. Recursively Prefix 3. Pairwise SumSlide source: Alan Edelman, MIT01/14/19 CS267 Lecure 13 10• Parallel prefix works on any associative operator 1 2 3 4 5 6 7 8 Pairwise sums 3 7 11 15 Recursive prefix 3 10 21 36 Update “odds” 1 3 6 10 15 21 28 36•Names: +\ (APL), cumsum(Matlab), MPI_SCAN•Warning: 2n operations used when only n-1 neededParallel Prefix CostSlide source: Alan Edelman, MIT01/14/19 CS267 Lecure 13 11Implementing Scans•Tree summation 2 phases•up sweep•get values L and R from left and right child•save L in local variable Mine•compute Tmp = L + R and pass to parent•down sweep•get value Tmp from parent•send Tmp to left child•send Tmp+Mine to right child6543 2 4 1Up sweep: mine = left tmp = left + right46 95 43 1 2 0 4 1 1 36543 2 4 10600 33 4 6 6 10 11 12 15+X = 3 1 2 0 4 1 1 344 6 6 10 116 1112Down sweep: tmp = parent (root is 0) right = tmp + mine01/14/19 CS267 Lecure 13 12E.g., Using Scans for Array Compression•Given an array of n elements [a0, a1, a2, … an-1] and an array of flags [1,0,1,1,0,0,1,…] compress the flagged elements [a0, a2, a3, a6, …]•Compute a “prescan” i.e., a scan that doesn’t include the element at position i in the sum [0,1,1,2,3,3,4,…]•Gives the index of the ith element in the compressed array•If the flag for this element is 1, write it into the result array at the given position01/14/19 CS267 Lecure 13 13E.g., Fibonacci via Matrix Multiply PrefixFn+1 = Fn + Fn-11-nnn1nFF 0111FFCan compute all Fn by matmul_prefix on [ , , , , , ,

View Full Document

Berkeley COMPSCI C267 - Parallel Matrix Multiplication

Sign up for free to view:

This document and 3 million+ documents and flashcards
High quality study guides, lecture notes, practice exams
Course Packets handpicked by editors offering a comprehensive review of your courses
Better Grades Guaranteed


School:
Email:
New Password:
Confirm Password:

This preview shows page 1-2-3-4-28-29-30-31-57-58-59-60 out of 60 pages.

Berkeley COMPSCI C267 - Parallel Matrix Multiplication

Sign up for free to view:

Please select your school