DOC PREVIEW
Berkeley COMPSCI C267 - Parallel Matrix Multiplication

This preview shows page 1-2-3-4-28-29-30-31-57-58-59-60 out of 60 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 60 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 60 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 60 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 60 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 60 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 60 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 60 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 60 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 60 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 60 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 60 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 60 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 60 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

CS 267 Parallel Matrix MultiplicationParallel Numerical AlgorithmsParallel Vector OperationsBroadcast and reductionBroadcast AlgorithmsLower Bound on Parallel PerformanceScan (or Parallel prefix), A DigressionApplications of scansPowerPoint PresentationSlide 10Implementing ScansE.g., Using Scans for Array CompressionE.g., Fibonacci via Matrix Multiply PrefixSegmented OperationsThe “Myth” of log nEnd of DigressionParallel Matrix-Vector ProductMatrix-Vector ProductParallel Matrix MultiplyLatency Bandwidth ModelMatrix Multiply with 1D Column LayoutMatrix Multiply: 1D Layout on Bus or RingMatMul: 1D layout on Bus without BroadcastNaïve MatMul (continued)Matmul for 1D layout on a Processor RingSlide 26MatMul with 2D LayoutCannon’s AlgorithmCannon’s Matrix MultiplicationInitial Step to Skew Matrices in CannonSkewing Steps in CannonCost of Cannon’s AlgorithmDrawbacks to CannonSUMMA AlgorithmSUMMASlide 36SUMMA performanceSlide 38Slide 39Slide 40Recursive LayoutsSummary of Parallel Matrix MultiplicationExtra SlidesGaussian EliminationGaussian Elimination via a Recursive AlgorithmRecursive FactorizationsSlide 47Review: BLAS 3 (Blocked) GEPPSlide 49Slide 50Slide 51Slide 52Slide 53Slide 54Slide 55Slide 56Slide 57Slide 58A small software project ...Work-Depth Model of Parallelism01/14/19 CS267 Lecure 13 1CS 267Parallel Matrix MultiplicationKathy Yelickhttp://www.cs.berkeley.edu/~yelick/cs26701/14/19 CS267 Lecure 13 2Parallel Numerical Algorithms•Lecture schedule:•3/8: Dense Matrix Products•BLAS 1: Vector operations•BLAS 2: Matrix-Vector operations•BLAS 3: Matrix-Matrix operations•Use of Performance models in algorithm design•3/10: Dense Matrix Solvers•3/12: Matrix Multiply context results (HW1)•310 Soda at 1:30pm•3/15: Sparse Matrix Products •3/17: Sparse Direct Solvers01/14/19 CS267 Lecure 13 3Parallel Vector OperationsSome common vector operations for vectors x,y,z:•Vector add: z = x + y •Trivial to parallelize if vectors are aligned•AXPY: z = a*x+y (where a is scalar)•Broadcast a, followed by independent * and +•Dot product: s = j x[j] * y[j]•Independent * followed by + reduction01/14/19 CS267 Lecure 13 4Broadcast and reduction•Broadcast of 1 value to p processors in log p time•Reduction of p values to 1 in log p time•Takes advantage of associativity in +,*, min, max, etc.v8 1 3 1 0 4 -6 3 2Add-reductionBroadcast01/14/19 CS267 Lecure 13 5Broadcast Algorithms•Sequential or “centralized” algorithm•P0 sends value to P-1 other processors in sequence•O(P) algorithm•Note: variations in UPC/Titanium model based on whether P0 writes to all others, or others read from P0•Tree-based algorithm•May vary branching factor•O(log P) algorithm•If broadcasting large data blocks, may break into pieces and pipelinevBroadcastP0P4P6P0 P1 P2 P3 P4 P5 P6 P7P2P001/14/19 CS267 Lecure 13 6Lower Bound on Parallel Performance•To compute a function of n inputs x1,…xn•Given only binary operations on our machine.•In 1 time step, output depends on at most 2 inputs•In 2 time steps, output depends on at most 4 inputs•Adding a time step increases possible inputs by at most 2x•In k=log n time steps, output depends on at most n inputs A function of n inputs requires at least log n parallel steps.f(x1,x2,…xn)x1 x2 xnf(x1,x2,…xn)x1 x2 xn01/14/19 CS267 Lecure 13 7Scan (or Parallel prefix), A Digression•What if you want to compute partial sums•Definition: the parallel prefix operation take a binary associative operator , and an array of n elements [a0, a1, a2, … an-1] and produces the array [a0, (a0 a1), … (a0 a1 ... an-1)]•Example: add scan of [1, 2, 0, 4, 2, 1, 1, 3] is [1, 3, 3, 7, 9, 10, 11, 14]•Can be implemented in O(n) time by a serial algorithm•Obvious n-1 applications of operator will work01/14/19 CS267 Lecure 13 8Applications of scans•There are several applications of scans, some more obvious than others•lexically compare string of characters•add multi-precision numbers (represented as array of numbers)•evaluate polynomials•implement bucket sort and radix sort•solve tridiagonal systems•to dynamically allocate processors•to search for regular expression (e.g., grep)01/14/19 CS267 Lecure 13 9 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 3 7 11 15 19 23 27 31 (Recursively Prefix) 3 10 21 36 55 78 105 136 1 3 6 10 15 21 28 36 45 55 66 78 91 105 120 136Prefix Sum in parallelAlgorithm: 1. Pairwise sum 2. Recursively Prefix 3. Pairwise SumSlide source: Alan Edelman, MIT01/14/19 CS267 Lecure 13 10• Parallel prefix works on any associative operator 1 2 3 4 5 6 7 8 Pairwise sums 3 7 11 15 Recursive prefix 3 10 21 36 Update “odds” 1 3 6 10 15 21 28 36•Names: +\ (APL), cumsum(Matlab), MPI_SCAN•Warning: 2n operations used when only n-1 neededParallel Prefix CostSlide source: Alan Edelman, MIT01/14/19 CS267 Lecure 13 11Implementing Scans•Tree summation 2 phases•up sweep•get values L and R from left and right child•save L in local variable Mine•compute Tmp = L + R and pass to parent•down sweep•get value Tmp from parent•send Tmp to left child•send Tmp+Mine to right child6543 2 4 1Up sweep: mine = left tmp = left + right46 95 43 1 2 0 4 1 1 36543 2 4 10600 33 4 6 6 10 11 12 15+X = 3 1 2 0 4 1 1 344 6 6 10 116 1112Down sweep: tmp = parent (root is 0) right = tmp + mine01/14/19 CS267 Lecure 13 12E.g., Using Scans for Array Compression•Given an array of n elements [a0, a1, a2, … an-1] and an array of flags [1,0,1,1,0,0,1,…] compress the flagged elements [a0, a2, a3, a6, …]•Compute a “prescan” i.e., a scan that doesn’t include the element at position i in the sum [0,1,1,2,3,3,4,…]•Gives the index of the ith element in the compressed array•If the flag for this element is 1, write it into the result array at the given position01/14/19 CS267 Lecure 13 13E.g., Fibonacci via Matrix Multiply PrefixFn+1 = Fn + Fn-11-nnn1nFF 0111FFCan compute all Fn by matmul_prefix on [ , , , , , ,


View Full Document

Berkeley COMPSCI C267 - Parallel Matrix Multiplication

Documents in this Course
Lecture 4

Lecture 4

52 pages

Split-C

Split-C

5 pages

Lecture 5

Lecture 5

40 pages

Load more
Download Parallel Matrix Multiplication
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Parallel Matrix Multiplication and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Parallel Matrix Multiplication 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?