Berkeley COMPSCI C267 - Lecture 14: Parallel Matrix Multiplication

This preview shows pages 1-3, 24-27, and 49-51 of a 51-page document.

CS 267: Applications of Parallel Computers
Lecture 14: Parallel Matrix Multiplication
Kathy Yelick
http://www-inst.eecs.berkeley.edu/~cs267

Outline
•Administrivia
•Parallel Vector Operations
•Broadcast and Reduction; Broadcast Algorithms
•Lower Bound on Parallelism
•Scan (or Parallel Prefix), a Digression
•Applications of Scans; Using Scans for Array Compression
•Implementing Scans
•Parallel Matrix-Vector Product
•Parallel Matrix Multiply: Latency-Bandwidth Model; 1D Column Layout (on a Bus or Ring, with and without Broadcast); 2D Layout; Cannon's Algorithm; SUMMA
•Recursive Layouts
•Summary of Parallel Matrix Multiplication
•Extra Slides: Review of BLAS 3 (Blocked) GEPP; A Small Software Project; Work-Depth Model of Parallelism

Administrivia
•Lecture schedule:
  •10/15: Dense Matrix Products
    •BLAS 1: Vector operations
    •BLAS 2: Matrix-Vector operations
    •BLAS 3: Matrix-Matrix operations
    •Use of performance models in algorithm design
  •10/17: Sparse Matrix Products
  •10/22: Dense Matrix Solvers (Jim Demmel)
  •10/24: Sparse Direct Solvers (Xiaoye Li)

Parallel Vector Operations
Some common vector operations for vectors x, y, z:
•Vector add: z = x + y
  •Trivial to parallelize if the vectors are aligned
•AXPY: z = a*x + y (where a is a scalar)
  •Broadcast a, followed by independent * and +
•Dot product: s = Σj x[j] * y[j]
  •Independent * followed by a + reduction

Broadcast and Reduction
•Broadcast of 1 value to p processors in log p time
•Reduction of p values to 1 in log p time
•Both take advantage of the associativity of +, *, min, max, etc.
(Figure: broadcast of the value 5 to all processors; add-reduction of [1, 3, 1, 0, 4, -6, 3, 2] to the value 8.)

Broadcast Algorithms
•Sequential or "centralized" algorithm
  •P0 sends the value to the other P-1 processors in sequence
  •O(P) algorithm
  •Note: variations in the UPC/Titanium model based on whether P0 writes to all the others, or the others read from P0
•Tree-based algorithm
  •May vary the branching factor
  •O(log P) algorithm
  •If broadcasting large data blocks, may break them into pieces and pipeline
(Figure: tree broadcast of the value 5 from P0 across P0-P7.)

Lower Bound on Parallelism
•Given only binary operations on our machine:
  •In 1 time step, only 1 operation can be performed, so the output depends on at most 2 inputs
  •In 2 time steps, the output depends on at most 4 inputs
  •Each added time step increases the possible inputs by at most 2x
  •In k = log n time steps, the output depends on at most n = 2^k inputs
•Therefore a function of n inputs requires at least log n parallel steps.

Scan (or Parallel Prefix), a Digression
•Definition: the scan operation takes a binary associative operator ⊕ and an array of n elements [a0, a1, a2, …, an-1], and produces the array [a0, (a0 ⊕ a1), …, (a0 ⊕ a1 ⊕ … ⊕ an-1)]
•Example: the add scan of [1, 2, 0, 4, 2, 1, 1, 3] is [1, 3, 3, 7, 9, 10, 11, 14]
•Can be implemented in O(n) time by a serial algorithm
•Generalizes to "segmented scans" given a mask of positions at which the running total is reset

Applications of Scans
•There are several applications of scans, some more obvious than others:
  •lexically compare strings of characters
  •add multi-precision numbers (represented as arrays of numbers)
  •evaluate polynomials
  •implement bucket sort and radix sort
  •solve tridiagonal systems
  •dynamically allocate processors
  •search for regular expressions (e.g., grep)

E.g., Using Scans for Array Compression
•Given an array of n elements [a0, a1, a2, …, an-1] and an array of flags [1,0,1,1,0,0,1,…], compress out the unflagged elements, leaving [a0, a2, a3, a6, …]
•Compute a "prescan", i.e., a scan that does not include the element at position i in the sum: [0,1,1,2,3,3,4,…]
•This gives the index of the ith flagged element in the compressed array

Implementing Scans
•Naïve approach: use multiple trees to calculate the partial sums
  •works in O(log n) time
  •but uses roughly n²/2 processors
(Figure: one summation tree per output, with the values a0 … an broadcast to the trees.)
•Tree summation, in 2 phases:
  •up sweep
    •get values L and R from the left and right child
    •save L in a local variable Mine
    •compute Tmp = L + R and pass it to the parent
  •down sweep
    •get the value Tmp from the parent (the root uses 0)
    •send Tmp to the left child
    •send Tmp + Mine to the right child

Implementing Scans: Example
•Tree summation is used to implement scans efficiently.
(Figure: up sweep (mine = left, tmp = left + right) followed by down sweep (tmp = parent, 0 at the root; right = tmp + mine) on the input [3, 1, 2, 0, 4, 1, 1, 3], producing the scan [3, 4, 6, 6, 10, 11, 12, 15].)

Parallel Matrix-Vector Product
•Compute y = y + A*x, where A is a dense matrix
•Layout: 1D by rows
•Algorithm:
    Foreach processor j
      Broadcast X(j)
      Compute A(i,j)*X(j) on each processor i
•A(i) refers to the n/p-by-n block row that processor i owns
•The algorithm uses the formula Y(i) = Y(i) + A(i)*X = Y(i) + Σj A(i,j)*X(j)
(Figure: P0-P3 each own a block row of A and the corresponding pieces of x and y.)

Matrix-Vector Product
•A column layout of the matrix eliminates the broadcast
  •but adds a reduction to update the destination
•A blocked layout uses both a broadcast and a reduction, each on a subset of the processors
(Figure: 1D column layout on P0-P3; 2D blocked layout on P0-P15.)

Parallel Matrix Multiply
•Computing C = C + A*B
•Using the basic algorithm: 2*n³ flops
•The variables are:
  •data layout
  •topology of the machine
  •scheduling of communication
•Use of performance models for algorithm design:
  Time(message) = latency + #words * time-per-word = α + n*β

Latency-Bandwidth Model
•Network of a fixed number P of processors
  •fully connected
  •each with local memory
•Latency (α)
  •accounts for performance varying with the number of messages
•Inverse bandwidth (β)
  •accounts for performance varying with the size of each message
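
The O(log P) tree-based broadcast from "Broadcast Algorithms" above works by doubling the set of processors that hold the value in each round. A minimal sketch of the communication schedule (illustrative only; the function name is mine, not the lecture's):

```python
# Sketch of a binomial-tree broadcast among P ranks: in each round, every
# rank that already holds the value sends it to one partner, so the set of
# holders doubles and the whole broadcast takes ceil(log2 P) rounds.

def tree_broadcast_schedule(p):
    """Return a list of rounds; each round is a list of (sender, receiver)."""
    rounds = []
    have = 1  # ranks 0..have-1 currently hold the value (P0 starts with it)
    while have < p:
        sends = [(src, src + have) for src in range(have) if src + have < p]
        rounds.append(sends)
        have = min(2 * have, p)
    return rounds

# With 8 processors: 3 rounds, holders doubling 1 -> 2 -> 4 -> 8.
schedule = tree_broadcast_schedule(8)
```

Pipelining a large block, as the slide suggests, would interleave several such schedules, one per piece.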
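
The O(n) serial scan mentioned above is short enough to state directly; it also checks the slide's add-scan example. A sketch (names are mine):

```python
# Serial inclusive scan: one pass, carrying a running combination.
def serial_scan(a, op=lambda x, y: x + y):
    out = []
    total = None
    for x in a:
        total = x if total is None else op(total, x)
        out.append(total)
    return out

# The slide's example: add scan of [1, 2, 0, 4, 2, 1, 1, 3].
result = serial_scan([1, 2, 0, 4, 2, 1, 1, 3])  # [1, 3, 3, 7, 9, 10, 11, 14]
```

Because only associativity of `op` is assumed, the same definition works for max-scans, min-scans, and so on.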
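
The two-phase tree summation above (up sweep, then down sweep) can be simulated serially: each recursion level stands in for one level of the tree, and the prefix handed down from the parent plays the role of Tmp. A sketch, assuming as the example does that n is a power of two:

```python
# Serial simulation of the up-sweep / down-sweep tree scan.
def tree_scan(vals):
    n = len(vals)  # assumed to be a power of two
    if n == 1:
        return vals[:]
    # Up sweep: each internal node computes Tmp = L + R and passes it up;
    # the pairwise sums are the values seen one level higher in the tree.
    pair_sums = [vals[2 * i] + vals[2 * i + 1] for i in range(n // 2)]
    scanned = tree_scan(pair_sums)  # inclusive scan of the level above
    # Down sweep: each node receives the sum of everything to its left
    # (0 at the root); the left child gets Tmp, the right gets Tmp + Mine.
    out = []
    for i in range(n // 2):
        tmp = scanned[i - 1] if i > 0 else 0
        out.append(tmp + vals[2 * i])  # left child adds its own value
        out.append(scanned[i])         # right child's inclusive pair sum
    return out

# The slide's example input [3, 1, 2, 0, 4, 1, 1, 3].
result = tree_scan([3, 1, 2, 0, 4, 1, 1, 3])  # [3, 4, 6, 6, 10, 11, 12, 15]
```

Run in parallel, both sweeps take O(log n) steps on n processors, avoiding the n²/2 processors of the naïve multiple-tree approach.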
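
The array-compression recipe above (prescan the flags, then use the result as output indices) can be sketched as follows; in a parallel setting each flagged write is independent:

```python
# Exclusive scan ("prescan") of the flags: out[i] is the number of
# flagged elements strictly before position i, i.e. the output index.
def prescan(flags):
    out, total = [], 0
    for f in flags:
        out.append(total)
        total += f
    return out

# Scatter each flagged element to its prescan index.
def compress(a, flags):
    idx = prescan(flags)
    result = [None] * sum(flags)
    for i, f in enumerate(flags):
        if f:
            result[idx[i]] = a[i]  # independent writes: parallelizable
    return result

flags = [1, 0, 1, 1, 0, 0, 1]
packed = compress([10, 11, 12, 13, 14, 15, 16], flags)  # [10, 12, 13, 16]
```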
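
The 1D row-layout matrix-vector algorithm above can be simulated with plain lists standing in for each processor's local data (illustrative only; real code would use MPI or UPC, and the broadcast of the X(j) pieces is modeled here by simply assembling the full vector):

```python
# y = y + A*x with a 1D row layout: processor i owns the n/p-by-n block
# row A(i) and the pieces X(i), Y(i) of the vectors.
def matvec_1d_rows(A_blocks, x_pieces, y_pieces):
    # "Broadcast X(j)": after all broadcasts, every processor has all of x.
    x = [v for piece in x_pieces for v in piece]
    new_y = []
    for A_i, y_i in zip(A_blocks, y_pieces):  # each processor, in parallel
        # Y(i) = Y(i) + A(i) * X
        new_y.append([yv + sum(a * xv for a, xv in zip(row, x))
                      for row, yv in zip(A_i, y_i)])
    return new_y

# n = 4, p = 2: each processor owns a 2-by-4 block row.
A_blocks = [[[1, 0, 0, 0], [0, 1, 0, 0]],
            [[0, 0, 1, 0], [0, 0, 0, 1]]]
y_new = matvec_1d_rows(A_blocks, [[1, 2], [3, 4]], [[10, 20], [30, 40]])
```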
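
The Time(message) = α + n*β model above makes the payoff of the tree broadcast concrete: the sequential algorithm pays for P-1 messages, the binary tree for only ⌈log2 P⌉ (one per tree level). A sketch with made-up α and β values, purely for illustration:

```python
import math

# Latency-bandwidth model from the slides: alpha = latency per message,
# beta = time per word (inverse bandwidth).
def message_time(n_words, alpha, beta):
    return alpha + n_words * beta

# Sequential broadcast: P0 sends the same n-word message P-1 times.
def bcast_sequential(p, n, alpha, beta):
    return (p - 1) * message_time(n, alpha, beta)

# Binary-tree broadcast: one message time per tree level.
def bcast_tree(p, n, alpha, beta):
    return math.ceil(math.log2(p)) * message_time(n, alpha, beta)

# Example (invented numbers): alpha = 10, beta = 1, 8 processors, 4 words.
seq = bcast_sequential(8, 4, 10, 1)  # 7 * (10 + 4) = 98
tree = bcast_tree(8, 4, 10, 1)       # 3 * (10 + 4) = 42
```

The same α/β accounting is what the later slides use to compare the 1D, Cannon, and SUMMA matrix-multiply algorithms.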