U of U CS 7810 - Lecture 14 - Parallel Algorithms



1. Lecture 14: Parallel Algorithms
• Topics: sort, matrix, graph algorithms

2. Processor Model
• High communication latencies → pursue coarse-grain parallelism (the focus of the course so far)
• For upcoming lectures, focus on fine-grain parallelism
• VLSI improvements → enough transistors to accommodate numerous processing units on a chip and (relatively) low communication latencies
• Consider a special-purpose processor with thousands of processing units, each with small-bit ALUs and limited register storage

3. Sorting on a Linear Array
• Each processor has bidirectional links to its neighbors
• All processors share a single clock (asynchronous designs will require minor modifications)
• At each clock, processors receive inputs from neighbors, perform computations, generate output for neighbors, and update local storage
[Figure labels: input, output]

4. Control at Each Processor
• Each processor stores the minimum number it has seen
• Initial value in storage and on network is “∗”, which is bigger than any input and also means “no signal”
• On receiving number Y from left neighbor, the processor keeps the smaller of Y and current storage Z, and passes the larger to the right neighbor

5. Sorting Example

6. Result Output
• The output process begins when a processor receives a non-∗, followed by a “∗”
• Each processor forwards its storage to its left neighbor and subsequent data it receives from right neighbors
• How many steps does it take to sort N numbers?
• What is the speedup and efficiency?

7. Output Example

8. Bit Model
• The bit model affords a more precise measure of complexity – we will now assume that each processor can only operate on a bit at a time
• To compare N k-bit words, you may now need an N x k 2-d array of bit processors

9. Comparison Strategies
• Strategy 1: Bits travel horizontally, keep/swap signals travel vertically – after at most 2k steps, each processor knows which number must be moved to the right – 2kN steps in the worst case
• Strategy 2: Use a tree to communicate information on which number is greater – after 2 log k steps, each processor knows which number must be moved to the right – 2N log k steps
• Can we do better?

10. Strategy 2: Column of Trees

11. Pipelined Comparison
Input numbers: 3 4 2
0 1 0
1 0 1
1 0 0

12. Complexity
• How long does it take to sort N k-bit numbers? (2N – 1) + (k – 1) + N (for output)
• (With a 2d array of processors) Can we do even better?
• How do we prove optimality?

13. Lower Bounds
• Input/Output bandwidth: Nk bits are being input/output with k pins – requires Ω(N) time
• Diameter: the comparison at processor (1,1) influences the value of the bit stored at processor (N,k) – for example, N-1 numbers are 011..1 and the last number is either 00…0 or 10…0 – it takes at least N+k-2 steps for information to travel across the diameter
• Bisection width: if processors in one half require the results computed by the other half, the bisection bandwidth imposes a minimum completion time

14. Counter Example
• N 1-bit numbers that need to be sorted with a binary tree
• Since bisection bandwidth is 2 and each number may be in the wrong half, will any algorithm take at least N/2 steps?

15. Counting Algorithm
• It takes O(log N) time for each intermediate node to add the contents in the subtree and forward the result to the parent, one bit at a time
• After the root has computed the number of 1’s, this number is communicated to the leaves – the leaves accordingly set their output to 0 or 1
• Each half only needs to know the number of 1’s in the other half (log N - 1 bits) – therefore, the algorithm takes Ω(log N) time
• Careful when estimating lower bounds!

16. Matrix Algorithms
• Consider matrix-vector multiplication: $y_i = \sum_j a_{ij} x_j$
• The sequential algorithm takes $2N^2 - N$ operations
• With an N-cell linear array, can we implement matrix-vector multiplication in O(N) time?

17. Matrix Vector Multiplication
Number of steps = ?

18. Matrix Vector Multiplication
Number of steps = 2N – 1

19. Matrix-Matrix Multiplication
Number of time steps = ?

20. Matrix-Matrix Multiplication
Number of time steps = 3N – 2

21. Complexity
• The algorithm implementations on the linear arrays have speedups that are linear in the number of processors – an efficiency of O(1)
• It is possible to improve these algorithms by a constant factor, for example, by inputting values directly to each processor in the first step and providing wraparound edges (N time steps)

22. Solving Systems of Equations
• Given an N x N lower triangular matrix A and an N-vector b, solve for x, where Ax = b (assume solution exists)
$a_{11}x_1 = b_1$
$a_{21}x_1 + a_{22}x_2 = b_2$, and so on…

23. Equation Solver

24. Equation Solver Example
• When an x, b, and a meet at a cell, ax is subtracted from b
• When b and a meet at cell 1, b is divided by a to become x

25. Complexity
• Time steps = 2N – 1
• Speedup = O(N), efficiency = O(1)
• Note that half the processors are idle every time step – can improve efficiency by solving two interleaved equation systems simultaneously

26. Gaussian Elimination
• Solving for x, where Ax = b and A is a nonsingular matrix
• Note that $A^{-1}Ax = A^{-1}b = x$; keep applying transformations to A such that A becomes I; the same transformations applied to b will result in the solution for x
• Sequential algorithm steps:
  - Pick a row where the first (ith) element is non-zero and normalize the row so that the first (ith) element is 1
  - Subtract a multiple of this row from all other rows so that their first (ith) element is zero
  - Repeat for all i

27. Sequential Example
$$\begin{bmatrix} 2 & 4 & -7 \\ 3 & 6 & -10 \\ -1 & 3 & -4 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} 3 \\ 4 \\ 6 \end{bmatrix}$$
$$\begin{bmatrix} 1 & 2 & -7/2 \\ 3 & 6 & -10 \\ -1 & 3 & -4 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} 3/2 \\ 4 \\ 6 \end{bmatrix}$$
$$\begin{bmatrix} 1 & 2 & -7/2 \\ 0 & 0 & 1/2 \\ -1 & 3 & -4 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} 3/2 \\ -1/2 \\ 6 \end{bmatrix}$$
1 2 -7/2 x1
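The sequential steps listed under Gaussian Elimination can be sketched in Python as follows. This is a minimal illustration, not the lecture's own code: the function name `gauss_jordan` is hypothetical, exact `Fraction` arithmetic is an assumption chosen so the intermediate rows match the fractions in the Sequential Example, and A is assumed nonsingular as the slide states.

```python
from fractions import Fraction


def gauss_jordan(a, b):
    """Solve Ax = b by reducing A to the identity (A assumed nonsingular)."""
    n = len(a)
    # Augmented matrix [A | b] with exact rational entries.
    m = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(a, b)]
    for i in range(n):
        # Pick a not-yet-used row whose ith element is non-zero.
        # (Raises StopIteration if A is singular, which the slide rules out.)
        pivot = next(r for r in range(i, n) if m[r][i] != 0)
        m[i], m[pivot] = m[pivot], m[i]
        # Normalize the row so that its ith element is 1.
        p = m[i][i]
        m[i] = [v / p for v in m[i]]
        # Subtract a multiple of this row from all other rows
        # so that their ith element becomes zero.
        for r in range(n):
            if r != i and m[r][i] != 0:
                f = m[r][i]
                m[r] = [vr - f * vi for vr, vi in zip(m[r], m[i])]
    # A is now I, so the last column holds x.
    return [row[n] for row in m]


# The system from the Sequential Example slide.
x = gauss_jordan([[2, 4, -7], [3, 6, -10], [-1, 3, -4]], [3, 4, 6])
print(x)
```

Substituting the returned values back into the three original equations is a quick way to check the result.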

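The per-processor rule from the Control at Each Processor slide (keep the smaller of Y and Z, pass the larger to the right) can be simulated clock by clock. The sketch below is an assumption-laden illustration: `linear_array_sort`, `STAR`, and the one-hop-per-clock link timing are hypothetical choices, one number is fed into the leftmost processor per clock, and only the compare phase is modeled, not the output phase from the Result Output slide.

```python
STAR = float("inf")  # stands in for "*": bigger than any input, means "no signal"


def linear_array_sort(values):
    """Simulate the compare phase on a linear array with one processor per value.

    Returns (final storage of each processor, number of clocks used).
    """
    n = len(values)
    store = [STAR] * n        # minimum seen so far by each processor
    arriving = [STAR] * n     # value arriving at each processor on this clock
    feed = list(values)       # numbers still waiting to enter from the left
    clocks = 0
    # Run while inputs remain or some number is still travelling right.
    while feed or any(v != STAR for v in arriving[1:]):
        arriving[0] = feed.pop(0) if feed else STAR
        outgoing = []
        for p in range(n):
            y = arriving[p]
            # Keep the smaller of Y and current storage Z, pass the larger.
            outgoing.append(max(store[p], y))
            store[p] = min(store[p], y)
        # Whatever processor p passed right reaches p + 1 on the next clock.
        arriving = [STAR] + outgoing[:-1]
        clocks += 1
    return store, clocks


print(linear_array_sort([3, 1, 2]))
```

Under these timing assumptions the last number enters at clock N and the largest value reaches the rightmost processor at clock 2N - 1, so the simulation reports 2N - 1 clocks for the compare phase on N inputs.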

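The two cell rules from the Equation Solver Example slide (subtract ax from b when they meet; divide b by a at cell 1 to produce x) correspond to ordinary forward substitution on a lower triangular system. The following is a sequential reference sketch only: it does not reproduce the linear-array data flow, the name `forward_substitution` and the sample matrix are hypothetical, and a solution is assumed to exist (non-zero diagonal).

```python
from fractions import Fraction


def forward_substitution(a, b):
    """Solve Ax = b for lower triangular A (diagonal assumed non-zero)."""
    x = []
    for i in range(len(b)):
        residual = Fraction(b[i])
        # Each already-known x_j contributes a_ij * x_j, which is subtracted from b_i.
        for j in range(i):
            residual -= a[i][j] * x[j]
        # What is left of b_i is divided by the diagonal entry to become x_i.
        x.append(residual / a[i][i])
    return x


# Hypothetical 3 x 3 lower triangular system, chosen for illustration.
a = [[2, 0, 0],
     [1, 3, 0],
     [4, -1, 5]]
b = [4, 11, 10]
print(forward_substitution(a, b))
```

The sequential version performs the same subtractions and divisions as the array; the array's benefit is that the subtractions for different rows overlap in time.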