New version page

U of U CS 7810 - Parallel Algorithms

This preview shows page 1-2-3-22-23-24-44-45-46 out of 46 pages.

View Full Document
View Full Document

End of preview. Want to read all 46 pages?

Upload your study docs or become a GradeBuddy member to access this document.

View Full Document
Unformatted text preview:

Slide 1Slide 2Slide 3Slide 4Slide 5Slide 6Slide 7Slide 8Slide 9Slide 10Slide 11Slide 12Slide 13Slide 14Slide 15Slide 16Slide 17Slide 18Slide 19Slide 20Slide 21Slide 22Slide 23Slide 24Slide 25Slide 26Slide 27Slide 28Slide 29Slide 30Slide 31Slide 32Slide 33Slide 34Slide 35Slide 36Slide 37Slide 38Slide 39Slide 40Slide 41Slide 42Slide 43Slide 44Slide 45Slide 461Lecture 14: Parallel Algorithms• Topics: sort, matrix, graph algorithms2Processor Model• High communication latencies  pursue coarse-grain parallelism (the focus of the course so far)• For upcoming lectures, focus on fine-grain parallelism• VLSI improvements  enough transistors to accommodate numerous processing units on a chip and (relatively) low communication latencies• Consider a special-purpose processor with thousands of processing units, each with small-bit ALUs and limited register storage3Sorting on a Linear Array• Each processor has bidirectional links to its neighbors• All processors share a single clock (asynchronous designs will require minor modifications)• At each clock, processors receive inputs from neighbors, perform computations, generate output for neighbors, and update local storageinputoutput4Control at Each Processor• Each processor stores the minimum number it has seen• Initial value in storage and on network is “”, which is bigger than any input and also means “no signal”• On receiving number Y from left neighbor, the processor keeps the smaller of Y and current storage Z, and passes the larger to the right neighbor5Sorting Example6Result Output• The output process begins when a processor receives a non- , followed by a “”• Each processor forwards its storage to its left neighbor and subsequent data it receives from right neighbors• How many steps does it take to sort N numbers?• What is the speedup and efficiency?7Output Example8Bit Model• The bit model affords a more precise measure of complexity – we will now assume that each processor can only operate on a bit at a time• To compare N k-bit words, you may now need an N x k 2-d array of bit processors9Comparison Strategies• Strategy 1: Bits travel horizontally, keep/swap signals travel vertically – after at most 2k steps, each processor knows which number must be moved to the right – 2kN steps in the worst case• Strategy 2: Use a tree to communicate information on which number is greater – after 2logk steps, each processor knows which number must be moved to the right – 2Nlogk steps• Can we do better?10Strategy 2: Column of Trees11Pipelined ComparisonInput numbers: 3 4 2 0 1 0 1 0 1 1 0 012Complexity• How long does it take to sort N k-bit numbers? (2N – 1) + (k – 1) + N (for output)• (With a 2d array of processors) Can we do even better? • How do we prove optimality?13Lower Bounds• Input/Output bandwidth: Nk bits are being input/output with k pins – requires (N) time• Diameter: the comparison at processor (1,1) influences the value of the bit stored at processor (N,k) – for example, N-1 numbers are 011..1 and the last number is either 00…0 or 10…0 – it takes at least N+k-2 steps for information to travel across the diameter• Bisection width: if processors in one half require the results computed by the other half, the bisection bandwidth imposes a minimum completion time14Counter Example• N 1-bit numbers that need to be sorted with a binary tree• Since bisection bandwidth is 2 and each number may be in the wrong half, will any algorithm take at least N/2 steps?15Counting Algorithm• It takes O(logN) time for each intermediate node to add the contents in the subtree and forward the result to the parent, one bit at a time• After the root has computed the number of 1’s, this number is communicated to the leaves – the leaves accordingly set their output to 0 or 1• Each half only needs to know the number of 1’s in the other half (logN-1 bits) – therefore, the algorithm takes (logN) time• Careful when estimating lower bounds!16Matrix Algorithms• Consider matrix-vector multiplication: yi = j aijxj• The sequential algorithm takes 2N2 – N operations• With an N-cell linear array, can we implement matrix-vector multiplication in O(N) time?17Matrix Vector MultiplicationNumber of steps = ?18Matrix Vector MultiplicationNumber of steps = 2N – 119Matrix-Matrix MultiplicationNumber of time steps = ?20Matrix-Matrix MultiplicationNumber of time steps = 3N – 221Complexity• The algorithm implementations on the linear arrays have speedups that are linear in the number of processors – an efficiency of O(1)• It is possible to improve these algorithms by a constant factor, for example, by inputting values directly to each processor in the first step and providing wraparound edges (N time steps)22Solving Systems of Equations• Given an N x N lower triangular matrix A and an N-vector b, solve for x, where Ax = b (assume solution exists) a11x1 = b1 a21x1 + a22x2 = b2 , and so on…23Equation Solver24Equation Solver Example• When an x, b, and a meet at a cell, ax is subtracted from b• When b and a meet at cell 1, b is divided by a to become x25Complexity• Time steps = 2N – 1• Speedup = O(N), efficiency = O(1)• Note that half the processors are idle every time step – can improve efficiency by solving two interleaved equation systems simultaneously26Gaussian Elimination• Solving for x, where Ax=b and A is a nonsingular matrix• Note that A-1Ax = A-1b = x ; keep applying transformations to A such that A becomes I ; the same transformations applied to b will result in the solution for x• Sequential algorithm steps: Pick a row where the first (ith) element is non-zero and normalize the row so that the first (ith) element is 1 Subtract a multiple of this row from all other rows so that their first (ith) element is zero Repeat for all i27Sequential Example2 4 -7 x1 33 6 -10 x2 = 4-1 3 -4 x3 61 2 -7/2 x1 3/23 6 -10 x2 = 4-1 3 -4 x3 61 2 -7/2 x1 3/20 0 1/2 x2 = -1/2-1 3 -4 x3 61 2 -7/2 x1 3/20 0 1/2 x2 = -1/20 5 -15/2 x3 15/21 2 -7/2 x1 3/20 5 -15/2 x2 = 15/20 0 1/2 x3 -1/21 2 -7/2 x1 3/20 1


View Full Document
Loading Unlocking...
Login

Join to view Parallel Algorithms and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Parallel Algorithms and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?