Parallel Prefix Algorithms, or Tricks with TreesParallel Vector OperationsBroadcast and reductionParallel Prefix AlgorithmsExample of a prefixWhat do you think?A clue?Slide 8Slide 9Slide 10Slide 11Non-recursive view of parallel prefix scanSlide 13Scan (Parallel Prefix) OperationsApplications of scansE.g., Using Scans for Array CompressionE.g., Fibonacci via Matrix Multiply PrefixCarry-Look Ahead Addition (Babbage 1800’s)Slide 19Slide 20Slide 21Slide 22Adding two n-bit integers in O(log n) timeSegmented OperationsSlide 25Slide 26Copy Prefix: x y = x (is associative)Multiplying n-by-n matrices in O(log n) timeInverting dense n-by-n matrices in O(log2 n) timeEvaluating arbitrary expressionsThe myth of log nSummary of tree algorithmsParallel Prefix Algorithms,orTricks with TreesSome slides from Jim Demmel, Kathy Yelick, Alan Edelman, and a cast of thousands …Parallel Vector Operations•Vector add: z = x + y •Embarrassingly parallel if vectors are aligned•DAXPY: z = a*x + y (a is scalar)•Broadcast a, followed by independent * and +•DDOT: s = xTy = j x[j] * y[j]•Independent * followed by + reductionBroadcast and reduction•Broadcast of 1 value to p processors in log p time•Reduction of p values to 1 in log p time•Takes advantage of associativity in +, *, min, max, etc.a8 1 3 1 0 4 -6 3 2Add-reductionBroadcast•A theoretical secret for turning serial into parallel•Surprising parallel algorithms: If “there is no way to parallelize this algorithm!” …•… it’s probably a variation on parallel prefix!Parallel Prefix AlgorithmsExample of a prefixSum Prefix Input x = (x1, x2, . . ., xn) Output y = (y1, y2, . . ., yn) yi = Σj=1:i xjExample x = ( 1, 2, 3, 4, 5, 6, 7, 8 ) y = ( 1, 3, 6, 10, 15, 21, 28, 36)Prefix Functions-- outputs depend upon an initial stringWhat do you think?•Can we really parallelize this?•It looks like this kind of code: y(0) = 0; for i = 1:n y(i) = y(i-1) + x(i); •The ith iteration of the loop depends completely on the (i-1)st iteration. •Work = n, span = n, parallelism = 1.•Impossible to parallelize, right?A clue? x = ( 1, 2, 3, 4, 5, 6, 7, 8 ) y = ( 1, 3, 6, 10, 15, 21, 28, 36)Is there any value in adding, say, 4+5+6+7?If we separately have 1+2+3, what can we do?Suppose we added 1+2, 3+4, etc. pairwise -- what could we do?8 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 3 7 11 15 19 23 27 31 (Recursively compute prefix sums) 3 10 21 36 55 78 105 136 1 3 6 10 15 21 28 36 45 55 66 78 91 105 120 136Prefix sum in parallelAlgorithm: 1. Pairwise sum 2. Recursive prefix 3. Pairwise sum01/15/199• What’s the total work? 1 2 3 4 5 6 7 8 Pairwise sums 3 7 11 15 Recursive prefix 3 10 21 36 Update “odds” 1 3 6 10 15 21 28 36•T1(n) = n/2 + n/2 + T1 (n/2) = n + T1 (n/2) = 2n – 1at the cost of more work!Parallel prefix cost01/15/1910• What’s the total work? 1 2 3 4 5 6 7 8 Pairwise sums 3 7 11 15 Recursive prefix 3 10 21 36 Update “odds” 1 3 6 10 15 21 28 36•T1(n) = n/2 + n/2 + T1 (n/2) = n + T1 (n/2) = 2n – 1 Parallelism at the cost of more work!Parallel prefix cost01/15/19• What’s the total work? 1 2 3 4 5 6 7 8 Pairwise sums 3 7 11 15 Recursive prefix 3 10 21 36 Update “odds” 1 3 6 10 15 21 28 36•T1(n) = n/2 + n/2 + T1 (n/2) = n + T1 (n/2) = 2n – 1•T∞(n) = 2 log n Parallelism at the cost of more work!11Parallel prefix cost: Work and Span12Non-recursive view of parallel prefix scan•Tree summation: two phases•up sweep•get values L and R from left and right child•save L in local variable Mine•compute Tmp = L + R and pass to parent•down sweep•get value Tmp from parent•send Tmp to left child•send Tmp+Mine to right child6543 2 4 1Up sweep: mine = left tmp = left + right46 95 43 1 2 0 4 1 1 36543 2 4 10600 33 4 6 6 10 11 12 15+X = 3 1 2 0 4 1 1 344 6 6 10 116 1112Down sweep: tmp = parent (root is 0) right = tmp + mineAll (and) Any ( or)Input: Bits (Boolean)Sum (+)Product (*)MaxMinInput: RealsAny associative operation worksAssociative:(a b) c = a (b c)MatMulInput: Matrices01/15/1914Scan (Parallel Prefix) Operations•Definition: the parallel prefix operation takes a binary associative operator , and an array of n elements [a0, a1, a2, … an-1] and produces the array [a0, (a0 a1), … (a0 a1 ... an-1)]•Example: add scan of [1, 2, 0, 4, 2, 1, 1, 3] is [1, 3, 3, 7, 9, 10, 11, 14]01/15/1915Applications of scans•Many applications, some more obvious than others•lexically compare strings of characters•add multi-precision numbers•add binary numbers fast in hardware•graph algorithms•evaluate polynomials•implement bucket sort, radix sort, and even quicksort•solve tridiagonal linear systems•solve recurrence relations•dynamically allocate processors•search for regular expression (grep)•image processing primitives01/15/1916E.g., Using Scans for Array Compression•Given an array of n elements [a0, a1, a2, … an-1] and an array of flags [1,0,1,1,0,0,1,…] compress the flagged elements into [a0, a2, a3, a6, …]•Compute an add scan of [0, flags] :[0,1,1,2,3,3,4,…]•Gives the index of the ith element in the compressed array•If the flag for this element is 1, write it into the result array at the given position01/15/1917E.g., Fibonacci via Matrix Multiply PrefixFn+1 = Fn + Fn-11-nnn1nFF 0111FFCan compute all Fn by matmul_prefix on [ , , , , , , , , ]then select the upper left entry 011101110111011101110111011101110111Carry-Look Ahead Addition (Babbage 1800’s)Goal: Add Two n-bit Integers Example 1 0 1 1 1 Carry 1 0 1 1 1 First Int 1 0 1 0 1 Second Int 1 0 1 1 0 0
View Full Document