CS 267 Tricks with TreesOutlineSlide 3Slide 4A log n lower bound to compute any function of n variablesBroadcasts and Reductions on TreesParallel Prefix, or ScanMapping Parallel Prefix onto a Tree - DetailsE.g., Fibonacci via Matrix Multiply PrefixAdding two n-bit integers in O(log n) timeOther applications of scansMultiplying n-by-n matrices in O(log n) timeInverting triangular n-by-n matrices in O(log2 n) timeInverting Dense n-by-n matrices in O(log n) timeEvaluating arbitrary expressionsEvaluating recurrencesImage Segmentation (1/4)Image Segmentation (2/4)Image Segmentation (3/4)Image Segmentation (4/4)Sparse-Matrix-Vector-Multiply (SpMV) y = A*x Using Segmented Scan (SegScan)Page layout in a browserSummary of tree algorithms102/09/2010 CS267 Lecture 7+CS 267Tricks with TreesJames Demmel and Horst Simonwww.cs.berkeley.edu/~demmel/cs267_Spr10202/09/2010 CS267 Lecture 7+Outline°A log n lower bound to compute any function in parallel°Reduction and broadcast in O(log n) time°Parallel prefix (scan) in O(log n) time°Adding two n-bit integers in O(log n) time°Multiplying n-by-n matrices in O(log n) time°Inverting n-by-n triangular matrices in O(log2 n) time°Inverting n-by-n dense matrices in O(log2 n) time°Evaluating arbitrary expressions in O(log n) time°Evaluating recurrences in O(log n) time302/09/2010 CS267 Lecture 7+Outline°A log n lower bound to compute any function in parallel°Reduction and broadcast in O(log n) time°Parallel prefix (scan) in O(log n) time°Adding two n-bit integers in O(log n) time°Multiplying n-by-n matrices in O(log n) time°Inverting n-by-n triangular matrices in O(log2 n) time°Inverting n-by-n dense matrices in O(log2 n) time°Evaluating arbitrary expressions in O(log n) time°Evaluating recurrences in O(log n) time°“2D parallel prefix”, for image segmentation (Bryan Catanzaro, Kurt Keutzer)°Sparse-Matrix-Vector-Multiply (SpMV) using Segmented Scan°Parallel page layout in a browser (Leo Meyerovich, Ras Bodik)402/09/2010 CS267 Lecture 7+Outline°A log n lower bound to compute any function in parallel°Reduction and broadcast in O(log n) time°Parallel prefix (scan) in O(log n) time°Adding two n-bit integers in O(log n) time°Multiplying n-by-n matrices in O(log n) time°Inverting n-by-n triangular matrices in O(log2 n) time°Inverting n-by-n dense matrices in O(log2 n) time°Evaluating arbitrary expressions in O(log n) time°Evaluating recurrences in O(log n) time°“2D parallel prefix”, for image segmentation (Bryan Catanzaro, Kurt Keutzer)°Sparse-Matrix-Vector-Multiply (SpMV) using Segmented Scan°Parallel page layout in a browser (Leo Meyerovich, Ras Bodik)°Solving n-by-n tridiagonal matrices in O(log n) time°Traversing linked lists °Computing minimal spanning trees°Computing convex hulls of point sets502/09/2010 CS267 Lecture 7+A log n lower bound to compute any function of n variables°Assume we can only use binary operations, one per time unit°After 1 time unit, an output can only depend on two inputs°Use induction to show that after k time units, an output can only depend on 2k inputs•After log2 n time units, output depends on at most n inputs°A binary tree performs such a computation602/09/2010 CS267 Lecture 7+Broadcasts and Reductions on Trees702/09/2010 CS267 Lecture 7+Parallel Prefix, or Scan°If “+” is an associative operator, and x[0],…,x[p-1] are input data then parallel prefix operation computes°Notation: j:k mean x[j]+x[j+1]+…+x[k], blue is final valuey[j] = x[0] + x[1] + … + x[j] for j=0,1,…,p-1802/09/2010 CS267 Lecture 7+Mapping Parallel Prefix onto a Tree - Details°Up-the-tree phase (from leaves to root)°By induction, Lsave = sum of all leaves in left subtree°Down the tree phase (from root to leaves)°By induction, S = sum of all leaves to left of subtree rooted at the parent1) Get values L and R from left and right children2) Save L in a local register Lsave3) Pass sum L+R to parent1) Get value S from parent (the root gets 0)2) Send S to the left child3) Send S + Lsave to the right child902/09/2010 CS267 Lecture 7+E.g., Fibonacci via Matrix Multiply PrefixFn+1 = Fn + Fn-11-nnn1nFF 0111FFCan compute all Fn by matmul_prefix on [ , , , , , , , , ]then select the upper left entry 011101110111011101110111011101110111Slide source: Alan Edelman, MIT1002/09/2010 CS267 Lectur+e 7Adding two n-bit integers in O(log n) time°Let a = a[n-1]a[n-2]…a[0] and b = b[n-1]b[n-2]…b[0] be two n-bit binary numbers°We want their sum s = a+b = s[n]s[n-1]…s[0]°Challenge: compute all c[i] in O(log n) time via parallel prefix°Used in all computers to implement addition - Carry look-aheadc[-1] = 0 … rightmost carry bitfor i = 0 to n-1 c[i] = ( (a[i] xor b[i]) and c[i-1] ) or ( a[i] and b[i] ) ... next carry bit s[i] = ( a[i] xor b[i] ) xor c[i-1] for all (0 <= i <= n-1) p[i] = a[i] xor b[i] … propagate bit for all (0 <= i <= n-1) g[i] = a[i] and b[i] … generate bit c[i] = ( p[i] and c[i-1] ) or g[i] = p[i] g[i] * c[i-1] = C[i] * c[i-1] 1 1 0 1 1 1 … 2-by-2 Boolean matrix multiplication (associative) = C[i] * C[i-1] * … C[0] * 0 1 … evaluate each P[i] = C[i] * C[i-1] * … * C[0] by parallel prefix1102/09/2010 CS267 Lecture 7+Other applications of scans°There are many applications of scans, some more obvious than others•add multi-precision numbers (represented as array of numbers)•evaluate recurrences, expressions •solve tridiagonal systems (numerically unstable!)•implement bucket sort and radix sort•to dynamically allocate processors•to search for regular expression (e.g., grep)°Names: +\ (APL), cumsum (Matlab), MPI_SCAN°Note: 2n operations used when only n-1 needed1202/09/2010 CS267 Lecture 7+Multiplying n-by-n matrices in O(log n) time°For all (1 <= i,j,k <= n) P(i,j,k) = A(i,k) * B(k,j)•cost = 1 time unit, using n3 processors°For all (1 <= i,j <= n) C(i,j) = P(i,j,k)•cost = O(log n) time, using a tree with n3 / 2 processorsk =1n1302/09/2010 CS267 Lecture 7+Inverting triangular
View Full Document