CS 267 Tricks with TreesOutlineA log n lower bound to compute any function of n variablesBroadcasts and Reductions on TreesParallel Prefix, or ScanMapping Parallel Prefix onto a Tree - DetailsE.g., Fibonacci via Matrix Multiply PrefixAdding two n-bit integers in O(log n) timeOther applications of scansMultiplying n-by-n matrices in O(log n) timeInverting triangular n-by-n matrices in O(log2 n) timeInverting Dense n-by-n matrices in O(log n) timeEvaluating arbitrary expressionsEvaluating recurrencesSummary of tree algorithms02/07/06 CS267 Lecture 71CS 267Tricks with TreesJames Demmelwww.cs.berkeley.edu/~demmel/cs267_Spr0602/07/06 CS267 Lecture 72Outline°A log n lower bound to compute any function in parallel°Reduction and broadcast in O(log n) time°Parallel prefix (scan) in O(log n) time°Adding two n-bit integers in O(log n) time°Multiplying n-by-n matrices in O(log n) time°Inverting n-by-n triangular matrices in O(log n) time°Inverting n-by-n dense matrices in O(log n) time°Evaluating arbitrary expressions in O(log n) time°Evaluating recurrences in O(log n) time°Solving n-by-n tridiagonal matrices in O(log n) time°Traversing linked lists °Computing minimal spanning trees°Computing convex hulls of point sets2202/07/06 CS267 Lecture 73A log n lower bound to compute any function of n variables°Assume we can only use binary operations, one per time unit°After 1 time unit, an output can only depend on two inputs°Use induction to show that after k time units, an output can only depend on 2k inputs•After log2 n time units, output depends on at most n inputs°A binary tree performs such a computation02/07/06 CS267 Lecture 74Broadcasts and Reductions on Trees02/07/06 CS267 Lecture 75Parallel Prefix, or Scan°If “+” is an associative operator, and x[0],…,x[p-1] are input data then parallel prefix operation computes°Notation: j:k mean x[j]+x[j+1]+…+x[k], blue is final valuey[j] = x[0] + x[1] + … + x[j] for j=0,1,…,p-102/07/06 CS267 Lecture 76Mapping Parallel Prefix onto a Tree - Details°Up-the-tree phase (from leaves to root)°By induction, Lsave = sum of all leaves in left subtree°Down the tree phase (from root to leaves)°By induction, S = sum of all leaves to left of subtree rooted at the parent1) Get values L and R from left and right children2) Save L in a local register Lsave3) Pass sum L+R to parent1) Get value S from parent (the root gets 0)2) Send S to the left child3) Send S + Lsave to the right child02/07/06 CS267 Lecture 77E.g., Fibonacci via Matrix Multiply PrefixFn+1 = Fn + Fn-11-nnn1nFF 0111FFCan compute all Fn by matmul_prefix on [ , , , , , , , , ]then select the upper left entry 011101110111011101110111011101110111Slide source: Alan Edelman, MIT02/07/06 CS267 Lecture 78Adding two n-bit integers in O(log n) time°Let a = a[n-1]a[n-2]…a[0] and b = b[n-1]b[n-2]…b[0] be two n-bit binary numbers°We want their sum s = a+b = s[n]s[n-1]…s[0]°Challenge: compute all c[i] in O(log n) time via parallel prefix°Used in all computers to implement addition - Carry look-aheadc[-1] = 0 … rightmost carry bitfor i = 0 to n-1 c[i] = ( (a[i] xor b[i]) and c[i-1] ) or ( a[i] and b[i] ) ... next carry bit s[i] = ( a[i] xor b[i] ) xor c[i-1] for all (0 <= i <= n-1) p[i] = a[i] xor b[i] … propagate bit for all (0 <= i <= n-1) g[i] = a[i] and b[i] … generate bit c[i] = ( p[i] and c[i-1] ) or g[i] = p[i] g[i] * c[i-1] = C[i] * c[i-1] 1 1 0 1 1 1 … 2-by-2 Boolean matrix multiplication (associative) = C[i] * C[i-1] * … C[0] * 0 1 … evaluate each P[i] = C[i] * C[i-1] * … * C[0] by parallel prefix02/07/06 CS267 Lecture 79Other applications of scans°There are several applications of scans, some more obvious than others•add multi-precision numbers (represented as array of numbers)•evaluate recurrences, expressions •solve tridiagonal systems (numerically unstable!)•implement bucket sort and radix sort•to dynamically allocate processors•to search for regular expression (e.g., grep)°Names: +\ (APL), cumsum (Matlab), MPI_SCAN°Note: 2n operations used when only n-1 needed02/07/06 CS267 Lecture 710Multiplying n-by-n matrices in O(log n) time°For all (1 <= i,j,k <= n) P(i,j,k) = A(i,k) * B(k,j)•cost = 1 time unit, using n^3 processors°For all (1 <= I,j <= n) C(i,j) = P(i,j,k)•cost = O(log n) time, using a tree with n^3 / 2 processorsk =1n02/07/06 CS267 Lecture 711Inverting triangular n-by-n matrices in O(log2 n) time°Fact:°Function Tri_Inv(T) … assume n = dim(T) = 2m for simplicity°time(Tri_Inv(n)) = time(Tri_Inv(n/2)) + O(log(n))•Change variable to m = log n to get time(Tri_Inv(n)) = O(log2n)A 0C B-1= A 0-B CA B-1-1-1 -1If T is 1-by-1 return 1/Telse … Write T = A 0 C B In parallel do { invA = Tri_Inv(A) invB = Tri_Inv(B) } … implicitly uses a tree newC = -invB * C * invA Return invA 0 newC invB02/07/06 CS267 Lecture 712Inverting Dense n-by-n matrices in O(log n) time°Lemma 1: Cayley-Hamilton Theorem•expression for A-1 via characteristic polynomial in A°Lemma 2: Newton’s Identities•Triangular system of equations for coefficients of characteristic polynomial, matrix entries = sk°Lemma 3: sk = trace(Ak) = Ak [i,i] = [i (A)]k°Csanky’s Algorithm (1976)°Completely numerically unstable2i=1ni=1n1) Compute the powers A2, A3, …,An-1 by parallel prefix cost = O(log2 n)2) Compute the traces sk = trace(Ak) cost = O(log n)3) Solve Newton identities for coefficients of characteristic polynomial cost = O(log2 n)4) Evaluate A-1 using Cayley-Hamilton Theorem cost = O(log n)02/07/06 CS267 Lecture 713Evaluating arbitrary expressions°Let E be an arbitrary expression formed from +, -, *, /, parentheses, and n variables, where each appearance of each variable is counted separately°Can think of E as arbitrary expression tree with n leaves (the variables) and
View Full Document