The Smith-Waterman algorithm Idea: Ignore badly aligning regions Modifications to Needleman-Wunsch: Initialization: F(0, j) = F(i, 0) = 0 0 Iteration: F(i, j) = max F(i – 1, j) – d F(i, j – 1) – d F(i – 1, j – 1) + s(xi, yj)The Smith-Waterman algorithm Termination: 1. If we want the best local alignment… FOPT = maxi,j F(i, j) Find FOPT and trace back 2. If we want all local alignments scoring > t ?? For all i, j find F(i, j) > t, and trace back? Complicated by overlapping local alignments Waterman–Eggert ’87: find all non-overlapping local alignments with minimal recalculation of the DP matrixScoring the gaps more accurately Current model: Gap of length n incurs penalty n×d However, gaps usually occur in bunches Concave gap penalty function γ(n) (aka Convex -γ(n)): γ(n): for all n, γ(n + 1) - γ(n) ≤ γ(n) - γ(n – 1) γ(n) γ(n)Convex gap dynamic programming Initialization: same Iteration: F(i – 1, j – 1) + s(xi, yj) F(i, j) = max maxk=0…i-1F(k, j) – γ(i – k) maxk=0…j-1F(i, k) – γ(j – k) Termination: same Running Time: O(N2M) (assume N>M) Space: O(NM)Compromise: affine gaps γ(n) = d + (n – 1)×e | | gap gap open extend To compute optimal alignment, At position i, j, need to “remember” best score if gap is open best score if gap is not open F(i, j): score of alignment x1…xi to y1…yj if xi aligns to yj G(i, j): score if xi aligns to a gap after yj H(i, j): score if yj aligns to a gap after xi V(i, j) = best score of alignment x1…xi to y1…yj d e γ(n)Needleman-Wunsch with affine gaps Why do we need matrices F, G, H? • xi aligns to yj x1……xi-1 xi xi+1 y1……yj-1 yj - • xi aligns to a gap after yj x1……xi-1 xi xi+1 y1……yj …- - Add -d Add -e G(i+1, j) = F(i, j) – d G(i+1, j) = G(i, j) – e Because, perhaps G(i, j) < V(i, j) (it is best to align xi to yj if we were aligning only x1…xi to y1…yj and not the rest of x, y), but on the contrary G(i, j) – e > V(i, j) – d (i.e., had we “fixed” our decision that xi aligns to yj, we could regret it at the next step when aligning x1…xi+1 to y1…yj)Needleman-Wunsch with affine gaps Initialization: V(i, 0) = d + (i – 1)×e V(0, j) = d + (j – 1)×e Iteration: V(i, j) = max{ F(i, j), G(i, j), H(i, j) } F(i, j) = V(i – 1, j – 1) + s(xi, yj) V(i – 1, j) – d G(i, j) = max G(i – 1, j) – e V(i, j – 1) – d H(i, j) = max H(i, j – 1) – e Termination: V(i, j) has the best alignment Time? Space?To generalize a bit… … think of how you would compute optimal alignment with this gap function ….in time O(MN) γ(n)Bounded Dynamic Programming Assume we know that x and y are very similar Assumption: # gaps(x, y) < k(N) xi Then, | implies | i – j | < k(N) yj We can align x and y more efficiently: Time, Space: O(N × k(N)) << O(N2)Bounded Dynamic Programming Initialization: F(i,0), F(0,j) undefined for i, j > k Iteration: For i = 1…M For j = max(1, i – k)…min(N, i+k) F(i – 1, j – 1)+ s(xi, yj) F(i, j) = max F(i, j – 1) – d, if j > i – k(N) F(i – 1, j) – d, if j < i + k(N) Termination: same Easy to extend to the affine gap case x1 ………………………… xM y1 ………………………… yN k(N)Outline • Linear-Space Alignment • BLAST – local alignment search • Ultra-fast alignment for (human) genome resequencingLinear-Space AlignmentSubsequences and Substrings Definition A string x’ is a substring of a string x, if x = ux’v for some prefix string u and suffix string v (similarly, x’ = xi…xj, for some 1 ≤ i ≤ j ≤ |x|) A string x’ is a subsequence of a string x if x’ can be obtained from x by deleting 0 or more letters (x’ = xi1…xik, for some 1 ≤ i1 ≤ … ≤ ik ≤ |x|) Note: a substring is always a subsequence Example: x = abracadabra y = cadabr; substring z = brcdbr; subseqence, not substringHirschberg’s algortihm Given a set of strings x, y,…, a common subsequence is a string u that is a subsequence of all strings x, y, … • Longest common subsequence Given strings x = x1 x2 … xM, y = y1 y2 … yN, Find longest common subsequence u = u1 … uk • Algorithm: F(i – 1, j) • F(i, j) = max F(i, j – 1) F(i – 1, j – 1) + [1, if xi = yj; 0 otherwise] • Ptr(i, j) = (same as in N-W) • Termination: trace back from Ptr(M, N), and prepend a letter to u whenever • Ptr(i, j) = DIAG and F(i – 1, j – 1) < F(i, j) • Hirschberg’s original algorithm solves this in linear spaceF(i,j) Introduction: Compute optimal score It is easy to compute F(M, N) in linear space Allocate ( column[1] ) Allocate ( column[2] ) For i = 1….M If i > 1, then: Free( column[ i – 2 ] ) Allocate( column[ i ] ) For j = 1…N F(i, j) = …Linear-space alignment To compute both the optimal score and the optimal alignment: Divide & Conquer approach: Notation: xr, yr: reverse of x, y E.g. x = accgg; xr = ggcca Fr(i, j): optimal score of aligning xr1…xri & yr1…yrj same as aligning xM-i+1…xM & yN-j+1…yNLinear-space alignment Lemma: (assume M is even) F(M, N) = maxk=0…N( F(M/2, k) + Fr(M/2, N – k) ) x y M/2 k* F(M/2, k) Fr(M/2, N – k) Example: ACC-GGTGCCCAGGACTG--CAT ACCAGGTG----GGACTGGGCAG k* = 8Linear-space alignment • Now, using 2 columns of space, we can compute for k = 1…M, F(M/2, k), Fr(M/2, N – k) PLUS the backpointers x1 … xM/2 y1 xM yN x1 … xM/2+1 xM … y1 yN …Linear-space alignment • Now, we can find k* maximizing F(M/2, k) + Fr(M/2, N-k) • Also, we can trace the path exiting column M/2 from k* k* k*+1 0 1 …… M/2 M/2+1 …… M M+1Linear-space alignment • Iterate this procedure to the left and right! N-k* M/2 M/2 k*Linear-space alignment Hirschberg’s Linear-space algorithm: MEMALIGN(l, l’, r, r’): (aligns xl…xl’ with yr…yr’) 1. Let h = ⎡(l’-l)/2⎤ 2. Find (in Time O((l’ – l) × (r’ – r)), Space O(r’ – r)) the optimal path, Lh, entering column h – 1, exiting
View Full Document