Stanford CS 262 - The Smith-Waterman algorithm - D1977682

Home> Schools> Stanford University> Computer Science (CS) > CS 262> The Smith-Waterman algorithm

DOC PREVIEW

Stanford CS 262 - The Smith-Waterman algorithm

School name Stanford University

Course Cs 262- Computational Genomics

Pages 31

This preview shows page 1-2-14-15-30-31 out of 31 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 31 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 31 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 31 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 31 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 31 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 31 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 31 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

The Smith-Waterman algorithm Idea: Ignore badly aligning regions Modifications to Needleman-Wunsch: Initialization: F(0, j) = F(i, 0) = 0 0 Iteration: F(i, j) = max F(i – 1, j) – d F(i, j – 1) – d F(i – 1, j – 1) + s(xi, yj)The Smith-Waterman algorithm Termination: 1. If we want the best local alignment… FOPT = maxi,j F(i, j) Find FOPT and trace back 2. If we want all local alignments scoring > t ?? For all i, j find F(i, j) > t, and trace back? Complicated by overlapping local alignments Waterman–Eggert ’87: find all non-overlapping local alignments with minimal recalculation of the DP matrixScoring the gaps more accurately Current model: Gap of length n incurs penalty n×d However, gaps usually occur in bunches Concave gap penalty function γ(n) (aka Convex -γ(n)): γ(n): for all n, γ(n + 1) - γ(n) ≤ γ(n) - γ(n – 1) γ(n) γ(n)Convex gap dynamic programming Initialization: same Iteration: F(i – 1, j – 1) + s(xi, yj) F(i, j) = max maxk=0…i-1F(k, j) – γ(i – k) maxk=0…j-1F(i, k) – γ(j – k) Termination: same Running Time: O(N2M) (assume N>M) Space: O(NM)Compromise: affine gaps γ(n) = d + (n – 1)×e | | gap gap open extend To compute optimal alignment, At position i, j, need to “remember” best score if gap is open best score if gap is not open F(i, j): score of alignment x1…xi to y1…yj if xi aligns to yj G(i, j): score if xi aligns to a gap after yj H(i, j): score if yj aligns to a gap after xi V(i, j) = best score of alignment x1…xi to y1…yj d e γ(n)Needleman-Wunsch with affine gaps Why do we need matrices F, G, H? • xi aligns to yj x1……xi-1 xi xi+1 y1……yj-1 yj - • xi aligns to a gap after yj x1……xi-1 xi xi+1 y1……yj …- - Add -d Add -e G(i+1, j) = F(i, j) – d G(i+1, j) = G(i, j) – e Because, perhaps G(i, j) < V(i, j) (it is best to align xi to yj if we were aligning only x1…xi to y1…yj and not the rest of x, y), but on the contrary G(i, j) – e > V(i, j) – d (i.e., had we “fixed” our decision that xi aligns to yj, we could regret it at the next step when aligning x1…xi+1 to y1…yj)Needleman-Wunsch with affine gaps Initialization: V(i, 0) = d + (i – 1)×e V(0, j) = d + (j – 1)×e Iteration: V(i, j) = max{ F(i, j), G(i, j), H(i, j) } F(i, j) = V(i – 1, j – 1) + s(xi, yj) V(i – 1, j) – d G(i, j) = max G(i – 1, j) – e V(i, j – 1) – d H(i, j) = max H(i, j – 1) – e Termination: V(i, j) has the best alignment Time? Space?To generalize a bit… … think of how you would compute optimal alignment with this gap function ….in time O(MN) γ(n)Bounded Dynamic Programming Assume we know that x and y are very similar Assumption: # gaps(x, y) < k(N) xi Then, | implies | i – j | < k(N) yj We can align x and y more efficiently: Time, Space: O(N × k(N)) << O(N2)Bounded Dynamic Programming Initialization: F(i,0), F(0,j) undefined for i, j > k Iteration: For i = 1…M For j = max(1, i – k)…min(N, i+k) F(i – 1, j – 1)+ s(xi, yj) F(i, j) = max F(i, j – 1) – d, if j > i – k(N) F(i – 1, j) – d, if j < i + k(N) Termination: same Easy to extend to the affine gap case x1 ………………………… xM y1 ………………………… yN k(N)Outline • Linear-Space Alignment • BLAST – local alignment search • Ultra-fast alignment for (human) genome resequencingLinear-Space AlignmentSubsequences and Substrings Definition A string x’ is a substring of a string x, if x = ux’v for some prefix string u and suffix string v (similarly, x’ = xi…xj, for some 1 ≤ i ≤ j ≤ |x|) A string x’ is a subsequence of a string x if x’ can be obtained from x by deleting 0 or more letters (x’ = xi1…xik, for some 1 ≤ i1 ≤ … ≤ ik ≤ |x|) Note: a substring is always a subsequence Example: x = abracadabra y = cadabr; substring z = brcdbr; subseqence, not substringHirschberg’s algortihm Given a set of strings x, y,…, a common subsequence is a string u that is a subsequence of all strings x, y, … • Longest common subsequence  Given strings x = x1 x2 … xM, y = y1 y2 … yN,  Find longest common subsequence u = u1 … uk • Algorithm: F(i – 1, j) • F(i, j) = max F(i, j – 1) F(i – 1, j – 1) + [1, if xi = yj; 0 otherwise] • Ptr(i, j) = (same as in N-W) • Termination: trace back from Ptr(M, N), and prepend a letter to u whenever • Ptr(i, j) = DIAG and F(i – 1, j – 1) < F(i, j) • Hirschberg’s original algorithm solves this in linear spaceF(i,j) Introduction: Compute optimal score It is easy to compute F(M, N) in linear space Allocate ( column[1] ) Allocate ( column[2] ) For i = 1….M If i > 1, then: Free( column[ i – 2 ] ) Allocate( column[ i ] ) For j = 1…N F(i, j) = …Linear-space alignment To compute both the optimal score and the optimal alignment: Divide & Conquer approach: Notation: xr, yr: reverse of x, y E.g. x = accgg; xr = ggcca Fr(i, j): optimal score of aligning xr1…xri & yr1…yrj same as aligning xM-i+1…xM & yN-j+1…yNLinear-space alignment Lemma: (assume M is even) F(M, N) = maxk=0…N( F(M/2, k) + Fr(M/2, N – k) ) x y M/2 k* F(M/2, k) Fr(M/2, N – k) Example: ACC-GGTGCCCAGGACTG--CAT ACCAGGTG----GGACTGGGCAG k* = 8Linear-space alignment • Now, using 2 columns of space, we can compute for k = 1…M, F(M/2, k), Fr(M/2, N – k) PLUS the backpointers x1 … xM/2 y1 xM yN x1 … xM/2+1 xM … y1 yN …Linear-space alignment • Now, we can find k* maximizing F(M/2, k) + Fr(M/2, N-k) • Also, we can trace the path exiting column M/2 from k* k* k*+1 0 1 …… M/2 M/2+1 …… M M+1Linear-space alignment • Iterate this procedure to the left and right! N-k* M/2 M/2 k*Linear-space alignment Hirschberg’s Linear-space algorithm: MEMALIGN(l, l’, r, r’): (aligns xl…xl’ with yr…yr’) 1. Let h = ⎡(l’-l)/2⎤ 2. Find (in Time O((l’ – l) × (r’ – r)), Space O(r’ – r)) the optimal path, Lh, entering column h – 1, exiting

View Full Document