DOC PREVIEW
UMD CMSC 423 - Gap Penalties

This preview shows page 1-2-3-4-5-6 out of 17 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 17 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 17 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 17 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 17 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 17 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 17 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 17 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

Gap PenaltiesCMSC 423General Gap Penalties•Now, the cost of a run of k gaps is GAP × k•A solution to the problem above is to support general gap penalty, so that the score of a run of k gaps is gap(k) < GAP × k.•Then, the optimization will prefer to group gaps together.AAAGAATTCAA-A-A-T-CAAAAGAATTCAAAA----TCAvs.These have the same score, but the second one is often more plausible.A single insertion of “GAAT” into the first string could change it into the second.General Gap PenaltiesAAAGAATTCAA-A-A-T-CAAAAGAATTCAAAA----TCAvs.Previous DP no longer works with general gap penalties because the score of the last character depends on details of the previous alignment:AAAGAACAAA----AAAGAATCAAA-----vs.Instead, we need to “know” how the previous alignment ends in order to give a score to the last subproblem.Three MatricesWe now keep 3 different matrices: M[i,j] = score of best alignment of x[1..i] and y[1..j] ending with a character-character match or mismatch.X[i,j] = score of best alignment of x[1..i] and y[1..j] ending with a space in X.Y[i,j] = score of best alignment of x[1..i] and y[1..j] ending with a space in Y.M[i, j] = match(i, j) + maxM[i − 1,j − 1]X[i − 1,j − 1]Y [i − 1,j − 1]X[i, j] = max�M[i, j − k] − gap(k) for 1 ≤ k ≤ jY [i, j − k] − gap(k) for 1 ≤ k ≤ jY [i, j] = max�M[i − k, j ] − gap(k) for 1 ≤ k ≤ iX[i − k, j ] − gap(k) for 1 ≤ k ≤ iThe M MatrixM[i, j] = match(i, j) + maxM[i − 1,j − 1]X[i − 1,j − 1]Y [i − 1,j − 1]We now keep 3 different matrices: M[i,j] = score of best alignment of x[1..i] and y[1..j] ending with a character-character match or mismatch.X[i,j] = score of best alignment of x[1..i] and y[1..j] ending with a space in X.Y[i,j] = score of best alignment of x[1..i] and y[1..j] ending with a space in Y.By definition, alignment ends in a match.AGAny kind of alignment is allowed before the match.G---ACGTThe X (and Y) matricesX[i, j] = max�M[i, j − k] − gap(k) for 1 ≤ k ≤ jY [i, j − k] − gap(k) for 1 ≤ k ≤ ji-Gj-k jkxyG----CGTi-Gj-k jkxyk decides how long to make the gap. We have to make the whole gap at once in order to know how to score it.G---ACGTThe X (and Y) matricesX[i, j] = max�M[i, j − k] − gap(k) for 1 ≤ k ≤ jY [i, j − k] − gap(k) for 1 ≤ k ≤ ji-Gj-k jkxyG----CGTi-Gj-k jkxy----GCGTi-Gj-k jkxyThis case is automatically handled.k decides how long to make the gap. We have to make the whole gap at once in order to know how to score it.Running Time for Gap PenaltiesRuntime:•Assume |X| = |Y| = n for simplicity: 3n2 subproblems•2n2 subproblems take O(n) time to solve (because we have to try all k)󲰛O(n3) total timeM[i, j] = match(i, j) + maxM[i − 1,j − 1]X[i − 1,j − 1]Y [i − 1,j − 1]X[i, j] = max�M[i, j − k] − gap(k) for 1 ≤ k ≤ jY [i, j − k] − gap(k) for 1 ≤ k ≤ jY [i, j] = max�M[i − k, j ] − gap(k) for 1 ≤ k ≤ iX[i − k, j ] − gap(k) for 1 ≤ k ≤ iFinal score is max {M[n,m], X[n,m], Y[n,m]}. How do you do the traceback?Affine Gap Penalties•O(n3) for general gap penalties is usually too slow... •We can still encourage spaces to group together using a special case of general penalties called affine gap penalties:gap_start = the cost of starting a gapgap_extend = the cost of extending a gap by one more space•Same idea of using 3 matrices, but now we don’t need to search over all gap lengths, we just have to know whether we are starting a new gap or not.Affine Gap PenaltiesX[i, j] = maxgap start + gap extend + M[i, j − 1]gap extend + X[i, j − 1]gapstart + gap extend + Y [i, j − 1]Y [i, j] = maxgap start + gap extend + M[i − 1,j]gap start + gap extend + X[i − 1,j]gapextend + Y [i − 1,j]M[i, j] = match(i, j) + maxM[i − 1,j − 1]X[i − 1,j − 1]Y [i − 1,j − 1]gap in xgap in ymatch betweenx and yIf previous alignment ends in match, this is a new gapAffine Base Cases•M[0, i] = “score of best alignment between 0 characters of x and i characters of y that ends in a match” = -∞ because no such alignment can exist.•X[0, i] = “score of best alignment between 0 characters of x and i characters of y that ends in a gap in x” = gap_start + i × gap_extend because this alignment looks like: •X[i, 0] = “score of best alignment between i characters of x and 0 characters of y that ends in a gap in X” = -∞•M[i, 0] = M[0, i] and Y[0, i] and Y[i,0] are computed using the same logic as X[i,0] and X[0,i]---------yyyyyyyyyxxxxxxxxx-----------← not allowedAffine Gap Runtime•3n2 subproblems•Each one takes constant time•Total runtime O(n2), back to the run time of the basic running time.Why do you “need” 3 matrices?•Alternative WRONG algorithm:M[i][j] = max( M[i-1][j-1] + cost(x[i], y[i]), M[i-1][j] + gap + (gap_start if Arrow[i-1][j] != ), M[j][i-1] + gap + (gap_start if Arrow[i][j-1] != ))Intuition: we only need to know whether we are starting a gap or extending a gap.The arrows coming out of each subproblem tell us how the best alignment ends, so we can use them to decide if we are starting a new gap.The best alignment up to this cell ends in a match.The best alignment up to this cell ends in a gap.PROBLEM: The best alignment for strings x[1..i] and y[1..j] doesn’t have to be used in the best alignment between x[1..i+1] and y[1..j+1]Why 3 Matrices: ExampleCARTCA-Tmatch = 10, mismatch = -2, gap = -7, gap_start = -15OPT(4, 3) = optimal score = 30 - 15 - 7 = 8CARTSCA-T-WRONG(5, 3) = 30 - 15 - 7 - 15 - 7CARTSCAT--OPT(5, 3) = 20 - 2 - 15 - 14 this is why we need to keep the X and Y matrices around. they tell us the score of ending with a gap in one of the sequences.= -14= -11Side Note: Lower Bounds•Suppose the lengths of x and y are n.•Clearly, need at least Ω(n) time to find their global alignment (have to read the strings!)•The DP algorithms show global alignment can be done in O(n2) time.•A trick called the “Four Russians Speedup” can make a similar dynamic programming algorithm run in O(n2 / log n) time. •We won’t talk about the Four Russian’s Speedup, but it’s in your book in Sections 7.3 & 7.4.•Open questions: Can we do better? Can we prove that we can’t do better? No one knows...Recap•Semiglobal alignment: 0 initial columns or take maximums over last row or column.•Local alignment: extra “0” case.•General gap penalties


View Full Document

UMD CMSC 423 - Gap Penalties

Documents in this Course
Midterm

Midterm

8 pages

Lecture 7

Lecture 7

15 pages

Load more
Download Gap Penalties
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Gap Penalties and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Gap Penalties 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?