Unformatted text preview:

CSE 397-497: Computational Issues in Molecular BiologyLopresti · Spring 2004 · Lecture 6- 1 -CSE 397-497:Computational Issues inMolecular BiologyLecture 6Spring 2004CSE 397-497: Computational Issues in Molecular BiologyLopresti · Spring 2004 · Lecture 6- 2 -Topics for todayBased on premise that algorithms we've studied are too slow:Note: we can understand a little bit of the "why," but finer details go beyond a computer scientist's level of understanding.• Faster method for global comparison when sequences are similar (§3.3.4).•PAM matrices (§3.5.1).• BLAST (§3.5.2 + Altschul et al. '90 paper).•FAST (§3.5.3 + Lipman & Pearson '85 paper).Even this might not be fast enough. Also, biologists more commonly need local comparison. What they really use is:CSE 397-497: Computational Issues in Molecular BiologyLopresti · Spring 2004 · Lecture 6- 3 -SpeedRecall that all of the dynamic programming algorithms for basic sequence comparison we've discussed so far take time O(mn).From a computer science standpoint, this would be considered reasonably efficient under most circumstances.But this isn't fast enough when you consider the lengths of the sequences and the sizes of the databases in question."GenBank continues to grow at an exponential rate with 5.4 million new sequences added over the past 12 months. As of Release 131 in August 2002, GenBank contained over 22.6 billion nucleotide bases from 18.2 million different sequences.”“GenBank,” D. A. Benson, I. Karsch-Mizrachi, D. J. Lipman, J. Ostell and D. L. Wheeler, Nucleic Acids Research, 2003, Vol. 31, No. 1, pp. 23-27.CSE 397-497: Computational Issues in Molecular BiologyLopresti · Spring 2004 · Lecture 6- 4 -Observations•To identifiy all homologous sequences in a database.*• To identify "motifs" with a sequence-similarity significantly better than random chance.In general, we only really care about the comparison when the sequences in question are in some way "similar." E.g.,* In the literature, the terms "similarity'' and "homology'' are often used interchangeably. To be accurate, similarity refers to sequences being similar according to certain statistical criteria, while homology refers to the molecules performing similar biological functions or having evolved from a common ancestor. Although similar sequences tend to perform similar functions, this is by no means guaranteed. The only way to establish homology is through experiment.Two "rules of thumb" we'll be using:•If two sequences are similar, their optimal alignment in a global comparison can't include many deletions or insertions.• If two sequences are similar, they're likely to contain short subsequences that match exactly.CSE 397-497: Computational Issues in Molecular BiologyLopresti · Spring 2004 · Lecture 6- 5 -A faster method for comparing similar sequencesIn global case, when two sequences are similar, it's possible to compare them more rapidly. What does “similar” imply? That their lengths must be nearly the same. Why is this true?sim s ,t  ≤ cmatch⋅min m , n  cindel⋅∣m − n∣Say costs are constant: cmatch (> 0) is cost of match and cindel (< 0) is cost of insertion/deletion. What do we know?In other words, maximum possible similarity drops when m is much different from n. This makes sense because we have to delete or insert additional symbols to make lengths correspond.CSE 397-497: Computational Issues in Molecular BiologyLopresti · Spring 2004 · Lecture 6- 6 -A faster method for comparing similar sequencesFor simplicity, let's say lengths of s and t are identical. Hence, dynamic programming matrix is a perfect square:string tstring soptimal pathstarts hereoptimal pathends heremain diagonalEvery time the path takes a step away from the main diagonal (e.g., via a deletion), there must be a corresponding step towards the diagonal somewhere (e.g., an offsetting insertion).The total cost of this step is 2 · cindel. We know where the optimal path must start and where it must end.What could happen in between?CSE 397-497: Computational Issues in Molecular BiologyLopresti · Spring 2004 · Lecture 6- 7 -If two sequences are nearly identical (5), what might optimal path look like?(There can be at most one indel pair; one possible path shown.)If two sequences are identical (8), what must optimal path look like?A faster method for comparing similar sequencesFor simplicity, let's say that cmatch = 1 and cindel = -1.Then the maximum possible similarity between two sequences of length 8 is _____.8-1-1CSE 397-497: Computational Issues in Molecular BiologyLopresti · Spring 2004 · Lecture 6- 8 -A faster method for comparing similar sequencesif optimal path stays near maindiagonal, we get same optimal resultNotice what's happening: if optimal path takes route near main diagonal, we only need to compute shaded cells.if optimal path strays far from main diagonal, we get suboptimal resultCSE 397-497: Computational Issues in Molecular BiologyLopresti · Spring 2004 · Lecture 6- 9 -A faster method for comparing similar sequencesa [i , j ] = max{a [i−1, j ]cdels[i]a [i , j−1]cinst [ j]a [i−1, j−1]csubs[i ] ,t [ j ]a [i , j] = max{a [i−1, j ]cdels[i]a [i−1, j−1]csubs[i] ,t [ j]CSE 397-497: Computational Issues in Molecular BiologyLopresti · Spring 2004 · Lecture 6- 10 -A faster method for comparing similar sequencesSay we compute in band k around main diagonal (inner loop of recurrence becomes -k  i – j  k), then time complexity is O(kn).kkHow do we know if we actually found an optimal alignment?cmatch⋅n−k 1  cindel⋅2k1Well, we know the algorithm is exact if the maximum number of indel pairs is at most k, so if the score we get is greater than any score we could get for (k+1) or more indel pairs, we win:Provided score is greater than this value, we know it's optimal.CSE 397-497: Computational Issues in Molecular BiologyLopresti · Spring 2004 · Lecture 6- 11 -A faster method for comparing similar sequencesHow are we supposed to know right value of k to start off?Try k = 1 and see if we found an optimal path ...... if not, try k = 2, 4, 8, 16, etc.Time complexity is n + 2n + 4n + ... + kn  2kn.I.e., it doesn't hurt not knowing final k to begin with.Upper bound on k is difference between maximum possible score and true optimal score (i.e., distance off


View Full Document

LEHIGH CSE 397 - Computational Issues

Download Computational Issues
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Computational Issues and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Computational Issues 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?