Unformatted text preview:

1Applications of Suffix TreesDr. Amar MukherjeeCAP 5937 – ST: BioinformaticsUniversity of central Florida7.1 Exact String Matching Three important variants: Both P (|P|=n) and T (|T|=m) are known:  Suffix tree method achieves same worst-case bound O(n+m) as KMP or BM. T is fixed and build suffix tree, then P is input: k: number of occurrences of P Using suffix tree: O(n+k) In contrast ( preprocess P): O(n+m) for any single P P is fixed, then T is input Select KMP or BM rather than suffix tree.27.2 Exact Set Matching Both Aho-Corasick and Suffix methods find all occurrences of P in T in O(n+m+k). But have preference by case. Comparison: AC: build keyword tree: size O(n), time O(n). When set of patterns is larger than T, suffix tree approach uses less space, but more time to search. E.g. molecular biology, where pattern library is large. When total size of patterns is smaller than T, AC method use less space. But suffix tree uses less time. Neither method is superior in time and space. One case where suffix tree is better, see Application 8. Time/space trade-off remains, but suffix tree can be used for chosen time/space combinations, whereas no choice for keyword tree.37.3 Substring Problem for a Database of Patterns The most interesting version: A set of string, or a database, is known and fixed. A sequence of strings will be presented. For each presented string S, find all the strings in the database containing S as a substring. The total length of all the strings, m, in the database is assumed to be large. In the context of genomic DNA data, the problem of finding substring cannot be solved by exact set matching. Suffix tree solution: A generalized suffix tree is built for database in O(m) time and O(m) space.  Any single string S of length n can be found or declared not to be there in O(n) time. If S matches a path in the tree. A full string is in S iff matching path reaches a leaf when last symbol of S is examined. Find all occurrences containing S as substring in O(n+k) time by traversing subtree below where S is found.47.4. Longest Common Substring for Two Strings Different from Longest Common Subsequenceproblem. E.g. S1: superiorcalifornialivers S2: sealiver Longest common substring: alive Longest common substring of two strings can be found in linear time using a generalized suffix tree. Find the node with the greatest depth that is marked both 1 and 2. Linear construction time Node marking and calculation of string depth can be done by standard linear tree traversal methods. 1970: Don Knuth conjectured that a linear time algorithm would be impossible.7.5 Recognizing DNA Contamination Given a string S1( the newly isolated and sequenced string of DNA) and a known string S2 ( the combined sources of possible contamination), find all substrings of S2 that occur in S1 and are longer than some given length l. These substrings are candidates of unwanted pieces of S2 that have contaminated the desired DNA string.5Finding common substrings Can be solved in linear time by extending the longest common substring of two strings. Build a generalized suffix tree for S1 and S2. Mark each internal node that has in its subtree a leaf representing a suffix of S1 and also a leaf representing a suffix of S2. Report all marked nodes that have string depth of l or greater.7.6 Common Substrings of more than two sequencesGiven a set of strings, find substrings that are common to a large number of those strings.Formal statement:Given: K strings whose lengths sum to n For each k between 2 and K, we define l(k) to be the length of the longest substring common to at least k of the strings.Example: Strings - {sandollar, sandlot, handler, grand, pantry}l(2) – 4, l(3) – 3, l(4) – 3, l(5) – 267.6: Linear-time Solution Build generalized suffix tree T for all the input strings. For every internal node v of T, define c(v) to be the number of distinct string identifiers that appear at the leaves in the subtree of v.[ It is easy to compute the number of leaf nodes under v; but computing c(v) is complicated by the fact that more than two leaves may have same identifier.] l(k) is the depth of the deepest node v such that c(v) ≥ k7.6: Complexity Counting the number of leaves under an internal node v does not give c(v). Therefore, each internal node v maintains a K-length bit vector. Bit i in the vector is set to 1 if there is at least one leaf under v belonging to string i.  The bit vector for an internal node v can be obtained ORing the bit-vectors of all the children of v. Since there are O(n) edges in the tree, the time needed will be O(Kn) There is a O(n) solution. See Chapter 9.77.10 All-pairs Suffix-Prefix MatchingDefinition:Given two strings Siand Sj, any suffix of Sithat matches a prefix of Sjis called a suffix-prefix match of Si, Sj.Given a collection of strings S = S1, S2,…Sk, the all-pairs suffix-prefix problem is the problem of finding, for each ordered pair Si, Sjin S, the longestsuffix-prefix match of Si, Sj.Motivation: Approximate methods for the shortest superstringproblem.7.10: linear time solution We call an edge terminal edge if it is labeled with only a string termination symbol. Solution: Build a generalized suffix tree T(S) for the kstrings in S. Build list L(v) for each internal node v L(v) contains index i if a terminal edge labeled i is incident on v The deepest node v on the path to leaf j such that i ∈L(v) identifies that longest match between a suffix of Siand a prefix of Si.87.10 (continued…) Traverse T(S) in a depth-first manner Maintain k stacks, one for each string When a node v is reached in forward direction, push v on to the ith stack, for each i ∈L(v). When a leaf j corresponding to the entire string Sjis reached, scan the k stacks and record the current top of each stack. When the depth-first traversal backs up past a node v, pop the top of any stack whose index is in L(v) Complexity:O(m+k2)Importance of Repetitive Structures in molecular strings Over 50% of human genome consists of repeats. Complimentary palindromes regulate transcription (by forming hair-pin loops) Clustered genes that code for similar proteins Pseudogenes Restriction enzyme cutting sites Tandem repeats and tandem arrays9Uses of repetitive structures Genetic mapping


View Full Document

UCF CAP 5937 - Applications of suffix trees

Documents in this Course
Load more
Download Applications of suffix trees
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Applications of suffix trees and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Applications of suffix trees 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?