New version page

# UCF CAP 5937 - Applications of Suffix Trees

Pages: 28
Documents in this Course

6 pages

14 pages

80 pages

15 pages

10 pages

11 pages

15 pages

21 pages

11 pages

47 pages

23 pages

19 pages

2 pages

Unformatted text preview:

Applications of Suffix Trees7.1 Exact String Matching7.2 Exact Set MatchingSlide 47.3 Substring Problem for a Database of PatternsSlide 67.4. Longest Common Substring for Two Strings7.5 Recognizing DNA ContaminationFinding common substrings7.6 Common Substrings of more than two sequences7.6: Linear-time Solution7.6: Complexity7.10 All-pairs Suffix-Prefix Matching7.10: linear time solution7.10 (continued…)Importance of Repetitive Structures in molecular stringsUses of repetitive structuresFinding all maximal repetitive structuresMaximal repeated pairsMore definitionsUsing suffix trees to find maximal repeatsFinding maximal repeats: DefinitionsFinding left diverse nodes in linear timeFinding all maximal repeats in linear timeFinding Supermaximal repeats in linear timeFinding super-maximal repeatsFinding supermaximal repeats in linear time7.13: Circular string linearizationApplications of Suffix TreesDr. Amar MukherjeeCAP 5937 – ST: BioinformaticsUniversity of central Florida7.1 Exact String MatchingThree important variants:Both P (|P|=n) and T (|T|=m) are known: Suffix tree method achieves same worst-case bound O(n+m) as KMP or BM.T is fixed and build suffix tree, then P is input:k: number of occurrences of PUsing suffix tree: O(n+k)In contrast ( preprocess P): O(n+m) for any single PP is fixed, then T is inputSelect KMP or BM rather than suffix tree.7.2 Exact Set MatchingBoth Aho-Corasick and Suffix methods find all occurrences of P in T in O(n+m+k). But have preference by case.Comparison:AC: build keyword tree: size O(n), time O(n).When set of patterns is larger than T, suffix tree approach uses less space, but more time to search.E.g. molecular biology, where pattern library is large.When total size of patterns is smaller than T, AC method use less space. But suffix tree uses less time.Neither method is superior in time and space.One case where suffix tree is better, see Application 8.Time/space trade-off remains, but suffix tree can be used for chosen time/space combinations, whereas no choice for keyword tree.7.3 Substring Problem for a Database of PatternsThe most interesting version:A set of string, or a database, is known and fixed. A sequence of strings will be presented. For each presented string S, find all the strings in the database containing S as a substring.The total length of all the strings, m, in the database is assumed to be large.In the context of genomic DNA data, the problem of finding substring cannot be solved by exact set matching.Suffix tree solution:A generalized suffix tree is built for database in O(m) time and O(m) space. Any single string S of length n can be found or declared not to be there in O(n) time.If S matches a path in the tree.A full string is in S iff matching path reaches a leaf when last symbol of S is examined.Find all occurrences containing S as substring in O(n+k) time by traversing subtree below where S is found.7.4. Longest Common Substring for Two StringsDifferent from Longest Common Subsequence problem.E.g. S1: superiorcalifornialivers S2: sealiverLongest common substring: aliveLongest common substring of two strings can be found in linear time using a generalized suffix tree.Find the node with the greatest depth that is marked both 1 and 2.Linear construction timeNode marking and calculation of string depth can be done by standard linear tree traversal methods.1970: Don Knuth conjectured that a linear time algorithm would be impossible.7.5 Recognizing DNA ContaminationGiven a string S1( the newly isolated and sequenced string of DNA) and a known string S2 ( the combined sources of possible contamination), find all substrings of S2 that occur in S1 and are longer than some given length l. These substrings are candidates of unwanted pieces of S2 that have contaminated the desired DNA string.Finding common substringsCan be solved in linear time by extending the longest common substring of two strings.Build a generalized suffix tree for S1 and S2.Mark each internal node that has in its subtree a leaf representing a suffix of S1 and also a leaf representing a suffix of S2.Report all marked nodes that have string depth of l or greater.7.6 Common Substrings of more than two sequencesGiven a set of strings, find substrings that are common to a large number of those strings.Formal statement:Given: K strings whose lengths sum to nFor each k between 2 and K, we define l(k) to be the length of the longest substring common to at least k of the strings.Example: Strings - {sandollar, sandlot, handler, grand, pantry}l(2) – 4, l(3) – 3, l(4) – 3, l(5) – 27.6: Linear-time SolutionBuild generalized suffix tree T for all the input strings.For every internal node v of T, define c(v) to be the number of distinct string identifiers that appear at the leaves in the subtree of v.[ It is easy to compute the number of leaf nodes under v; but computing c(v) is complicated by the fact that more than two leaves may have same identifier.]l(k) is the depth of the deepest node v such that c(v)  k7.6: ComplexityCounting the number of leaves under an internal node v does not give c(v).Therefore, each internal node v maintains a K-length bit vector. Bit i in the vector is set to 1 if there is at least one leaf under v belonging to string i.  The bit vector for an internal node v can be obtained ORing the bit-vectors of all the children of v. Since there are O(n) edges in the tree, the time needed will be O(Kn)There is a O(n) solution. See Chapter 9.7.10 All-pairs Suffix-Prefix MatchingDefinition:Given two strings Si and Sj, any suffix of Si that matches a prefix of Sj is called a suffix-prefix match of Si, Sj.Given a collection of strings S = S1, S2,…Sk, the all-pairs suffix-prefix problem is the problem of finding, for each ordered pair Si, Sj in S, the longest suffix-prefix match of Si, Sj.Motivation:Approximate methods for the shortest superstring problem.7.10: linear time solutionWe call an edge terminal edge if it is labeled with only a string termination symbol.Solution:Build a generalized suffix tree T(S) for the k strings in S.Build list L(v) for each internal node vL(v) contains index i if a terminal edge labeled i is incident on vThe deepest node v on the path to leaf j such that i L(v) identifies that longest match between a suffix of Si and a prefix of Si.7.10

View Full Document