CMSC423: Bioinformatic Algorithms, Databases and Tools
Lecture 7: Exact string matching, suffix trees, suffix arrays
CMSC423 Fall 2008

Basic idea: 1-D dynamic programming

Can Z[i] be computed with the help of Z[j] for j < i?

[Figure: a Z-box of length Z[j] starting at j and containing position i; the matching prefix position is i-j+1]

Assume there exists j < i such that j + Z[j] - 1 >= i, i.e. position i lies inside the Z-box that starts at j. Then Z[i-j+1] provides information about Z[i]. (Coordinates are 1-based, so i-j+1 is the prefix position aligned with i.)
If there is no such j, simply compare the characters starting at T[i] with the start of T, since they have not been seen before.

Three cases

Let j < i be the coordinate that maximizes j + Z[j] - 1 (intuitively, the Z[j] that extends the furthest). From position i onward, Z[j] - (i-j) characters of that Z-box remain.

I.   Z[i-j+1] < Z[j] - (i-j): Z[i] = Z[i-j+1]
II.  Z[i-j+1] > Z[j] - (i-j): Z[i] = Z[j] - (i-j)
III. Z[i-j+1] = Z[j] - (i-j): Z[i] is found by comparing onward from position i + Z[i-j+1], the first character past the Z-box

[Figures: one diagram per case, showing the prefix, the Z-box at j, positions i-j+1 and i, and the mismatching characters]

Time complexity analysis

Why do these tricks save us time?
1. Cases I and II take constant time per Z value computed, so the total time spent in these cases is O(n).
2. Case III might involve one or more comparisons per Z value. However, every successful comparison (match) shifts the rightmost character that has been visited, and every unsuccessful comparison terminates the round, after which the algorithm moves on to the next Z value. There are therefore at most n matches and at most one mismatch per Z value, so the total time spent in case III is O(n).
Overall running time is O(n).

Space complexity

If using the Z algorithm for matching, how many Z values do we need to store?

PPPPPPPPPP$TTTTTTTTTTTTTTTTTTTTTTTT

We only need to remember the Z values for the pattern and the farthest-reaching Z value (Z[j] in what we discussed before).

Z algorithm: not just for matching

Lempel-Ziv compression (e.g. gzip)

[Figure: text of length n with a match of length Z[i] covering positions i to i+Z[i]-1]

If Z[i] = 0, just send/store the character T[i].
Otherwise, instead of sending T[i..i+Z[i]-1] (Z[i] characters/bytes), simply send Z[i] (one number).
Note: other exact matching algorithms are also used for data compression, e.g. the Burrows-Wheeler transform, which relates to suffix arrays.

Knuth-Morris-Pratt algorithm

Given a Pattern and a Text, preprocess the Pattern to compute sp[i] = length of the longest proper prefix of P that matches a suffix of P[0..i].

[Figure: P aligned under T, with a mismatch after position i in P and position j in T, and the prefix of length sp[i] marked]

Compare P with T until finding a mismatch (at coordinate i+1 in P and j+1 in T).
Shift P such that its first sp[i] characters match T[j-sp[i]+1..j].
Continue matching from T[j+1] and P[sp[i]+1].

Boyer-Moore algorithm

Preprocess the pattern (of length n), computing for every i: L[i] = the largest coordinate < n such that P[i..n] matches a suffix of P[1..L[i]] (an inverted Z function).

[Figure: P aligned under T and compared right to left; the matched suffix P[i..n], its earlier copy ending at L[i], and the mismatching characters]

Match the pattern backwards, starting at the right, until a mismatch.
Shift the pattern so that the copy of P[i..n] ending at position L[i] lines up with the text just matched (a shift of n - L[i] positions).
Repeat.
Bad character rule: find the mismatched text character in P and shift until it matches.
Choose the longer of the shifts given by the suffix and character rules.

Suffix trees

Intro to suffix trees

Used in fast exact matching.
Basic idea: extend a trie (a structure for storing multiple strings).

[Figure: trie storing the words their, there, was, when]

Suffix tree

Extends the trie of all suffixes of a string:
1 ATCATG
2 TCATG
3 CATG
4 ATG
5 TG
6 G

[Figure: suffix tree of ATCATG; the root has edges AT, T, CATG and G, the AT and T nodes each branch to CATG and G, and leaves are numbered 1-6 by suffix start position]

Suffix tree (cont.)

To store the tree in linear space, just store a range in the sequence on each edge instead of the string.
To ensure all suffixes end at leaves, add a $ character at the end of the string: ATCATG$

[Figure: the same tree with edges labeled by ranges: $ 7..7, AT 1..2, CATG$ 3..7, T 2..2, G$ 6..7]

Suffix links

Link every node labeled aS (a single character a followed by some string S) to the node labeled S (note: it always exists).

[Figure: suffix tree of ATCATG$ with suffix links added]

Suffix trees for matching

Suffix trees use O(n) space.
Suffix trees can be constructed in O(n) time.
Is CAT part of ATCATG?
Match from the root, character by character. If we run out of query, we found a match; otherwise there is no match.
Intuition: CAT is the prefix of some suffix.

[Figure: suffix tree of ATCATG$ with the path spelling CAT highlighted]

Suffix links: useful for substring matches

Does any part of AGATG match the string AGCAGT?

[Figure: suffix tree of AGCAGT$ with edges $ 7..7, AG 1..2, CAGT$ 3..7, G 2..2, T$ 6..7]
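
Returning to the Z algorithm from the first part of the lecture, the three cases can be sketched in Python roughly as follows. This is a minimal illustration rather than the course's reference code. It uses 0-based indexing (the slides use 1-based coordinates), so the prefix position aligned with i is i - j rather than i-j+1. The names z_array, find_all, remaining and the separator argument sep are hypothetical choices made for this sketch, and find_all assumes sep occurs in neither the pattern nor the text.

```python
def z_array(s):
    """Z[i] = length of the longest substring of s starting at i that
    matches a prefix of s (0-based; Z[0] is set to len(s) by convention)."""
    n = len(s)
    z = [0] * n
    if n == 0:
        return z
    z[0] = n
    j = 0  # start of the Z-box that reaches furthest to the right
    r = 0  # one past the last position covered by that box (j + z[j])
    for i in range(1, n):
        length = 0
        if i < r:
            k = i - j          # prefix position aligned with i
            remaining = r - i  # characters left in the box from i onward
            if z[k] < remaining:
                z[i] = z[k]        # case I: answer is fully inside the box
                continue
            if z[k] > remaining:
                z[i] = remaining   # case II: match stops at the box boundary
                continue
            length = remaining     # case III: keep comparing past the box
        # Explicit comparisons; each match moves the right boundary forward.
        while i + length < n and s[i + length] == s[length]:
            length += 1
        z[i] = length
        if i + length > r:
            j, r = i, i + length
    return z


def find_all(pattern, text, sep="\x00"):
    """Return 0-based start positions of pattern in text.

    Assumption of this sketch: sep appears in neither string."""
    m = len(pattern)
    if m == 0:
        return []
    z = z_array(pattern + sep + text)
    # A Z value equal to m means the whole pattern matches at that position.
    return [i - m - 1 for i in range(m + 1, len(z)) if z[i] == m]


if __name__ == "__main__":
    print(z_array("aabcaabxaaaz"))
    print(find_all("AT", "ATCATG"))
```

For simplicity the sketch stores every Z value of the concatenated string; as the space complexity slide points out, matching only needs the Z values of the pattern plus the furthest-reaching box.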