DOC PREVIEW
CMU CS 15499 - Algorithms and Applications

This preview shows page 1-2-3-27-28-29 out of 29 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 29 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 29 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 29 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 29 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 29 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 29 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 29 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

15-499: Algorithms and ApplicationsExact String MatchingKnuth-Morris-Pratt algorithmSlide 4Slide 5Preprocessing the patternSlide 7Suffix TreesSlide 9Constructing Suffix treesSlide 11Slide 12Ukkonen’s linear-time algorithmGoing from Ti to Ti+1Idea #1 : Suffix LinksSlide 16Suffix Links – Bounding the timeMaintaining Suffix LinksGoing from O(m2) to O(m)Idea #2 : Getting rid of Rule 3Idea #3 : Fast-forwarding Rules 1 & 2Slide 22An exampleComplexity AnalysisExtending to multiple listsMultiple lists – Better algorithmLongest Common SubstringCommon substrings of M stringsLempel-Ziv compression15-499 Page 115-499: Algorithms and ApplicationsComputational Biology – VI - Exact string matching - Suffix Trees - Other apps15-499 Page 2Exact String Matching•Given a text T and pattern P•“Quickly” find an occurrence (or all occurrences) of P in T•A Naïve solution:Compare P with T[i…i+n] for all i --- O(nm) time•How about O(n+m) time?15-499 Page 3Knuth-Morris-Pratt algorithm•An O(n) preprocessing and O(m) time solution•Want to search sequentially through T spending O(1) time on every characterT x y a b c x a b c x a b c d e f ea b c x a b c d eP115-499 Page 4Knuth-Morris-Pratt algorithm•An O(n) preprocessing and O(m) time solution•Want to search sequentially through T spending O(1) time on every characterT x y a b c x a b c x a b c d e f eP a b c x a b c d e1 215-499 Page 5Knuth-Morris-Pratt algorithm•An O(n) preprocessing and O(m) time solution•Want to search sequentially through T spending O(1) time on every characterT x y a b c x a b c x a b c d e f eThe match fails. Where should we restart?P a b c x a b c d e1 2 3 4 5 6 7 8 910Ideally restart from the 4th position of the pattern15-499 Page 6Preprocessing the pattern•l(i) – the largest integer such that P[ i-l(i) .. i-1 ] = P[ 0 .. l(i)-1 ]•Thus l(7) = 1 and l(8) = 2•Can compute these values in linear time by going through the characters sequentially.a b c x a b c d e0 0 0 0 0 1 2 3 0•At each position store the index of the position where we can resume the matching15-499 Page 7Knuth-Morris-Pratt algorithm•O(n+m) algorithm – O(n) preprocessing + O(m) matching•returns all occurrences•What if we want to search for many patterns in the same text?•O(n+m) time per pattern => O(km) time•Can we do it faster? In O(m+kn) time?15-499 Page 8Suffix Trees•Preprocess the text in O(m) time and search in O(n) time•Idea: –Construct a tree containing all suffixes of text along the paths from the root to the leaves–For search, just follow the appropriate path15-499 Page 9Suffix Treesx a b x a ccabaxccccaxbA suffix tree for the string x a b x a c365241Search for the string a b x15-499 Page 10Constructing Suffix trees•Naive O(m2) algo•For every i, add the suffix S[i .. m] to the current treecaxb3abaxc2x a b x a c115-499 Page 11Constructing Suffix trees•Naive O(m2) algo•For every i, add the suffix S[i .. m] to the current treex a b x a ccabaxccaxb324115-499 Page 12Constructing Suffix trees•Naive O(m2) algo•For every i, add the suffix S[i .. m] to the current treex a b x a ccabaxcccaxb3c6524115-499 Page 13Ukkonen’s linear-time algorithm•We will start with an O(m3) algorithm and then give a series of improvements•In stage i, we construct a suffix tree Ti for S[1..i]•Incrementing Ti to Ti+1 naively takes O(i2) time because we insert each of the i suffixes•Thus a total of O(m3) time15-499 Page 14Going from Ti to Ti+1•In the jth substage of stage i+1, we insert S[j..i+1] into Ti. Let S[j..i] = •Three cases–Rule 1: The path  ends on a leaf ) add S[i+1] to the label of the last edge–Rule 2: The path  continues with characters other than S[i+1] ) create a new leaf node and split the path labeled –Rule 3: A path labeled  S[i+1] already exists ) do nothing.15-499 Page 15Idea #1 : Suffix Links•In each substage, we first search for some string in the tree and then insert a new node/edge/label•Can we speed up looking for strings in the tree?•In any substage, we look for a suffix of the strings searched in previous substages•Idea: Put a pointer from an internal node labeled x to the node labeled •Such a link is called a “Suffix Link”15-499 Page 16Idea #1 : Suffix Linksx a b x a ccabaxcccaxbcdSP 1Add the letter ddSP 2dSP 3dSP 4dSP 5dSP 615-499 Page 17Suffix Links – Bounding the time•Steps in each substage–Go up 1 link to the nearest internal node–Follow a suffix link to the suffix node–Follow the path for the remaining string•First two steps together make up O(m) in each stage•The third step follows only as many links as the length of the string S[1 .. i]•Thus the total time per stage is O(m)15-499 Page 18Maintaining Suffix Links•Whenever a node labeled x is created, in the following substage a node labeled  is created.Why?•When a new node is created, add a suffix link from it to the root, and if required, add a suffix link from its predecessor to it.15-499 Page 19Going from O(m2) to O(m)•Size of the tree itself can be O(m2)•Can we even hope to do better than O(m2)?•But notice that there are only 2m edges! – Why?•Idea: represent labels of edges as intervals•Can easily modify the entire process to work on intervals15-499 Page 20Idea #2 : Getting rid of Rule 3•Recall Rule 3: A path labeled S[j .. i+1] already exists ) do nothing.•If S[j .. i+1] already exists, then S[j+1 .. i+1] exists too and we will again apply Rule 3 in the next substage•Whenever we encounter Rule 3, this stage is over – skip to the next stage.15-499 Page 21Idea #3 : Fast-forwarding Rules 1 & 2•Rule 1 applies whenever a path ends in a leaf•Note that a leaf node always stays a leaf node – the only change is to append the new character to its edge using Rule 1•An application of Rule 2 in substage j creates a new leaf node This node is then accessed using Rule 1 in substage j in all the following stages15-499 Page 22Idea #3 : Fast-forwarding Rules 1 & 2•Fast-forward Rule 1 and 2–Whenever Rule 2 creates a node, instead of labeling the last edge with only one character, implicitly label it with the entire remaining suffix•Each leaf edge is labeled only once!15-499 Page 23An examplexcabbcbcx a b x a caxxxaaacccLeaf edge labels are updated by using a variable to denote the end of the interval14256315-499 Page 24Complexity Analysis•Rule 3 is used only once in every stage•For every


View Full Document

CMU CS 15499 - Algorithms and Applications

Documents in this Course
Load more
Download Algorithms and Applications
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Algorithms and Applications and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Algorithms and Applications 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?