6.897: Advanced Data Structures                                Spring 2003
Lecture 10 — March 31, 2003                             Prof. Erik Demaine
Scribe: Michael Walfish

1 Overview

In this lecture we study string matching and related problems. There are several famous algorithms for string matching, and we briefly review some of them at the beginning of the lecture. Since this is a data structures course, we will focus on an approach that involves pre-processing the text to build a very useful data structure: the suffix tree.

After showing how to use suffix trees to solve string matching, we show how to use this data structure to solve a related problem: document retrieval. In order to solve document retrieval, we need to solve other algorithmic and data structures problems, including the Range Minimum Query (RMQ) problem and the Least Common Ancestor (LCA) problem.

The lecture concludes with the application of LCA to suffix trees to solve several problems related to string matching.

2 String Matching – Problem Definition

We can state this classic problem in algorithmic terms as follows: given a text T and a pattern P, each of which is a string over an alphabet Σ, find all or some occurrences of P in T, or declare that there are no matches.

Before we turn to suffix trees, a solution with a heavy data structures component, we briefly review some classic solutions to string matching.

3 String Matching – Classic Algorithmic Solutions

There are a number of existing solutions that range from the brute-force deterministic to the very elegant randomized.

3.1 Brute Force

This is the obvious approach: begin at the first position in T and see whether P matches by indexing through both strings. If P doesn't match, then look at the second position of T, reset the index in P, and check whether P matches by indexing through both strings again. The running time of this approach is O(|T| · |P|).
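The scan just described can be sketched as follows (a minimal sketch; the function name brute_force_match is ours, not from the lecture):

```python
def brute_force_match(text, pattern):
    """Return the starting indices of every occurrence of pattern in text."""
    matches = []
    for start in range(len(text) - len(pattern) + 1):
        # Index through both strings from this starting position.
        if all(text[start + j] == pattern[j] for j in range(len(pattern))):
            matches.append(start)
    return matches
```

Each of the up-to-|T| starting positions can cost up to |P| comparisons, which is exactly the O(|T| · |P|) bound above.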
If |P| is not constant (if |P| is of the same order as |T|, for example), then this approach can become quadratic.

Knowing that we cannot do better than linear time (since the "needle" could be anywhere in the "haystack"), we are motivated to search for linear solutions.

3.2 Knuth-Morris-Pratt [KMP77]

This algorithm runs in O(|T|), which is the best one could hope for in order-complexity terms. The essence of the algorithm is that P is pre-processed into a finite automaton. The algorithm is complicated, and we won't cover it here, but students are encouraged to look it up if they have never seen it before.

3.3 Boyer-Moore [BM77]

This algorithm runs in O(|T|) in the worst case but can run in as little as O(|T|/|P|) when things work out. The idea is to keep a sliding window onto T but to try matching from the back of P to the front. The algorithm looks at the letters that do match to figure out how much to advance the sliding window; when the back of P does not match, it may in some cases be possible to advance as many as |P| positions forward. An amortized analysis yields the running times just mentioned.

3.4 Rabin-Karp [KR87]

This is an elegant randomized algorithm that runs in O(|T|) with high probability (usually abbreviated "whp" and defined below). The idea is to make a numerical fingerprint of P, where the fingerprint is taken modulo a prime p. The algorithm maintains a sliding window on T and compares the numeric fingerprint of the current window to the pre-computed fingerprint of P. If these fingerprints match, the algorithm declares a match. Updating the numeric fingerprint when the window slides takes constant time, so the entire algorithm is O(|T|). Failure occurs if the two fingerprints are equal modulo the prime but in fact represent different strings.

Why does it make sense to use a randomized algorithm if there are deterministic ones that provide the same order complexity with zero probability of failure?
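To see how light the machinery is, here is a toy sketch of the rolling fingerprint just described. The base and the fixed prime are our illustrative choices; in the actual algorithm the prime is chosen randomly, which is what yields the whp guarantee.

```python
def rabin_karp(text, pattern, prime=1_000_000_007, base=256):
    """Report matches of pattern in text by comparing rolling fingerprints.

    A match is declared whenever the window's fingerprint equals the
    pattern's; with a fixed prime this can (rarely) be a false positive.
    """
    m, n = len(pattern), len(text)
    if m > n:
        return []
    # Fingerprints of the pattern and of the first window of the text.
    p_fp = t_fp = 0
    for i in range(m):
        p_fp = (p_fp * base + ord(pattern[i])) % prime
        t_fp = (t_fp * base + ord(text[i])) % prime
    high = pow(base, m - 1, prime)  # weight of the window's leading character
    matches = []
    for start in range(n - m + 1):
        if t_fp == p_fp:
            matches.append(start)  # fingerprints agree: declare a match
        if start + m < n:
            # Slide the window one position in O(1): drop the leading
            # character, shift, and append the next character.
            t_fp = ((t_fp - ord(text[start]) * high) * base
                    + ord(text[start + m])) % prime
    return matches
```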
The reason is that the randomized one is often easier both to implement and to reason about, and, if its running time can be stated to be "whp", then the chance of failure is negligible.

A digression, to define "with high probability": an event E is said to occur with high probability (whp) if

    Pr{E} ≥ 1 − 1/n^c

for all c ≥ 1, where n is the problem size. In our case, E is the event "the running time of Rabin-Karp is O(|T|)", n is |T| + |P|, and c should be able to take on any value without changing the order complexity (at worst, different values of c should imply different constants absorbed by the O(·)). In the case of Rabin-Karp, the constant c determines the size of the prime modulus p over which the algorithm takes fingerprints (the larger the prime, the lower the probability that two fingerprints are confused modulo that prime). Different values of c do not affect the running time or the order complexity of the space required to run the algorithm. In our case, E failing means that the algorithm either returns a wrong answer or does not finish on time.

The important thing about "whp" is that it can mean "as likely as you want", since the c term can be whatever you want.

4 String Matching – Data Structures Approach

The above solutions all assumed that we could pre-process P but not T. We will now alter that assumption and permit ourselves to pre-process T. Our approach relies on suffix trees, a powerful and flexible data structure on which we will spend a good portion of the lecture. Below, we describe what a suffix tree is, how it can be used for string matching, and its utility in several other problems, including document retrieval.

4.1 Suffix Trees – Description

Suffix trees have come up a number of times in the literature [Wei73, McC76, Ukk95, GK97]. A suffix tree is a kind of trie. Let's first review tries.

Description 1.
A trie is a tree in which each node has children labeled by letters in Σ.

    [Figure 1: Trie storing strings aa and ab]

There is a very natural correspondence between tries and sets of strings over Σ, as shown in Figure 1. To avoid ambiguities that result from situations in which one of the members of the set is a prefix of another member of the set, we introduce a special symbol not in Σ: $. So we would represent the set {a$, aa$} with a trie as in Figure 2.

Note that every leaf is connected to the rest of the trie by the $ symbol. Note, also, that this approach is wasteful: if a path does not branch, then we can store it as a single edge. This leads us to another description:

Description 2. A compressed trie is a trie in which non-branching paths are stored as single edges, each labeled by the string of letters along the path.
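The plain trie with the $ terminator can be sketched with nested dictionaries (our own minimal sketch, not from the lecture; the helper names are ours). The terminator is what resolves the prefix ambiguity: with it, storing a does not falsely imply membership of its extensions.

```python
TERMINATOR = "$"  # special end-of-string symbol, assumed not to occur in Σ

def trie_insert(root, word):
    """Insert word into a trie of nested dicts, appending the terminator."""
    node = root
    for letter in word + TERMINATOR:
        node = node.setdefault(letter, {})

def trie_contains(root, word):
    """True iff word itself was inserted (not merely a prefix of a member)."""
    node = root
    for letter in word + TERMINATOR:
        if letter not in node:
            return False
        node = node[letter]
    return True

root = {}
for s in ["a", "aa"]:   # the set {a$, aa$} from the text
    trie_insert(root, s)
```

A compressed trie would go one step further and merge each non-branching path into a single string-labeled edge, which is the space saving Description 2 is after.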

