UT CS 337 - String Matching: Knuth-Morris-Pratt Algorithm - D115162

Home> Schools> University of Texas at Austin> Computer Science (CS) > CS 337> String Matching: Knuth-Morris-Pratt Algorithm

DOC PREVIEW

UT CS 337 - String Matching: Knuth-Morris-Pratt Algorithm

School name University of Texas at Austin

Course Cs 337- Theory in Programming Practice

Pages 15

This preview shows page 1-2-3-4-5 out of 15 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 15 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 15 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 15 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 15 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 15 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 15 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

String Matching: Knuth-Morris-PrattAlgorithmGreg PlaxtonTheory in Programming Practice, Spring 2005Department of Computer ScienceUniversity of Texas at AustinSome Notation• We index the symbols in a string starting at 0• For any string s, let s denote the length of s• For any string s and integer i such that 0 ≤ i < s, let s[i] denote thesymbol of s with index i• For any string s and integers i and j such that 0 ≤ i < s and i ≤ j ≤ s,s[i..j] denotes the (possibly empty) substring of s starting at index iand ending just before j– s[2..4] is the two-symbol string s[2]s[3]– s[2..2] is the empty string– s[0..s] = sTheory in Programming Practice, Plaxton, Spring 2005The (Exact) String Matching Problem• Given a text string t and a pattern string p, find all occurrences of p intTheory in Programming Practice, Plaxton, Spring 2005Three Efficient String Matching Algorithms• Rabin-Karp– This is a simple randomized algorithm that tends to run in lineartime in most scenarios of practical interest– The worst case running time is as bad as that of the naive algorithm,i.e., Θ(p · t)• Knuth-Morris-Pratt (this lecture and the next)– The worst case running time of this algorithm is linear, i.e., O(p + t)• Boyer-Moore– This algorithm tends to have the best performance in practice, as itoften runs in sublinear time– The worst case running time is as bad as that of the naive algorithmTheory in Programming Practice, Plaxton, Spring 2005The KMP String Matching Algorithm: Plan• We maintain two indices, ` and r, into the text string• We iteratively update these indices and detect matches such that thefollowing loop invariant is maintained– KMP Invariant: ` ≤ r, t[`..r] = p[0..r − `], and all occurrences ofthe pattern p starting prior to ` in the text t have been detected• We ensure that the invariant holds initially by setting ` and r to zero• Remark: We will see later that the algorithm also requires apreprocessing phase involving only the pattern string pTheory in Programming Practice, Plaxton, Spring 2005Achieving Linear Time Complexity: The Plan• The algorithm performs only a constant amount of computation ineach iteration• The algorithm never decreases ` or r• In each iteration, either ` or r is increased• Note that the indices ` and r are at most t• By the KMP invariant, all matches have been detected once ` reachest, so we can terminate at that point• The preprocessing phase, which involves only p, runs in O(p) timeTheory in Programming Practice, Plaxton, Spring 2005KMP Iteration• Let’s see how to define an iteration of the KMP loop• Assume the KMP invariant holds at the beginning of the iteration• Since the loop has not terminated, ` < t• We’d like to increase ` or r, while maintaining the invariant• There are two cases to consider– Case 1: 0 ≤ r − ` < p, i.e., we do not yet know whether there is amatch starting at index `– Case 2: r − ` = p, i.e., we have found a match starting at index `Theory in Programming Practice, Plaxton, Spring 2005Case 1: 0 ≤ r − ` < p• Case 1.1: t[r] = p[r − `]– We’ve matched another symbol; increment r• Case 1.2: r = ` and t[r] 6= p[r − `]– Our current match is the empty string and the next symbol does notmatch p[0]; increment ` and r• Case 1.3: r > ` and t[r] 6= p[r − `]– Our current match is a nonempty proper prefix of p and the nextsymbol does not extend this match– How should we update ` and r in this remaining subcase?Theory in Programming Practice, Plaxton, Spring 2005Case 1.3: 0 ≤ r − ` < p, r > `, and t[r] 6= p[r − `]• Our current match u is a nonempty proper prefix of p and the nextsymbol does not extend this match• We cannot set ` to r because we might skip over one or more matches– Example: Suppose p is axbcyaxbts and we’ve already matchedaxbcyaxb, but the next symbol is not t– In this example, we advance ` by 5• In general, we advance ` by the smallest k > 0 such that the suffixv = u[k..u] of u is a prefix of p• Note that v is simply the longest string that is both a proper prefix anda proper suffix of u– This string is called the core of u, denoted c(u)– Later we will discuss how the KMP algorithm computes such coresTheory in Programming Practice, Plaxton, Spring 2005Case 2: r − ` = p• We output that a match exists starting at index `• How do we update ` and r?• Note that this case is very similar to Case 1.3 treated earlier• We increase ` by p − c(p)Theory in Programming Practice, Plaxton, Spring 2005Core Computation• It remains only to describe how the KMP algorithm computes the coresrequired in Cases 1.3 and 2• Recall that each iteration of KMP is supposed to run in a constantnumber of operations• How can we hope to compute the core of a string in constant time?Theory in Programming Practice, Plaxton, Spring 2005KMP Core Computation: A Key Observation• Note that in Case 1.3 we need to compute the core of some properprefix of p, while in Case 2 we need to compute the core of p• Thus, if we precompute the core of every prefix of p, we will be able toexecute each iteration of the KMP loop in constant time• It remains to prove that we can compute the core of every prefix of pin O(p) timeTheory in Programming Practice, Plaxton, Spring 2005Some Properties of Core• Let u ¹ v mean that u is both a prefix and a suffix of v– For any string u, ² ¹ u– The ¹ relation is a partial order• Let u ≺ v denote u ¹ v and u 6= v• The core c(v) of a string v is the unique string such that for all stringsuu ¹ c(v) ≡ u ≺ v– It follows, by replacing u with c(v), that c(v) ≺ v and hence c(v) < v• Let c0(u) denote u and for any i ≥ 0 such that ci(v) is a nonemptystring, let ci+1(u) denote c(ci(u))Theory in Programming Practice, Plaxton, Spring 2005A Key Property• Claim: For any u and v, u ¹ v ≡ h∃i : 0 ≤ i : u = ci(v)i• The proof is by induction on the length of v• Base case (v = 0):u ¹ v≡ {v = 0, i.e., v = ²}u = ² ∧ v = ²≡ {definition of c0: v = ² ⇒ ci(v) is defined for i = 0 only}h∃i : 0 ≤ i : u = ci(v)iTheory in Programming Practice, Plaxton, Spring 2005Induction Step: v = n + 1, n ≥ 0u ¹ v≡ {definition of ¹}u = v ∨ u ≺ v≡ {definition of core}u = v ∨ u ¹ c(v)≡ {c(v) < v; induction hypothesis on second term}u = v ∨ h∃i : 0 ≤ i : u = ci(c(v))i≡ {rewrite}u = c0(v) ∨ h∃i : 0 < i : u = ci(v)i≡ {rewrite}h∃i : 0 ≤ i : u =

View Full Document