String Matching (Chap. 32)An example of string matchingNotation and terminologyGraphical Proof of Lemma 32.1Naïve string matchingProblem with naïve algorithmKnuth-Morris-Pratt (KMP) algorithmPrefix functionSlide 9Slide 10Slide 11Analysis of KMP algorithmString Matching (Chap. 32)•Given a pattern P[1..m] and a text T[1..n], find all occurrences of P in T. Both P and T belong to *.•P occurs with shift s (beginning at s+1): P[1]=T[s+1], P[2]=T[s+2],…,P[m]=T[s+m].•If so, call s is a valid shift, otherwise, an invalid shift.• Note: one occurrence begins within another one: P=abab, T=abcabababbc, P occurs at s=3 and s=5.An example of string matchingNotation and terminology•w is a prefix of x, if x=wy for some y*. Denoted as wx.•w is a suffix of x, if x=yw for some y*. Denoted as wx.•Lemma 32.1 (Overlapping shift lemma):–Suppose x,y,z and xz and yz, then if |x||y|, then xy; if |x| |y|, then yx; if |x| = |y|, then x=y.Graphical Proof of Lemma 32.1Naïve string matchingRunning time: O((n-m+1)m).Problem with naïve algorithm•Problem with Naïve algorithm:–Suppose p=ababc, T=cabababcd.•T: c a b a b a b c d•P: a …•P: a b a b c•P: a…•P: a b a b c•Whenever a character mismatch occurs after matching of several characters, the comparison begins by going back in T from the character which follows the last beginning character.•Can we do better: not go back in T?Knuth-Morris-Pratt (KMP) algorithm•Idea: after some character (such as q) matches of P with T and then a mismatch, the matched q characters allows us to determine immediately that certain shifts are invalid. So directly go to the shift which is potentially valid.•The matched characters in T are in fact a prefix of P, so just from P, it is OK to determine whether a shift is invalid or not. •Define a prefix function , which encapsulates the knowledge about how the pattern P matches against shifts of itself. :{1,2,…,m}{0,1,…,m-1}[q]=max{k: k<q and Pk Pq}, that is [q] is the length of the longest prefix of P that is a proper suffix of Pq.Prefix functionIf we precompute prefix function of P (against itself), then whenevera mismatch occurs, the prefix functioncan determine which shift(s) are invalidand directly ruled out. So move directlyto the shift which is potentially valid.However, there is no need to comparethese characters again since they areequal.Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.Analysis of KMP algorithm•The running time of COMPUTE-PREFIX-FUNCTION is (m) and KMP-MATCHER (m)+ (n).•Using amortized analysis (potential method) (for COMPUTE-PREFIX-FUNCTION):–Associate a potential of k with the current state k of the algorithm:–Consider codes in Line 5 to 9. –Initial potential is 0, line 6 decreases k since [k]<k, k never becomes negative.–Line 8 increases k at most 1.–Amortized cost = actual-cost + potential-increase–=(repeat-times-of-Line-5+O(1))+(potential-decrease-at-least the repeat-times-of-Line-5+O(1) in line
View Full Document