Algorithms in Java, 4th Edition·Robert Sedgewick and Kevin Wayne·Copyright © 2008·April 14, 2008 9:38:41 AMPattern MatchingReferences: Algorithms in C (2nd edition), Chapter 19 (pdf online) http://www.cs.princeton.edu/algs4/63long http://www.cs.princeton.edu/algs4/72regular‣exact pattern matching‣Knuth-Morris-Pratt‣RE pattern matching‣grep2‣exact pattern matching‣Knuth-Morris-Pratt‣RE pattern matching‣grep3Exact pattern matchingGoal. Find pattern of length M in a text stream of length N.Computer forensics. !Search memory or disk for signatures,e.g., all URLs or RSA keys that the user has entered.typically N >> Mpatterntextneedleinahaystackaneedleinahttp://citp.princeton.edu/memory4Applications•Parsers.•Spam filters. •Digital libraries.•Screen scrapers.•Word processors.•Web search engines.•Natural language processing.•Computational molecular biology.•Feature detection in digitized images.5Spam filteringIdentify patterns indicative of spam.•PROFITS •AMAZING •GUARANTEE •L0SE WE1GHT •herbal Viagra •There is no catch. •L0W M0RTGAGE RATES •This is a one-time mailing. •This message is sent in compliance with spam regulations. •You're getting this message because you registered with one of our marketing partners.6Screen scrapingGoal. Extract relevant data from web page.Ex. Find string delimited by <b> and </b> after first occurrence ofpattern Last Trade:.http://finance.yahoo.com/q?s=goog...<tr><td class= "yfnc_tablehead1"width= "48%">Last Trade:</td><td class= "yfnc_tabledata1"><big><b>452.92</b></big></td></tr><td class= "yfnc_tablehead1"width= "48%">Trade Time:</td><td class= "yfnc_tabledata1">...7Exact pattern matching in JavaThe method s.indexOf(pattern, offset) in Java's String library returns the index of the first occurrence of pattern in string s, starting at given offset.public class StockQuote{ public static void main(String[] args) { String name = "http://finance.yahoo.com/q?s="; In in = new In(name + args[0]); String input = in.readAll(); int start = input.indexOf("Last Trade:", 0); int from = input.indexOf("<b>", start); int to = input.indexOf("</b>", from); String price = input.substring(from + 3, to); StdOut.println(price); } }% java StockQuote goog452.92% java StockQuote msft28.152Check for pattern starting at each text position.8Brute-force exact pattern matchh a y n e e d s a n n e e d l e xn e e d l en e e d l en e e d l en e e d l en e e d l en e e d l en e e d l en e e d l en e e d l en e e d l en e e d l eCheck for pattern starting at each text position.public static int search(String pattern, String text){ int M = pattern.length(); int N = text.length(); for (int i = 0; i < N - M; i++) { int j; for (j = 0; j < M; j++) if (text.charAt(i+j) != pattern.charAt(j)) break; if (j == M) return i; } return -1; }9Brute-force exact pattern match: Java implementationindex in text where pattern startsnot foundBrute-force algorithm can be slow if text and pattern are repetitive.Worst case. ~ MN char compares.10Brute-force exact pattern match: worst casea a a a a a a a a a a a a a a a ba a a a a ba a a a a ba a a a a ba a a a a ba a a a a ba a a a a ba a a a a ba a a a a ba a a a a ba a a a a ba a a a a ba a a a a b11Algorithmic challenges in pattern matching Brute-force is not good enough for all applications.Theoretical challenge. Linear-time guarantee.Practical challenge. Avoid backup in text stream.often no room or time to save textfundamental algorithmic problemNow is the time for all people to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for many good people to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for a lot of good people to come to the aid of their party. Now is the time for all of the good people to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for each good person to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for all good Republicans to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for many or all good people to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for all good Democrats to come to the aid of their party. Now is the time for all people to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for many good people to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for a lot of good people to come to the aid of their party. Now is the time for all of the good people to come to the aid of their party. Now is the time for all good people to come to the aid of their attack at dawn party. Now is the time for each person to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for all good Republicans to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for many or all good people to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for all good Democrats to come to the aid of their party.12‣exact pattern matching‣Knuth-Morris-Pratt‣RE pattern matching‣grepKnuth-Morris-Pratt exact pattern-matching algorithmKMP. Classic algorithm that meets both challenges.•Linear-time guarantee.•No backup in text stream.Basic plan (for binary alphabet).•Build DFA from pattern.•Simulate DFA with text as input.13Don KnuthVaughan PrattJim MorrisDFAforpatterna a b a a acceptpattern in textrejectpattern NOT in texttexta a a b a a b a a a bDFA review.•Finite number of states (including start and accept).•Exactly one transition for each input symbol.•Accept if sequence of transitions leads to accept state.Q. Which bitstrings does this DFA accept?14Deterministic finite-state automataOne state for each pattern character.•Match input character: move from i to i+1.•Mismatch: move to previous state.Knuth-Morris-Pratt
View Full Document