Algorithms, 4th Edition·Robert Sedgewick and Kevin Wayne·Copyright © 2002–2012·April 18, 2012 6:10:29 AMAlgorithmsFOUR T H EDIT IONR O B E R T S EDG EWICK K EVIN W A Y N E5.3 SUBSTRING SEARCH‣brute force‣Knuth-Morris-Pratt‣Boyer-Moore‣Rabin-Karp2Substring searchGoal. Find pattern of length M in a text of length N.typically N >> MSubstring search N E E D L EI N A H A Y S T A C K N E E D L E I N Amatchpatter ntext3Substring search applicationsGoal. Find pattern of length M in a text of length N.Computer forensics. !Search memory or disk for signatures,e.g., all URLs or RSA keys that the user has entered.typically N >> Mhttp://citp.princeton.edu/memorySubstring search N E E D L EI N A H A Y S T A C K N E E D L E I N Amatchpatter ntext4Substring search applicationsGoal. Find pattern of length M in a text of length N.Identify patterns indicative of spam. • PROFITS • L0SE WE1GHT • herbal Viagra • There is no catch. • This is a one-time mailing. • This message is sent in compliance with spam regulations. typically N >> MSubstring search N E E D L EI N A H A Y S T A C K N E E D L E I N Amatchpatter ntext5Substring search applicationsElectronic surveillance.Need to monitor all internet traffic.(security)No way!(privacy)Well, we’re mainlyinterested in“ATTACK AT DAWN”OK. Build amachine that just looks for that.“ATTACK AT DAWN”substring searchmachinefound6Substring search applicationsScreen scraping. Extract relevant data from web page.Ex. Find string delimited by <b> and </b> after first occurrence ofpattern Last Trade:.http://finance.yahoo.com/q?s=goog...<tr><td class= "yfnc_tablehead1"width= "48%">Last Trade:</td><td class= "yfnc_tabledata1"><big><b>452.92</b></big></td></tr><td class= "yfnc_tablehead1"width= "48%">Trade Time:</td><td class= "yfnc_tabledata1">...7Screen scraping: Java implementationJava library. The indexOf() method in Java's string library returns the index of the first occurrence of a given string, starting at a given offset.public class StockQuote{ public static void main(String[] args) { String name = "http://finance.yahoo.com/q?s="; In in = new In(name + args[0]); String text = in.readAll(); int start = text.indexOf("Last Trade:", 0); int from = text.indexOf("<b>", start); int to = text.indexOf("</b>", from); String price = text.substring(from + 3, to); StdOut.println(price); } }% java StockQuote goog582.93% java StockQuote msft24.848‣brute force‣Knuth-Morris-Pratt‣Boyer-Moore‣Rabin-KarpCheck for pattern starting at each text position.9Brute-force substring searchBrute-force substring search i j i+j 0 1 2 3 4 5 6 7 8 9 10 A B A C A D A B R A C 0 2 2 A B R A 1 0 1 A B R A 2 1 3 A B R A 3 0 3 A B R A 4 1 5 A B R A 5 0 5 A B R A 6 4 10 A B R A entries in gray arefor reference onlyentries in blackmatch the textreturn i when j is Mentries in red aremismatchestxtpatmatchCheck for pattern starting at each text position.public static int search(String pat, String txt){ int M = pat.length(); int N = txt.length(); for (int i = 0; i <= N - M; i++) { int j; for (j = 0; j < M; j++) if (txt.charAt(i+j) != pat.charAt(j)) break; if (j == M) return i; } return N;}10Brute-force substring search: Java implementationindex in text wherepattern startsnot foundi j i+j 0 1 2 3 4 5 6 7 8 9 10 A B A C A D A B R A C4 3 7 A D A C R5 0 5 A D A C RBrute-force algorithm can be slow if text and pattern are repetitive.Worst case. ~ M N char compares.11Brute-force substring search: worst caseBrute-force substring search (worst case) i j i+j 0 1 2 3 4 5 6 7 8 9 A A A A A A A A A B 0 4 4 A A A A B 1 4 5 A A A A B 2 4 6 A A A A B 3 4 7 A A A A B 4 4 8 A A A A B 5 5 10 A A A A B txtpatBrute-force substring search i j i+j 0 1 2 3 4 5 6 7 8 9 10 A B A C A D A B R A C 0 2 2 A B R A 1 0 1 A B R A 2 1 3 A B R A 3 0 3 A B R A 4 1 5 A B R A 5 0 5 A B R A 6 4 10 A B R A entries in gray arefor reference onlyentries in blackmatch the textreturn i when j is Mentries in red aremismatchestxtpatmatchIn many applications, we want to avoid backup in text stream.•Treat input as stream of data.•Abstract model: standard input.Brute-force algorithm needs backup for every mismatch.Approach 1. Maintain buffer of last M characters.Approach 2. Stay tuned.A A A A A A A A A A A A A A A A A A A A A B A A A A A BBackup12“ATTACK AT DAWN”substring searchmachinefoundA A A A A A A A A A A A A A A A A A A A A B A A A A A Bmatched charsmismatchshift pattern right one positionbackupSame sequence of char compares as previous implementation.• i points to end of sequence of already-matched chars in text.• j stores number of already-matched chars (end of sequence in pattern).public static int search(String pat, String txt){ int i, N = txt.length(); int j, M = pat.length(); for (i = 0, j = 0; i < N && j < M; i++) { if (txt.charAt(i) == pat.charAt(j)) j++; else { i -= j; j = 0; } } if (j == M) return i - M; else return N;}13Brute-force substring search: alternate implementationbackupi j 0 1 2 3 4 5 6 7 8 9 10 A B A C A D A B R A C7 3 A D A C R5 0 A D A C R14Algorithmic challenges in substring searchBrute-force is not always good enough.Theoretical challenge. Linear-time guarantee.Practical challenge. Avoid backup in text stream.often no room or time to save textfundamental algorithmic problemNow is the time for all people to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for many good people to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for a lot of good people to come to the aid of their party. Now is the time for all of the good people to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for each good person to come to the aid of
View Full Document