Robert Sedgewick and Kevin Wayne • Copyright © 2005 • http://www.Princeton.EDU/~cos226Regular ExpressionsReference: Chapter 7.1-7.4, Introduction to Computer Science, R. Sedgewick and K. Wayne.2Pattern MatchingString search. Search for given string in a large text file.Regular expression.! Natural and compact way to express multiple text patterns.! Quintessential programmer's tool.Ex. Fragile X syndrome is a common cause of mental retardation.! Human genome contains triplet repeats of CGG or AGG, starting withGCG and ending with CTG.! Number of repeats is variable, and correlated with syndrome.! Use regular expression to specify pattern: GCG(CGG|AGG)*CTG.3Pattern Matching ApplicationsTest if a string matches some pattern.! Process natural language.! Scan for virus signatures.! Search for information using Google.! Access information in digital libraries.! Retrieve information from Lexis/Nexis.! Search-and-replace in a word processors.! Filter text (spam, NetNanny, Carnivore, malware).! Validate data-entry fields (dates, email, URL, credit card).! Search for markers in human genome using PROSITE patterns.Parse text files.! Compile a Java program.! Crawl and index the Web.! Read in data stored in ad hoc input file format.! Automatically create Java documentation from Javadoc comments.4Regular Expressions: Basic OperationsRegular expression. Notation to specify a set of strings.every other stringaabaabaabaabConcatenationevery other stringaaaababaaba(a|b)aabParentheses(ab)*aab*aaa | baab.u.u.u.Regular ExpressionaaabbbaaababababaabababaaaabbbaClosureUnionWildcardOperationevery other stringaabaabsuccubustumultuouscumulusjugulumNoYes5Regular Expressions: ExamplesRegular expression. Notation is surprisingly expressive.bbbbaabbbaabbbaaabbbaababbaaa* | (a*ba*ba*ba*)*multiple of three b’s111111111403982772100023498701234.*0....fifth to last digit is 0subspacesubspeciesraspberrycrispbread.*spb.*contains the trigraph spbgcgcggcggcggcggctggcgcaggctggcgctggcgcggctggcgcggaggctggcg(cgg|agg)*ctgfragile X syndrome indicatorRegular Expression NoYes6Generalized Regular ExpressionsGeneralized regular expressions.! Additional operations typically added for convenience.! Ex: [a-e]+ is shorthand for (a|b|c|d|e)(a|b|c|d|e)*.111111111166-54-11108540-132119072-5541[0-9]{5}-[0-9]{4}Exactly kdecaderhythm[^aeiou]{6}NegationscamelCase4illegalwordCapitalized[A-Za-z][a-z]*Character classesadebcdeabcdeabcbcdea(bc)+deOne or moreRegular ExpressionOperation NoYes7Regular Expressions in JavaValidity checking. Is input in the set described by the re?public class Validate { public static void main(String[] args) { String re = args[0]; String input = args[1]; System.out.println(input.matches(re)); }}% java Validate "..oo..oo." bloodroottrue% java Validate "[$_A-Za-z][$_A-Za-z0-9]*" ident123true% java Validate "[a-z]+@([a-z]+\.)+(edu|com)" [email protected]% java Validate "[0-9]{3}-[0-9]{2}-[0-9]{4}" 166-11-4433truelegal Java identifiervalid email address (simplified)Social Security numberneed help solving crosswords?8Regular Expressions in Other LanguagesBroadly applicable programmer's tool.! Many languages support extended regular expressions.! Built into grep, awk, emacs, Perl, PHP, Python, JavaScript.PERL. Practical Extraction and Report Language.print all lines containing NEWLINE whichoccurs in any file with a .java extensiongrep NEWLINE */*.javaegrep '^[qwertyuiop]*[zxcvbnm]*$' dict.txt | egrep '...........'replace all occurrences of fromwith to in the file input.txtperl -p -i -e 's|from|to|g' input.txtperl -n -e 'print if /^[A-Za-z][a-z]*$/' dict.txtdo for each line9Regular Expression CaveatsWriting a RE is like writing a program.! Need to learn syntax.! Can be easier to write than read."Sometimes you have a programming problemand it seems like the best solution is to useregular expressions; now you have two problems."10Perl RE for Valid RFC822 Email Addresses(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*:(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[
View Full Document