Algorithms, 4th Edition·Robert Sedgewick and Kevin Wayne·Copyright © 2002–2011·April 23, 2012 5:10:27 AMAlgorithmsFOUR T H EDIT IONR O B E R T S EDG EWICK K EVIN W A Y N E5.4 REGULAR EXPRESSIONS‣regular expressions‣REs and NFAs‣NFA simulation‣NFA construction‣applications2‣regular expressions‣NFAs‣NFA simulation‣NFA construction‣applications3Pattern matchingSubstring search. Find a single string in text.Pattern matching. Find one of a specified set of strings in text.Ex. [genomics] •Fragile X syndrome is a common cause of mental retardation.•Human genome contains triplet repeats of CGG or AGG,bracketed by GCG at the beginning and CTG at the end.•Number of repeats is variable, and correlated with syndrome.patterntextGCG(CGG|AGG)*CTGGCGGCGTGTGTGCGAGAGAGTGGGTTTAAAGCTGGCGCGGAGGCGGCTGGCGCGGAGGCTG4Pattern matching: applicationsTest if a string matches some pattern.•Process natural language.•Scan for virus signatures.•Specify a programming language.•Access information in digital libraries. •Search genome using PROSITE patterns.•Filter text (spam, NetNanny, Carnivore, malware).•Validate data-entry fields (dates, email, URL, credit card)....Parse text files.•Compile a Java program.•Crawl and index the Web.•Read in data stored in ad hoc input file format.•Create Java documentation from Javadoc comments....5Regular expressionsA regular expression is a notation to specify a (possibly infinite) set of strings.a “language”operationexample REmatchesdoes not matchconcatenationAABAABAABAABevery other stringorAA | BAABAABAABevery other stringclosureAB*AAAABBBBBBBBAABABABAparenthesesA(A|B)AABAAAABABAABevery other stringparentheses(AB)*AAABABABABABAAAABBA6Regular expression shortcutsAdditional operations are often added for convenience.Ex. [A-E]+ is shorthand for (A|B|C|D|E)(A|B|C|D|E)*operationexample REmatchesdoes not matchwildcard.U.U.U.CUMULUSJUGULUMSUCCUBUSTUMULTUOUScharacter class[A-Za-z][a-z]*wordCapitalizedcamelCase4illegalat least 1A(BC)+DEABCDEABCBCDEADEBCDEexactly k[0-9]{5}-[0-9]{4}08540-132119072-5541111111111166-54-111complement[^AEIOU]{6}RHYTHMDECADE7Regular expression examplesNotation is surprisingly expressiveand plays a well-understood role in the theory of computation.regular expressionmatchesdoes not match.*SPB.*(substring search)RASPBERRYCRISPBREADSUBSPACESUBSPECIES[0-9]{3}-[0-9]{2}-[0-9]{4}(Social Security numbers)166-11-4433166-45-111111-555555558675309[a-z]+@([a-z]+\.)+(edu|com)(email addresses)[email protected]@princeton.eduspam@nowhere[$_A-Za-z][$_A-Za-z0-9]*(Java identifiers)ident3PatternMatcher3aident#38Illegally screening a job candidate[First name of a candidate]! and pre/2 [last name of a candidate] w/7 bush or gore or republican! or democrat! or charg! or accus! or criticiz! or blam! or defend! or iran contra or clinton or spotted owl or florida recount or sex! or controvers! or racis! or fraud! or investigat! or bankrupt! or layoff! or downsiz! or PNTR or NAFTA or outsourc! or indict! or enron or kerry or iraq or wmd! or arrest! or intox! or fired or sex! or racis! or intox! or slur! or arrest! or fired or controvers! or abortion! or gay! or homosexual! or gun! or firearm!Nexis search string used by Monica Goodling in DOJ to vet potential Bush administration appointees“ [First name]! and pre/2 [last name] w/7 bush or gore or republican! or democrat! or charg! or accus! or criticiz! or blam! or defend! or iran contra or clinton or spotted owl or florida recount or sex! or controvers! or racis! or fraud! or investigat! or bankrupt! or layoff! or downsiz! or PNTR or NAFTA or outsourc! or indict! or enron or kerry or iraq or wmd! or arrest! or intox! or fired or sex! or racis! or intox! or slur! or arrest! or fired or controvers! or abortion! or gay! or homosexual! or gun! or firearm! ” — LexisNexis search string used by Monica Goodling to illegally screen candidates for DOJ positionshttp://www.justice.gov/oig/special/s0807/final.pdf9Can the average web surfer learn to use REs?Google. Supports * for full word wildcard and | for union.10Regular expressions to the rescuehttp://xkcd.com/20811Can the average programmer learn to use REs?Perl RE for valid RFC822 email addresses(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*:(?:(?:\r\n)?[
View Full Document