CMSC 723 / LING 645: Intro to Computational LinguisticsRegular Expressions and Finite State AutomataRegular ExpressionsSlide 4Slide 5Slide 6Disjunction, Grouping, PrecedencePerl CommandsWriting correct expressionsA more complex exampleAdvanced operatorsSubstitutions and MemoryEliza [Weizenbaum, 1966]Eliza-style regular expressionsFinite-state AutomataFinite-state Automata (Machines)Input TapeSlide 18Slide 19State-transition TablesD-RECOGNIZEAdding a failing stateAdding an “all else” arcLanguages and AutomataSlide 25Using NFSAs to accept stringsUsing NFSAsReadings for next timeCMSC 723 / LING 645: Intro to Computational LinguisticsSeptember 8, 2004: MonzRegular Expressions and Finite State Automata (J&M 2) Prof. Bonnie J. DorrDr. Christof MonzTA: Adam LeeRegular Expressions and Finite State AutomataREs: Language for specifying text stringsSearch for document containing a string–Searching for “woodchuck” •Finite-state automata (FSA)(singular: automaton)•How much wood would a woodchuck chuck if a woodchuck would chuck wood?–Searching for “woodchucks” with an optional final “s”Regular ExpressionsBasic regular expression patternsPerl-based syntax (slightly different from other notations for regular expressions)Disjunctions /[wW]oodchuck/Regular ExpressionsRanges [A-Z]Negations [^Ss]Regular ExpressionsOptional characters ? ,* and +–? (0 or 1) •/colou?r/ color or colour–* (0 or more)•/oo*h!/ oh! or Ooh! or Ooooh!*+Stephen Cole Kleene –+ (1 or more) •/o+h!/ oh! or Ooh! or Ooooh!Wild cards .- /beg.n/ begin or began or begunRegular ExpressionsAnchors ^ and $–/^[A-Z]/ “Ramallah, Palestine”–/^[^A-Z]/ “¿verdad?” “really?”–/\.$/ “It is over.”–/.$/ ?Boundaries \b and \B–/\bon\b/ “on my way” “Monday”–/\Bon\b/ “automaton” Disjunction |–/yours|mine/ “it is either yours or mine”Disjunction, Grouping, PrecedenceColumn 1 Column 2 Column 3 …How do we express this?/Column [0-9]+ *//(Column [0-9]+ +)*/Precedence–Parenthesis ()–Counters * + ? {}–Sequences and anchors the ^my end$–Disjunction |REs are greedy!Perl CommandsWhile ($line=<STDIN>){if ($line =~ /the/){print “MATCH: $line”;}}Writing correct expressionsExercise: Write a Perl regular expression to match the English article “the”:/the//[tT]he//\b[tT]he\b//[^a-zA-Z][tT]he[^a-zA-Z]//(^|[^a-zA-Z])[tT]he[^a-zA-Z]/A more complex exampleExercise: Write a regular expression that will match “any PC with more than 500MHz and 32 Gb of disk space for less than $1000”:/$[0-9]+//$[0-9]+\.[0-9][0-9]//\b$[0-9]+(\.[0-9][0-9])?\b//\b$[0-9][0-9]?[0-9]?(\.[0-9][0-9])?\b//\b[0-9]+ *([MG]Hz|[Mm]egahertz| [Gg]igahertz)\b//\b[0-9]+ *(Mb|[Mm]egabytes?)\b//\b[0-9](\.[0-9]+) *(Gb|[Gg]igabytes?)\b/Advanced operatorsshould be _Substitutions and MemorySubstitutionss/colour/color/s/colour/color/gs/([Cc]olour)/$1olor//the (.*)er they were, the $1er they will be//the (.*)er they (.*), the $1er they $2/Substitute as many times as possible!Case insensitive matchings/colour/color/iMemory ($1, $2, etc. refer back to matches)Eliza [Weizenbaum, 1966]User: Men are all alikeELIZA: IN WHAT WAYUser: They’re always bugging us about something or otherELIZA: CAN YOU THINK OF A SPECIFIC EXAMPLE?User: Well, my boyfriend made me come hereELIZA: YOUR BOYFRIEND MADE YOU COME HEREUser: He says I’m depressed much of the timeELIZA: I AM SORRY TO HEAR THAT YOU ARE DEPRESSEDEliza-style regular expressionss/.* YOU ARE (depressed|sad) .*/I AM SORRY TO HEAR YOU ARE \1/s/.* YOU ARE (depressed|sad) .*/WHY DO YOU THINK YOU ARE \1/s/.* all .*/IN WHAT WAY/s/.* always .*/CAN YOU THINK OF A SPECIFIC EXAMPLE/Step 1: replace first person with second person referencess/\bI(’m| am)\b /YOU ARE/gs/\bmy\b /YOUR/gS/\bmine\b /YOURS/gStep 2: use additional regular expressions to generate repliesStep 3: use scores to rank possible transformationsFinite-state AutomataFinite-state automata (FSA)Regular languagesRegular expressionsFinite-state Automata (Machines)/^baa+!$/q0q1q2q3q4b a a !astatetransitionfinalstate baa! baaa! baaaa! baaaaa! ...Input Tapea b a ! bq001 2 3 4b a a!aREJECTInput Tapeb a a aq0q1q2q3q3q4!01 2 3 4b a a!aACCEPTFinite-state AutomataQ: a finite set of N states –Q = {q0, q1, q2, q3, q4}: a finite input alphabet of symbols = {a, b, !}q0: the start stateF: the set of final states–F = {q4}(q,i): transition function –Given state q and input symbol i, return new state q'(q3,!) q4State-transition TablesInputState b a !0 1 Ø Ø1 Ø 2 Ø2 Ø 3 Ø3 Ø 3 44: Ø Ø ØD-RECOGNIZEfunction D-RECOGNIZE (tape, machine) returns accept or reject index Beginning of tape current-state Initial state of machine loop if End of input has been reached then if current-state is an accept state then return accept else return reject elsif transition-table [current-state, tape[index]] is empty then return reject else current-state transition-table [current-state, tape[index]] index index + 1endAdding a failing stateq0q1q2q3q4b a a !aqFa!b!b !bba!Adding an “all else” arcq0q1q2q3q4b a a !aqF====Languages and AutomataCan use FSA as a generator as well as a recognizerFormal language L: defined by machine M that both generates and recognizes all and only the strings of that language. –L(M) = {baa!, baaa!, baaaa!, …}Regular languages vs. non-regular languagesLanguages and AutomataDeterministic vs. Non-deterministic FSAsEpsilon () transitionsUsing NFSAs to accept stringsBackup: add markers at choice points, then possibly revisit unexplored arcs at marked choice point.Look-ahead: look ahead in inputParallelism: look at alternatives in parallelUsing NFSAsInputState b a !0 1 Ø Ø Ø1 Ø 2 Ø Ø2 Ø 2,3 Ø Ø3 Ø Ø 4 Ø4: Ø Ø Ø ØReadings for next timeJ&M Chapter
View Full Document