CSC 9010 Natural Language Processing
Lecture 2: Regular Expressions, Finite State Automata
Paula Matuszek, Mary-Angela Papalaskari
01/14/19 CSC 9010 - NLP - Regex, Finite State Automata
Presentation slides adapted from Jim Martin’s course: http://www.cs.colorado.edu/~martin/csci5832.html

Outline: Regular Expressions and Text Searching; Example; Two kinds of Errors; Two Antagonistic Goals; Finite State Automata; More examples; Another FSA for the same language; Formally Specifying a FSA; Dollars and Cents; Recognition; Turing’s way of Visualizing Recognition; D-Recognize; Key Points; Recognition as Search; Generative Formalisms; Review; Three Views; Defining Languages with Productions; Non-Determinism; Non-Determinism cont.; Are Non-deterministic FSA more powerful?; Non-Deterministic Recognition; ND-Recognize Code; Infinite Search; Why Bother?; Compositional Machines; Union; Concatenation; Negation; Intersection

Regular Expressions and Text Searching
• Everybody does it
– Emacs, vi, perl, grep, etc.

Example
• Find me all instances of the word “the” in a text.
– /the/
– /[tT]he/
– /\b[tT]he\b/

Two kinds of Errors
• Matching strings that we should not have matched (there, then, other)
– False positives
• Not matching things that we should have matched (The)
– False negatives

Two Antagonistic Goals
• Accuracy
– (minimize false positives)
• Coverage
– (minimize false negatives)

Finite State Automata
• Idealized machines
for processing regular expressions
• Example: /baa+!/
• [Figure: FSA diagram with the labels “initial state” and “accept state”]
• 5 states
• 5 transitions
• alphabet?

More examples:

Another FSA for the same language:

Formally Specifying a FSA
– The set of states: Q
– A finite alphabet: Σ
– A start state
– A set of accept/final states
– A transition function that maps Q × Σ to Q

Dollars and Cents

Recognition
• Recognition is the process of determining if a string should be accepted by a machine
• Or… it’s the process of determining if a string is in the language we’re defining with the machine
• Or… it’s the process of determining if a regular expression matches a string

Turing’s way of Visualizing Recognition

Recognition
• Begin in the start state
• Examine current input
• Consult the table
• Go to a new state and update the tape pointer.
• When you run out of tape:
– if in accepting state, accept input
– else reject input

D-Recognize

Key Points
• Deterministic means that at each point in processing there is always one unique thing to do (no choices).
• D-recognize is a simple table-driven interpreter
• The algorithm is universal for all unambiguous regular languages.
– To change the machine, you change the table.

Key Points
• Crudely therefore… matching strings with regular expressions is a matter of
– translating the expression into a machine (table) and
– passing the table to an interpreter

Recognition as Search
• You can view this algorithm as a degenerate kind of state-space
search.
• States are pairings of tape positions and state numbers.
• Operators are compiled into the table
• Goal state is a pairing with the end of tape position and a final accept state
• It’s degenerate because?

Generative Formalisms
• Formal Languages are sets of strings composed of symbols from a finite set of symbols.
• Finite-state automata define formal languages (without having to enumerate all the strings in the language)
• The term Generative is based on the view that you can run the machine as a generator to get strings from the language.

Generative Formalisms
• FSAs can be viewed from two perspectives:
– Acceptors that can tell you if a string is in the language
– Generators to produce all and only the strings in the language

Review
• Regular expressions are just a compact textual representation of FSAs
• Recognition is the process of determining if a string/input is in the language defined by some machine.
– Recognition is straightforward with deterministic machines.

Three Views
• Three equivalent formal ways to look at what we’re up to (not including tables)
– Regular Expressions
– Regular Languages
– Finite State Automata

Defining Languages with Productions
S → b a a A
A → a A
A → !

S → NP VP
NP → PrNoun
NP → Det Noun
Det → a | the
Noun → cat | dog | book
PrNoun → samantha | elmer | fido
VP → IVerb | TVerb NP
IVerb → ran | slept | ate
TVerb → hit | kissed | ate

Regular?
Regular language

Non-Determinism
Compare:

Non-Determinism cont.
• Epsilon (ε) transitions:
– Note: these transitions do not examine or advance the tape during recognition

Are Non-deterministic FSA more
powerful?
NO:
• Non-deterministic machines can be converted to deterministic ones with a fairly simple construction
• One way to do recognition with a non-deterministic machine is to turn it into a deterministic one.

Non-Deterministic Recognition
• In a ND FSA there exists at least one path through the
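
The deterministic recognition steps listed under "Recognition" (begin in the start state, examine the current input, consult the table, move to a new state, then accept or reject when the tape runs out) can be sketched as a small table-driven interpreter. This is a minimal sketch for the /baa+!/ example, not the D-Recognize pseudocode from the slides. The state numbering 0 to 4 and the names TRANSITIONS, START_STATE, ACCEPT_STATES, and d_recognize are assumptions made for illustration.

```python
# Hypothetical transition table for /baa+!/ (state numbers are an assumption).
# A missing (state, symbol) entry means "no transition", so the input is rejected.
TRANSITIONS = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,  # loop that absorbs any additional a's
    (3, "!"): 4,
}
START_STATE = 0
ACCEPT_STATES = {4}


def d_recognize(tape, transitions=TRANSITIONS, start=START_STATE,
                accept=ACCEPT_STATES):
    """Return True if the deterministic machine accepts the whole tape."""
    state = start
    for symbol in tape:  # iterating over the string advances the tape pointer
        key = (state, symbol)
        if key not in transitions:
            return False  # no legal move from this state: reject
        state = transitions[key]
    # Out of tape: accept only if we stopped in an accept state.
    return state in accept


if __name__ == "__main__":
    for s in ["baa!", "baaaa!", "ba!", "baa", "baa!a"]:
        print(s, d_recognize(s))
```

In this sketch, changing the machine means changing only the TRANSITIONS table; the interpreter loop stays the same, which is the point made under "Key Points".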
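
The three patterns from the "Example" slide can be compared on a short text to see false positives and false negatives directly. This is a minimal sketch using Python's standard re module. The sample sentence and the names TEXT, PATTERNS, and find_all are assumptions made for illustration and do not come from the slides.

```python
import re

# Sample sentence invented for this sketch (not from the slides).
TEXT = "The cat sat there; then the other dog left."

# The three patterns from the "Example" slide, written as Python raw strings.
PATTERNS = [r"the", r"[tT]he", r"\b[tT]he\b"]


def find_all(pattern, text=TEXT):
    """Return every non-overlapping match of pattern in text."""
    return re.findall(pattern, text)


if __name__ == "__main__":
    for pattern in PATTERNS:
        matches = find_all(pattern)
        print(pattern, len(matches), matches)
```

On this sentence the first pattern misses "The" (a false negative) and also matches inside "there", "then", and "other" (false positives). The second pattern fixes the miss but keeps the false positives. The third pattern, with word boundaries, matches only the two occurrences of the word itself.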