Unformatted text preview:

CMSC 723 LING 645 Intro to Computational Linguistics September 8 2004 Monz Regular Expressions and Finite State Automata J M 2 Prof Bonnie J Dorr Dr Christof Monz TA Adam Lee Regular Expressions and Finite State Automata REs Language for specifying text strings Search for document containing a string Searching for woodchuck How much wood would a woodchuck chuck if a woodchuck would chuck wood Searching for woodchucks with an optional final s Finite state automata FSA singular automaton Regular Expressions Basic regular expression patterns Perl based syntax slightly different from other notations for regular expressions Disjunctions wW oodchuck Regular Expressions Ranges A Z Negations Ss Regular Expressions Optional characters and 0 or 1 colou r color or colour 0 or more oo h oh or Ooh or Ooooh 1 or more o h oh or Ooh or Ooooh Wild cards Stephen Cole Kleene beg n begin or began or begun Regular Expressions Anchors and A Z Ramallah Palestine A Z verdad really It is over Boundaries b and B bon b on my way Monday Bon b automaton Disjunction yours mine it is either yours or mine Disjunction Grouping Precedence Column 1 Column 2 Column 3 How do we express this Column 0 9 Column 0 9 Precedence Parenthesis Counters Sequences and anchors Disjunction REs are greedy the my end Perl Commands While line STDIN if line the print MATCH line Writing correct expressions Exercise Write a Perl regular expression to match the English article the the tT he b tT he b a zA Z tT he a zA Z a zA Z tT he a zA Z A more complex example Exercise Write a regular expression that will match any PC with more than 500MHz and 32 Gb of disk space for less than 1000 0 9 0 9 0 9 0 9 b 0 9 0 9 0 9 b b 0 9 0 9 0 9 0 9 0 9 b b 0 9 MG Hz Mm egahertz Gg igahertz b b 0 9 Mb Mm egabytes b b 0 9 0 9 Gb Gg igabytes b Advanced operators should be Substitutions and Memory Substitutions s colour color s colour color g s colour color i Substitute as many times as possible Case insensitive matching Memory 1 2 etc refer back to matches s Cc olour 1olor the er they were the 1er they will be the er they the 1er they 2 Eliza Weizenbaum 1966 User Men are all alike ELIZA IN WHAT WAY User They re always bugging us about something or other ELIZA CAN YOU THINK OF A SPECIFIC EXAMPLE User Well my boyfriend made me come here ELIZA YOUR BOYFRIEND MADE YOU COME HERE User He says I m depressed much of the time ELIZA I AM SORRY TO HEAR THAT YOU ARE DEPRESSED Eliza style regular expressions Step 1 replace first person with second person references s bI m am b YOU ARE g s bmy b YOUR g S bmine b YOURS g Step 2 use additional regular expressions to generate replies s YOU ARE depressed sad I AM SORRY TO HEAR YOU ARE 1 s YOU ARE depressed sad WHY DO YOU THINK YOU ARE 1 s all IN WHAT WAY s always CAN YOU THINK OF A SPECIFIC EXAMPLE Step 3 use scores to rank possible transformations Finite state Automata Finite state automata FSA Regular languages Regular expressions Finite state Automata Machines baa baaa baaaa baaaaa baa a b q0 a q1 a q2 state q3 transition q4 final state Input Tape q0 a b b 0 a a 1 a a 2 REJECT b 3 4 Input Tape q0 q1 q2 q3 b a a b 0 q3 a a 1 q4 a a 2 ACCEPT 3 4 Finite state Automata Q a finite set of N states Q q0 q1 q2 q3 q4 a finite input alphabet of symbols a b q0 the start state F the set of final states F q4 q i transition function Given state q and input symbol i return new state q q3 q4 State transition Tables State 0 1 2 3 4 b 1 Input a 2 3 3 4 D RECOGNIZE function D RECOGNIZE tape machine returns accept or reject index Beginning of tape current state Initial state of machine loop if End of input has been reached then if current state is an accept state then return accept else return reject elsif transition table current state tape index is empty then return reject else current state transition table current state tape index index index 1 end Adding a failing state a b q0 a a q1 q2 b q3 q4 b b b a qF a Adding an all else arc a b q0 a q1 a q2 q3 qF q4 Languages and Automata Can use FSA as a generator as well as a recognizer Formal language L defined by machine M that both generates and recognizes all and only the strings of that language L M baa baaa baaaa Regular languages vs non regular languages Languages and Automata Deterministic vs Non deterministic FSAs Epsilon transitions Using NFSAs to accept strings Backup add markers at choice points then possibly revisit unexplored arcs at marked choice point Look ahead look ahead in input Parallelism look at alternatives in parallel Using NFSAs Input State 0 1 2 3 4 b 1 a 2 2 3 4 Readings for next time J M Chapter 3


View Full Document

UMD CMSC 723 - Intro to Computational Linguistics

Documents in this Course
Lecture 9

Lecture 9

12 pages

Smoothing

Smoothing

15 pages

Load more
Loading Unlocking...
Login

Join to view Intro to Computational Linguistics and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Intro to Computational Linguistics and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?