CMSC 723 Computational Linguistics I Session 3 Finite State Morphology Jimmy Lin The iSchool University of Maryland Wednesday September 16 2009 Today s Agenda Computational tools z z z Regular expressions Finite state automata deterministic vs non deterministic Finite state transducers Overview of morphological processes Computational morphology with finite state methods Regular Expressions A metalanguage for specifying simple classes of strings z Very useful in searching and matching text strings Everyone does it z Implementations in the shell Perl Java Python Regular Expressions Basic regular expressions happy happy abcd a b c d a d a b c d a d a d e f g z Tt he the The dog cat dog cat Special metacharacters colou r color colour oo h oo h oh oh ooh ooh oooh oooh oo h ooh oooh ooooh beg n began begin begun begbn NLP with Regular Expressions Transcript with Eliza simulation of a Rogerian psychotherapist Weizenbaum 1966 User Men are all alike ELIZA IN WHAT WAY User They re always bugging us about something or other ELIZA CAN YOU THINK OF A SPECIFIC EXAMPLE User Well my boyfriend made me come here ELIZA YOUR BOYFRIEND MADE YOU COME HERE User He says I m I m depressed much of the time ELIZA I AM SORRY TO HEAR THAT YOU ARE DEPRESSED How did it work all IN WHAT WAY always CAN YOU THINK OF A SPECIFIC EXAMPLE I m depressed sad I AM SORRY TO HEAR YOU ARE 1 I m depressed sad WHY DO YOU THINK YOU ARE 1 Aside What is intelligence What at does Eliza a tell te us about intelligence te ge ce Equivalence Relations We can say the following z z z Regular expressions describe a regular language Regular expressions can be implemented by finite state automata Regular languages can be generated by regular grammars So what Finite State Automata Regular Expressions Regular Languages Regular Grammars Sheeptalk Language baa b baaa baaaa baaaaa R Regular l Expression E i baa Finite State Automaton Finite State b q0 a q1 a q2 q3 a q4 Finite State Automata What are they What at do tthey ey do do How do they work FSA What are they Q a finite set of N states z z z a finite input alphabet of symbols z Q q0 q1 q2 q3 q4 The start state q0 The set of final states F q4 a b q i transition function q i z z Given state q and input symbol i return new state q q3 q4 b q0 a q1 a q2 q3 a q4 FSA State Transition Table State Input a b 0 1 1 2 3 2 3 3 4 b q0 a q1 4 a q2 q3 a q4 FSA What do they do Given a string a FSA either rejects or accepts it z z z z z z z ba reject baa accept baaaz reject baaaa accept baaaaaa accept baa reject moooo reject What does this have to do with NLP z Think grammaticality FSA How do they work q0 q1 q2 b a q3 a b q0 q3 a a q1 q4 ACCEPT a q2 q3 a q4 FSA How do they work q0 q1 q2 b a b q0 a q1 REJECT a q2 q3 a q4 D RECOGNIZE Accept or Generate Formal languages are sets of strings z Finite state automata define formal languages z Strings composed of symbols drawn from a finite alphabet Without having to enumerate all the strings in the language Two views of FSAs z z Acceptors that can tell you if a string is in the language Generators to produce all and only the strings in the language Simple NLP with FSAs Introducing Non Determinism Deterministic vs Non deterministic FSAs Epsilon transitions Using NFSAs to Accept Strings What does it mean z z General approaches z z z Accept there exist at least one path need not be all paths Reject no paths exist Backup add markers at choice points then possibly revisit unexplored arcs at marked choice point Look ahead look ahead in input to provide clues Parallelism look at alternatives in parallel Recognition with NFSAs as search through state space z Agenda holds state tape position pairs ND RECOGNIZE ND RECOGNIZE State Orderings Stack LIFO depth first Queue FIFO O b breadth first eadt st ND RECOGNIZE Example ACCEPT What s the point NFSAs and DFSAs are equivalent z Equivalence between regular expressions and FSA z For every NFSA there is a equivalent DFSA and vice versa Easy to show with NFSAs Why use NFSAs Regular Language Definition is a regular language a a a iss a regular egu a language a guage If L1 and L2 are regular languages then so are z z z L1 L2 x y x L1 y L2 the concatenation of L1 and L2 L1 L2 the union or disjunction of L1 and L2 L1 the Kleene closure of L1 Regular Languages Starting Points Regular Languages Concatenation Regular Languages Disjunction Regular Languages Kleene Closure Finite State Transducers FSTs A two tape automaton that recognizes or generates pairs of strings Think of an FST as an FSA with two symbol strings on each arc z One symbol string from each tape Four fold view of FSTs As a recognizer Ass a ge generator e ato As a translator As a set relater Summary Computational Tools Regular expressions Finite state te state auto automata ata dete deterministic st c vs s non deterministic o dete st c Finite state transducers Computational Morphology Definitions and problems z z What is morphology Topology of morphologies Computational morphology z Finite state methods Morphology Study of how words are constructed from smaller units of meaning Smallest unit of meaning morpheme z z z fox has morpheme fox cats has two morphemes cat and s Note it is useful to distinguish morphemes from orthographic rules Two classes of morphemes z z Stems supply the main meaning Affixes add additional meaning Topology of Morphologies Concatenative vs non concatenative Derivational e at o a vs s inflectional ect o a Regular vs irregular Concatenative Morphology Morpheme Morpheme Morpheme Stems Ste s a also so ca called ed lemma e a base form o root oot lexeme e e e z z Affixes z z hope ing hoping hop ing hopping Prefixes Antidisestablishmentarianism Suffixes Antidisestablishmentarianism Agglutinative languages e g Turkish z z uygarla t ramad klar m zdanm s n zcas na uygar la t r ama d k lar m z dan m s n z cas na Meaning behaving as if you are among those whom we could not cause to become civilized Non Concatenative Morphology Infixes e g Tagalog z z Circumfixes e g German z z hingi borrow humingi borrower sagen say gesagt said Reduplication e e g g Motu Motu spoken in Papua New Guinea z z z mahuta to sleep mahutamahuta to sleep constantly mamahuta to sleep plural Templatic Morphologies Common in Semitic languages Roots oots and a d patte patterns s Arabic Hebrew maktuub written ktuuv written Derivational Morphology Stem morpheme z z Nominalization in English z z z Word with different …
View Full Document
Unlocking...