6.863J Natural Language ProcessingLecture 2: Automata, Two-level phonology, & PC-Kimmo(the Hamlet lecture) Instructor: Robert C. Berwick6.863J/9.611J SP03 Lecture 2The Menu Bar• AdministriviaQuestionnaire posted (did you email it?)Lab1: split into Lab1a (this time) Lab1b (next time)• What and How: word processing, or computational morphology• What’s in a word: morphology• Modeling morpho-phonology by finite-state devices• Finite-state automata vs. finite state transducers• Some examples from English• PC-Kimmo & Laboratory 1:how-to6.863J/9.611J SP03 Lecture 2Levels of language• Phonetics/phonology/morphology: what words (or subwords) are we dealing with?• Syntax: What phrases are we dealing with? Which words modify one another?• Semantics: What’s the literal meaning?• Pragmatics: What should you conclude from the fact that I said something? How should you react?6.863J/9.611J SP03 Lecture 2The “spiral notebook” Modelthe dogs ate ice-creamθε dawgz…Sentence‘surface’formNoun phrase Verb phraseVerb Noun Phraseate ice-creamthe dogzλx, xε{dogs}, ate(x, i-c)‘sound’form‘phrase’form‘logical’form6.863J/9.611J SP03 Lecture 2Start with words: they illustrate all the problems (and solutions) in NLP• Parsing wordsCats → CAT + N(oun) + PL(ural)• Used in:• Traditional NLP applications• Finding word boundaries (e.g., Latin, Chinese)• Text to speech (boathouse)• Document retrieval (example next slide)• In particular, the problems of parsing, ambiguity,and computational efficiency (as well as the problems of how people do it)6.863J/9.611J SP03 Lecture 2Example from information retrieval• Keywork retrieval: marsupial or kangaroo or koala• Trying to form equivalence classes - ending not important• Can try to do this without extensive knowledge, but then:organization → organ European → Europegeneralization → generic noise → noisy6.863J/9.611J SP03 Lecture 2Morphology• Morphology is the study of how words are built up from smaller meaningful units called morphemes (morph= shape; logos=word)• Easy in English – what about other languages?6.863J/9.611J SP03 Lecture 2What about other languages?amarenamarainamaríanamenamaronamaránamambanamanamáisamareisamaraisamaríanosamemosamomosamremosamambaisamadamáisamamosamáremeamaraamaríaameamóamaráamambaamaamesamaresamarasamaríasamesamasteamarásamabasamaamasamareamaraamaríaameaméamaréamabaamoFutureSubj.Imp.Subj.CondPresentSubjunPreteriteFutureImperfIndic.ImperfPresent indicativeHow to love in Spanish…incomplete…you canfinish it after Valentine’s Day…6.863J/9.611J SP03 Lecture 2What about other languages?6.863J/9.611J SP03 Lecture 2What about other processes?• Stem: core meaning unit (morpheme) of a word• Affixes: bits and pieces that combine with the stem to modify its meaning and grammatical functionsPrefix: un- , anti-, etc.Suffix: -ity, -ation, etc.Infix:Tagalog: um+hinigi → humingi (borrow)Any infixes in ‘nonexotic’ language like English?Here’s one: un-f******-believable6.863J/9.611J SP03 Lecture 2OK, now how do we deal with this computationally?• What knowledge do we need?• How is that knowledge put to use?• What:duckling; beer (implies what K…?)chase + ed → chased (implies what K?)breakable + un →unbreakable (‘prefix’)• How: a bit trickier, but clearly we are at least doing this kind of mapping…6.863J/9.611J SP03 Lecture 2Our goal: PC-Kimmof lSurface formLexiconiseRulesF LY+ SLexical form6.863J/9.611J SP03 Lecture 2Two parts to the “what”1. Which units can glue to which others (roots and affixes) (or stems and affixes), eg, 2. What ‘spelling changes’ (orthographic changes) occur – like dropping the e in ‘chase + ed’ OK, let’s tackle these one at a time, but first consider a (losing) alternative…6.863J/9.611J SP03 Lecture 2KISS: A (very) large dictionary1. Impractical: some languages associate a single meaning w/ aSagan number of distinct surface forms (600 billion in Turkish)German: Leben+s+versichergun+gesellschaft+s+angestellter(life+CmpAug+insurance+CmpAug+company+CompAug+employee)Chinese compounding: about 3000 ‘words,’ combine to yield tens of thousands2. Speakers don’t represent words as a listWug test (Berko, 1958)Juvenate is rejected slower than pertoire (real prefix matters)6.863J/9.611J SP03 Lecture 2Representing possible roots + affixes as a finite-state automaton/usr/dict/wordsFSM17728 states, 37100 arcs2 sec25K words206K charsclearclevereareverfatfatherWordlistcompilerlc aeveethfaNetwork6.863J/9.611J SP03 Lecture 2Now add in states to get possible combos, as well as features+Adjr+Compb i geThis much is easy – a straightforward fsaStates = equivalence classeslfailaccept06.863J/9.611J SP03 Lecture 2English morphology: what states do we need for the fsa?• As an example, consider adjectivesBig, bigger, biggestCool, cooler, coolest, coollyRed, redder, reddestClear, clearer, clearest, clearly, unclear, unclearlyHappy, happier, happiest, happilyUnhappy, unhappier, unhappiest, unhappilyReal, unreal, silly6.863J/9.611J SP03 Lecture 2Will this fsa work?06.863J/9.611J SP03 Lecture 2Ans: no!• Accepts all adjectives above, but• Also accepts unbig, readly, realest• Common problem: overgeneration• Solution?6.863J/9.611J SP03 Lecture 2Revised picture6.863J/9.611J SP03 Lecture 2How does PC-Kimmo represent this?Here’s what the pc-kimmo fsalooks like – the fsa states are called ‘alternation classes’ or ‘lexicons’6.863J/9.611J SP03 Lecture 2PC-Kimmo states for affix combos (portion) = lexicon treeBegin (Initial)N_root Adj_prefix V_prefix(at start of file english.lex)N_root2N_root1N_suffix GenitiveNumberENDENDENDENDENDAdj_root6.863J/9.611J SP03 Lecture 2Next: what about the spelling changes? That’s harder!ü Which units can glue to which others (roots and affixes) (or stems and affixes) 2. What ‘spelling changes’ (orthographic changes) occur – like dropping the e in ‘chase + ed’6.863J/9.611J SP03 Lecture 2Mapping between surface form & underlying formc h a s e dc h a s e + e dSurface:Underlying:But clearly this can go either way – given the underlying form, we can generate the surface form – so we reallyhave a relation betw. surface & underlying form, viz.:6.863J/9.611J SP03 Lecture 2Conventional notationLexical (underlying) form: c h a s e + e dSurface form: c h a s 0 0 e dThe 0’s “line up” the lexical & surface
View Full Document