Natural Language Processing Lecture 4 9 5 2013 Jim Martin Today Finish up FSAs FSTs for lexicons and morphology Brief bit on OpenFST Tokenization Segmentation Spelling Minimum edit distance and dynamic programming 01 14 19 Speech and Language Processing Jurafsky and Martin 2 Morphology and FSAs We would like to use the machinery provided by FSAs to capture facts about morphology Accept strings words that are legitimate words in the language Reject those that are not And do so in a way that doesn t require us to list all the forms of all the words in the language Even in English this is inefficient And in other languages it is impossible 01 14 19 Speech and Language Processing Jurafsky and Martin 3 Start Simple Regular singular nouns are ok as is They are in the language Regular plural nouns have an s on the end So they re also in the language Irregulars are ok as is Irregulars with regular endings s need to be blocks 01 14 19 Speech and Language Processing Jurafsky and Martin 4 Simple Rules 01 14 19 Speech and Language Processing Jurafsky and Martin 5 Now Plug in the Words Spelled Out Replace the class names like reg noun with FSAs that recognize all the words in that class 01 14 19 Speech and Language Processing Jurafsky and Martin 6 Derivational Rules If everything is an accept state how do things ever get rejected 01 14 19 Speech and Language Processing Jurafsky and Martin 7 Lexicons So the big picture is to store a lexicon list of words you care about as an FSA The base lexicon is embedded in larger automata that captures the inflectional and derivational morphology of the language So what Well the simplest thing you can do with such an FSA is spell checking If the machine rejects the word isn t in the language Without listing every form of every word 01 14 19 Speech and Language Processing Jurafsky and Martin 8 Parsing Generation vs Recognition We can now run strings through these machines to recognize strings in the language But recognition is usually not quite what we need Often if we find some string in the language we might like to assign a structure to it parsing Or we might start with some structure and want to produce a surface form for it production generation For that we ll move to finite state transducers Add a second tape that can be written to 01 14 19 Speech and Language Processing Jurafsky and Martin 9 Finite State Transducers The simple story Add another tape Add extra symbols to the transitions On one tape we read cats on the other we write cat N PL N and PL are elements in the alphabet for one tape that represent underlying linguistic features 01 14 19 Speech and Language Processing Jurafsky and Martin 10 FSTs 01 14 19 Speech and Language Processing Jurafsky and Martin 11 The Gory Details Of course its not as easy as cat N PL cats As we saw earlier there are geese mice and oxen But there are also a whole host of spelling pronunciation changes that go along with inflectional changes Cats vs Dogs s sound vs z sound Fox and Foxes that e got inserted And doubling consonants swim swimming adding k s picnic picnicked deleting e s 01 14 19 Speech and Language Processing Jurafsky and Martin 12 Multi Tape Machines To deal with these complications we will add even more tapes and use the output of one tape machine as the input to the next So to handle irregular spelling changes we will add intermediate tapes with intermediate symbols 01 14 19 Speech and Language Processing Jurafsky and Martin 13 Multi Level Tape Machines M 1 M 2 We use one machine to transduce between the lexical and the intermediate level M1 and another M2 to handle the spelling changes to the surface tape M1 knows about the particulars of the lexicon M2 knows about weird English spelling rules 01 14 19 Speech and Language Processing Jurafsky and Martin 14 Lexical to Intermediate Level 01 14 19 Speech and Language Processing Jurafsky and Martin 15 Intermediate to Surface The add an e English spelling rule as in fox s foxes 01 14 19 Speech and Language Processing Jurafsky and Martin 16 Foxes 01 14 19 Speech and Language Processing Jurafsky and Martin 17 Foxes 01 14 19 Speech and Language Processing Jurafsky and Martin 18 Foxes 01 14 19 Speech and Language Processing Jurafsky and Martin 19 Note A key feature of this lower machine is that it has to do the right thing for inputs to which it doesn t apply So fox s foxes but bird s birds and cat cat 01 14 19 Speech and Language Processing Jurafsky and Martin 20 Overall Scheme We now have one FST that has explicit information about the lexicon actual words their spelling facts about word classes and regularity Lexical level to intermediate forms We have a larger set of machines that capture orthographic spelling rules Intermediate forms to surface forms One machine for each spelling rule 01 14 19 Speech and Language Processing Jurafsky and Martin 21 Overall Scheme Separate FSTs for each of the English spelling rules we want to capture 01 14 19 Speech and Language Processing Jurafsky and Martin 22 Cascades This is an architecture that we ll see again and again Overall processing is divided up into distinct rewrite steps The output of one layer serves as the input to the next The intermediate tapes may or may not end up being useful in their own right 01 14 19 Speech and Language Processing Jurafsky and Martin 23 Overall Plan Unfortunately as an architecture this is a little unwieldy We don t really want to mess with multiple tapes And we really don t want to mess with multiple machines reading and writing the same tapes in parallel 01 14 19 Speech and Language Processing Jurafsky and Martin 24 Final Scheme 01 14 19 Speech and Language Processing Jurafsky and Martin 25 Intersecting FSTs Recall that for FSAs it was ok to take the intersection of two regular languages and have the result still be regular Regular languages are closed under intersection There s no such guarantee for FSTs Regular relations are not closed under intersection in general But interesting subsets are 01 14 19 Speech and Language Processing Jurafsky and Martin 26 Composing FSTs 1 Create a set of new states that correspond to each pair of states from the original machines New states are called x y where x is a state from M1 and y is a state from M2 2 Create a new FST transition table for the new machine according to the following intuition 01 14 19 Speech and Language Processing Jurafsky and Martin 27 Composition There should be a transition between two states in the new machine if it is
View Full Document
Unlocking...