CSCI 5832 Natural Language Processing
Jim Martin
Lecture 21
01/14/19

Today (4/8)
- Finish WSD
- Start on IE (Chapter 22)

WSD and Selection Restrictions
- Ambiguous arguments
  - Prepare a dish
  - Wash a dish
- Ambiguous predicates
  - Serve Denver
  - Serve breakfast
- Both
  - Serves vegetarian dishes

WSD and Selection Restrictions
- This approach is complementary to the compositional analysis approach.
- You need a parse tree and some form of predicate-argument analysis derived from
  - The tree and its attachments
  - All the word senses coming up from the lexemes at the leaves of the tree
- Ill-formed analyses are eliminated by noting any selection restriction violations.

Problems
- As we saw last time, selection restrictions are violated all the time.
- This doesn't mean that the sentences are ill-formed or preferred less than others.
- This approach needs some way of categorizing and dealing with the various ways that restrictions can be violated.

Supervised ML Approaches
- That's too hard; try something empirical.
- In supervised machine learning approaches, a training corpus of words tagged in context with their sense is used to train a classifier that can tag words in new text (that reflects the training text).

WSD Tags
- What's a tag?
  - A dictionary sense?
- For example, for WordNet an instance of "bass" in a text has 8 possible tags or labels (bass1 through bass8).

WordNet Bass
The noun "bass" has 8 senses in WordNet:
1. bass: the lowest part of the musical range
2. bass, bass part: the lowest part in polyphonic music
3. bass, basso: an adult male singer with the lowest voice
4. sea bass, bass: flesh of lean-fleshed saltwater fish of the family Serranidae
5. freshwater bass, bass: any of various North American lean-fleshed freshwater fishes especially of the genus Micropterus
6. bass, bass voice, basso: the lowest adult male singing voice
7. bass: the member with the lowest range of a family of musical instruments
8. bass: nontechnical name for any of numerous edible marine and freshwater
   spiny-finned fishes

Representations
- Most supervised ML approaches require a very simple representation for the input training data.
  - Vectors of sets of feature/value pairs
  - I.e., files of comma-separated values
- So our first task is to extract training data from a corpus with respect to a particular instance of a target word.
  - This typically consists of a characterization of the window of text surrounding the target.

Representations
- This is where ML and NLP intersect.
  - If you stick to trivial surface features that are easy to extract from a text, then most of the work is in the ML system.
  - If you decide to use features that require more analysis (say, parse trees), then the ML part may be doing less work (relatively), if these features are truly informative.

Surface Representations
- Collocational and co-occurrence information
  - Collocational
    - Encode features about the words that appear in specific positions to the right and left of the target word
    - Often limited to the words themselves as well as their part of speech
  - Co-occurrence
    - Features characterizing the words that occur anywhere in the window, regardless of position
    - Typically limited to frequency counts

Examples
- Example text (WSJ)
  - An electric guitar and bass player stand off to one side, not really part of the scene, just as a sort of nod to gringo expectations perhaps.
  - Assume a window of ±2 from the target

Collocational
- Position-specific information about the words in the window
- guitar and bass player stand
  - [guitar, NN, and, CJC, player, NN, stand, VVB]
  - In other words, a vector consisting of [position n word, position n part-of-speech, ...]

Co-occurrence
- Information about the words that occur within the window
- First derive a set of terms to place in the vector.
- Then note how often
  each of those terms occurs in a given window.

Co-Occurrence Example
- Assume we've settled on a possible vocabulary of 12 words that includes "guitar" and "player" but not "and" and "stand".
- guitar and bass player stand
  - [0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0]

Classifiers
- Once we cast the WSD problem as a classification problem, all sorts of techniques are possible:
  - Naïve Bayes (the right thing to try first)
  - Decision lists
  - Decision trees
  - MaxEnt
  - Support vector machines
  - Nearest neighbor methods

Classifiers
- The choice of technique, in part, depends on the set of features that have been used.
  - Some techniques work better/worse with features with numerical values.
  - Some techniques work better/worse with features that have large numbers of possible values.
    - For example, the feature "the word to the left" has a fairly large number of possible values.

Naïve Bayes
- Argmax P(sense | feature vector)
- Rewriting with Bayes and assuming independence of the features:

  $$\hat{s} = \operatorname*{argmax}_{s \in S} P(s) \prod_{j=1}^{n} P(v_j \mid s)$$

Naïve Bayes
- $P(s)$: just the prior of that sense.
  - Just as with part-of-speech tagging, not all senses will occur with equal frequency.
- $P(v_j \mid s)$: conditional probability of some particular feature/value combination given a particular sense.
- You can get both of these from a tagged corpus with the features encoded.

Naïve Bayes Test
- On a corpus of examples of uses of the word "line", naïve Bayes achieved about 73% correct.
- Good?

Problems
- Given these general ML approaches, how many classifiers do I need to perform WSD robustly?
  - One for each ambiguous word in the language
- How do you decide what set of tags/labels/senses to use for a given word?
  - Depends on the application

WordNet Bass
Tagging with this set of senses is an impossibly hard task that's probably overkill for any realistic application.
1. bass: the lowest part of the musical range
2. bass, bass part: the lowest part in polyphonic music
3. bass, basso: an adult male singer with the lowest voice
4. sea bass, bass:
   flesh of lean-fleshed saltwater fish of the family Serranidae
5. freshwater bass, bass: any of various North American lean-fleshed freshwater fishes especially of the genus Micropterus
6. bass, bass voice, basso: the lowest adult male singing voice
7. bass: the member with the lowest range of a family of musical instruments
8. bass: nontechnical name for any of numerous edible marine and freshwater spiny-finned fishes

Semantic Analysis
- When we covered semantic analysis in Chapter 18, we focused on
  - The analysis
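
Looking back at the supervised WSD slides (window features, then naïve Bayes), the pipeline might be sketched roughly as below. This is a minimal illustration, not the lecture's implementation: the function and class names, the 12-word vocabulary and its ordering, the two coarse sense labels, the toy training contexts, and the add-one smoothing are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

def collocational(tokens, tags, i, k=2):
    """Word and POS tag at each position within +/-k of the target at index i."""
    feats = []
    for off in range(-k, k + 1):
        if off == 0:
            continue  # skip the target word itself
        j = i + off
        if 0 <= j < len(tokens):
            feats += [tokens[j], tags[j]]
        else:
            feats += ["<pad>", "<pad>"]  # window runs off the sentence
    return feats

def cooccurrence(tokens, i, vocab, k=2):
    """How often each vocabulary term occurs anywhere in the +/-k window."""
    window = tokens[max(0, i - k):i] + tokens[i + 1:i + 1 + k]
    counts = Counter(window)
    return [counts[w] for w in vocab]

class NaiveBayesWSD:
    """argmax_s P(s) * prod_j P(v_j | s), with add-one smoothing (an assumption)."""

    def fit(self, vectors, senses):
        self.sense_counts = Counter(senses)        # counts for the prior P(s)
        self.feat_counts = defaultdict(Counter)    # sense -> counts of (j, v_j)
        self.values = defaultdict(set)             # j -> values seen at position j
        for vec, s in zip(vectors, senses):
            for j, v in enumerate(vec):
                self.feat_counts[s][(j, v)] += 1
                self.values[j].add(v)
        return self

    def predict(self, vec):
        total = sum(self.sense_counts.values())

        def log_score(s):
            lp = math.log(self.sense_counts[s] / total)
            for j, v in enumerate(vec):
                num = self.feat_counts[s][(j, v)] + 1
                # +1 in the denominator reserves mass for unseen values
                den = self.sense_counts[s] + len(self.values[j]) + 1
                lp += math.log(num / den)
            return lp

        return max(self.sense_counts, key=log_score)

# The window from the slides; "bass" is the target at index 2.
tokens = ["guitar", "and", "bass", "player", "stand"]
tags = ["NN", "CJC", "NN", "NN", "VVB"]
colloc = collocational(tokens, tags, 2)

# Hypothetical 12-word vocabulary; the slides do not list the actual terms.
vocab = ["fishing", "big", "sound", "player", "fly", "rod",
         "pound", "double", "runs", "guitar", "playing", "band"]
cooc = cooccurrence(tokens, 2, vocab)

# Invented toy training data with two coarse, made-up sense labels.
train_vectors = [
    colloc,
    ["the", "AT0", "electric", "AJ0", "guitar", "NN", "solo", "NN"],
    ["caught", "VVD", "a", "AT0", "in", "PRP", "the", "AT0"],
    ["grilled", "AJ0", "sea", "NN", "with", "PRP", "lemon", "NN"],
]
train_senses = ["bass-music", "bass-music", "bass-fish", "bass-fish"]
clf = NaiveBayesWSD().fit(train_vectors, train_senses)

music_guess = clf.predict(["guitar", "NN", "and", "CJC", "line", "NN", "stand", "VVB"])
fish_guess = clf.predict(["caught", "VVD", "a", "AT0", "near", "PRP", "the", "AT0"])
```

Note that this trains a single classifier for the one target word "bass"; as the Problems slide points out, a robust system along these lines would need one such classifier per ambiguous word.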