
CS 224n Final Project
Unsupervised Methods for Word Sense Disambiguation
Klaus Völker and Simone Wu

Contents
1. Introduction
1.1 Data Processing
2. Clustering Using Seed Collocations
4. Bayesian Clustering Methods
4.1 Expectation-Maximization
4.2 Multivariate Pólya distribution
Monte Carlo Sampling
Multiple word senses
4.3 Dirichlet Process Model
5. Conclusion and Further Directions
Statement of Collaboration

1. Introduction

In this project we investigate a variety of unsupervised methods for word sense disambiguation. The methods range from a "nearly unsupervised" approach that uses seed collocations for each word sense to nonparametric Bayesian clustering methods that can (we hope) automatically infer the proper number of senses for each word.

Word sense disambiguation is essentially a clustering problem. The general approach is as follows: for each appearance of a given word we extract a context consisting of, say, the m surrounding words. We then group these context snippets into clusters according to some similarity measure. Finally, we interpret each of the resulting clusters as a different sense of the word in question. With conventional clustering algorithms, we need to specify the desired number of clusters as an input parameter or determine it using some ad-hoc heuristic, e.g. one based on the quality of the fit. A better way is to apply nonparametric clustering, e.g. via a Dirichlet process model, which determines the ideal number of clusters, and hence the number of senses, from the training data. One of our goals is to determine whether this approach is practical.

We deliberately avoid using any tagged corpora or dictionaries/thesauri for training our model, since we want our method to be applicable even if such resources are not available. This means our method could also be used to detect new meanings of words that are not yet reflected in any dictionary, or to disambiguate proper names.

Once we have clustered our training examples according to word sense, we can build a classifier that tags the words in an unseen context. In general, such a classifier is implicitly provided by the clustering method we use. For example, in Bayesian clustering we can compute the probability that the new text belongs to a given cluster, and then assign the cluster with the highest probability. The potential benefits of such a tagger for, e.g., information retrieval tasks should be obvious.

1.1 Data Processing

For all trials, unless we indicate otherwise, we used the 1987 portion of the New York Times Annotated Corpus. We used a script to strip the XML formatting and metadata, and performed further processing such as tokenization, removal of punctuation, lowercasing, and in some cases filtering out articles that did not have enough content to be of use to us.

We created a test suite of unigram and bigram pseudowords as the primary method of determining how effectively the various clustering algorithms distinguish between word senses. We also ran some tests on actual ambiguous words such as "space", "capital", and "plant", but since we had to hand-label the data in those cases, it was not practical to test very extensively. We will report some of our results on the first 200 occurrences of these ambiguous words for the sake of completeness, but our pseudoword results will be the focus of our performance analysis, since they are much less time-consuming to produce and not subject to the same level of sampling bias as the necessarily small samples we hand-labeled.

We chose the bigrams for the pseudoword tests from a set of bigrams that appeared in the 1987 NYT data between 500 and 1000 times, so their frequencies are typically quite close. Bigrams are a useful test subject for a disambiguation system because bigrams are more likely than single words to have just one sense, so a pseudoword composed of two bigrams is quite likely to have exactly two distinguishable senses. We tried to choose a range of bigrams, from potentially confusing pairs ("special prosecutor"/"admiral poindexter") to very clear-cut ones ("olive oil"/"armed forces"). By contrast, our unigram pseudowords in some cases combine words that appear with very different frequencies. We use the most frequent sense as our reported baseline. We also chose some single words to combine into pseudowords, which we thought would be a more interesting challenge for our classifier, as most single words tend to have multiple senses.
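To make the pseudoword setup concrete, the following Python sketch shows one way to generate labeled context windows by conflating two target expressions into a single pseudoword. The tokenizer, the window size, and the word pair "plant"/"space" are illustrative assumptions, not our exact pipeline.

    # Sketch of pseudoword construction (illustrative only; the tokenizer,
    # window size, and word pair below are placeholder assumptions).
    import re

    WINDOW = 25                  # words of context kept on each side (assumed value)
    PAIR = ("plant", "space")    # two real words conflated into one pseudoword

    def tokenize(text):
        """Lowercase, strip punctuation, and split on whitespace."""
        return re.sub(r"[^\w\s]", " ", text.lower()).split()

    def pseudoword_instances(articles, pair=PAIR, window=WINDOW):
        """Yield (context_words, true_sense) pairs for the merged pseudoword.

        Every occurrence of either member of `pair` is treated as the
        pseudoword; the original word is kept as the gold sense label.
        """
        for article in articles:
            tokens = tokenize(article)
            for i, tok in enumerate(tokens):
                if tok in pair:
                    context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
                    yield context, tok    # tok is the hidden "true" sense

    # Example usage with a toy two-document corpus:
    toy = ["The power plant released steam.", "The shuttle reached outer space."]
    for ctx, sense in pseudoword_instances(toy):
        print(sense, ctx)

Because the original word is retained as a hidden label, the output of any clustering method can be scored directly against it without hand-labeling.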
2. Clustering Using Seed Collocations

Here we build a semi-supervised disambiguation model that is a variant of Yarowsky's algorithm [1]. This model trains a classifier for each sense on a small number of labeled examples, and then iteratively improves the classifier by adding the unlabeled examples that the classifiers are most confident about to the training set. The initial examples are derived by identifying seed collocations, that is, by looking for examples in which the word of interest co-occurs with words that are very specific to one of the senses. The main difference from Yarowsky's approach is that we use a maximum entropy classifier instead of a decision list.

Here is the algorithm in pseudocode:

1. Identify a small number of seed collocations for each word sense.
2. For each word sense:
   1. Find all examples containing the word of interest and at least one seed collocation for this sense.
   2. Tag these examples as initial training examples.
3. In each iteration:
   1. Train a maximum-entropy classifier on the current set of training examples.
   2. Classify all examples containing the word of interest.
   3. Extract the examples for which Pr(class) exceeds a certain threshold, and use them as training examples for the next iteration.

We use a probability threshold Pr(class) >= 80% and keep iterating until the number of training examples selected no longer increases from one iteration to the next. The maximum entropy classifier uses the following features:

• the word immediately preceding and the word immediately following the word of interest, and
• all words in a 50-word window surrounding the word of interest.

However, we noticed that the classifier can sometimes be thrown off by stopwords, so we removed all stopwords from the set of features [2]. Visual inspection shows that this algorithm does quite well at disambiguating the two most common senses of the words "plant" (living vs. industrial), "power" (electric vs. political), and "space" (outer space vs. volume). In the absence of hand-labeled examples [3], we use the pseudoword technique for a quantitative evaluation of the algorithm's performance.
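As a concrete illustration of the bootstrapping procedure above, here is a minimal sketch that assumes scikit-learn's LogisticRegression as a stand-in for the maximum entropy classifier; the stopword list, window size, and seed dictionary shown are placeholders, not the project's exact implementation.

    # Sketch of the seed-collocation bootstrapping loop from Section 2.
    # Assumption: LogisticRegression serves as the maximum-entropy classifier.
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression

    STOPWORDS = {"the", "a", "an", "of", "in", "to", "and", "is", "was"}  # abbreviated list

    def extract_features(tokens, i, window=25):
        """Features for the target word at position i: the neighbouring words
        plus all non-stopword tokens in a 50-word window (25 on each side)."""
        feats = {}
        if i > 0:
            feats["prev=" + tokens[i - 1]] = 1
        if i + 1 < len(tokens):
            feats["next=" + tokens[i + 1]] = 1
        for w in tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]:
            if w not in STOPWORDS:
                feats["bag=" + w] = 1
        return feats

    def bootstrap(examples, seeds, threshold=0.8, max_iters=20):
        """examples: list of (tokens, target_index); seeds: {sense: set of seed words}.
        Returns a fitted (vectorizer, classifier) pair."""
        feats = [extract_features(toks, i) for toks, i in examples]

        # Steps 1-2: label the examples whose context contains a seed collocation.
        labels = [None] * len(examples)
        for k, (toks, _) in enumerate(examples):
            for sense, seed_words in seeds.items():
                if seed_words & set(toks):
                    labels[k] = sense
        n_labeled = sum(l is not None for l in labels)

        vec = DictVectorizer()
        X = vec.fit_transform(feats)
        clf = LogisticRegression(max_iter=1000)

        # Step 3: retrain and relabel until the confident training set stops growing.
        for _ in range(max_iters):
            train_idx = [k for k, l in enumerate(labels) if l is not None]
            clf.fit(X[train_idx], [labels[k] for k in train_idx])
            proba = clf.predict_proba(X)
            new_labels = [None] * len(examples)
            for k, row in enumerate(proba):
                if row.max() >= threshold:
                    new_labels[k] = clf.classes_[row.argmax()]
            new_count = sum(l is not None for l in new_labels)
            if new_count <= n_labeled:    # no growth: stop iterating
                break
            labels, n_labeled = new_labels, new_count
        return vec, clf

The loop stops as soon as the number of confidently labeled examples no longer grows, mirroring the stopping rule described above.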

