WSD approaches can be classified along two dimensions.
Depending on use of high-quality, manually created knowledge sources:
◦ Knowledge-lean
◦ Knowledge-rich
Depending on use of labeled data:
◦ Supervised
◦ Semi- or minimally supervised
◦ Unsupervised

Lesk's algorithm: sense si of ambiguous word w is likely to be the intended sense if many of the words used in the dictionary definition of si are also used in the definitions of words in the context window. Only content words are considered.

Example: "... the keyboard of the terminal was ..."
terminal
◦ 1. a point on an electrical device at which electric current enters or leaves
◦ 2. where transport vehicles load or unload passengers or goods
◦ 3. an input-output device providing access to a computer
keyboard
◦ 1. set of keys on a piano, organ, typewriter, typesetting machine, computer, or the like
◦ 2. an arrangement of hooks on which keys or locks are hung
Sense 3 of "terminal" is chosen: its definition shares "computer" with a definition of "keyboard".

Many variants are possible:
◦ Include the examples in dictionary definitions.
◦ Include other manually tagged example texts.
◦ Give more weight to larger overlaps.
◦ Give extra weight to infrequent words occurring in the bags.
Results: simple versions of Lesk achieve accuracy around 50–60%; Lesk plus simple smarts gets to nearly 70%.

Supervised WSD: manually labeled training data is fed to a machine learning algorithm. Training and test sets must be non-overlapping. Why? Evaluating on data seen during training would overestimate accuracy.
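The gloss-overlap idea above can be sketched in a few lines of Python. This is a minimal, simplified Lesk, assuming a toy sense inventory and a hand-picked stopword list; the glosses paraphrase the "terminal"/"keyboard" example from the notes.

```python
# A minimal sketch of simplified Lesk on a toy sense inventory.
# STOPWORDS is an illustrative assumption, not a real resource.
STOPWORDS = {"a", "an", "the", "of", "at", "on", "or", "and",
             "which", "to", "where", "in", "set"}

def content_words(text):
    """Lower-case, strip punctuation, and drop function words."""
    return {w.strip(".,") for w in text.lower().split()} - STOPWORDS

def lesk(senses, context):
    """Pick the sense whose gloss shares the most content words with the context."""
    ctx = content_words(context)
    return max(senses, key=lambda s: len(content_words(senses[s]) & ctx))

terminal_senses = {
    "electrical": "a point on an electrical device at which electric current enters or leaves",
    "transport": "where transport vehicles load or unload passengers or goods",
    "computer": "an input-output device providing access to a computer",
}

# The context bag includes the gloss words of "keyboard" from the example.
context = ("the keyboard of the terminal "
           "set of keys on a piano or typewriter or computer")
print(lesk(terminal_senses, context))  # -> computer
```

Only the "computer" sense shares a content word ("computer") with the context, so it wins the overlap count.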
More annotated data is again expensive to create:
◦ Tractable only for small lexical sample tasks.
A separate classifier is created for each word.

Each training instance is converted to a feature vector. Commonly used features:
◦ Surface form of the target
◦ Part of speech of the target
◦ Unigrams and bigrams in the context of the target word, and their parts of speech
◦ Syntactic dependencies: verb–object, subject–object, ...

Instance: I opened an account at the bank <tag = "financial institution">
Bag-of-words feature vector: [I, opened, an, account, at, the, bank]
◦ Could take position into account.
◦ Exploit collocations such as "fine wine", "blood bank".
All the training instance feature vectors are fed to a machine learning algorithm: decision trees, decision lists, naïve Bayes, ...

A classifier is learnt for each word:
◦ The number of possible classes equals the number of senses seen in the training data.
Convert each unseen test instance into a feature vector and feed it to the classifier, which assigns a suitable sense/class to it.
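The two kinds of context features above (position-blind bag of words vs. position-aware collocations) can be sketched as follows. The window sizes and feature names (`word_-1`, etc.) are illustrative assumptions.

```python
# Sketch of bag-of-words vs. collocation features for one training instance.
from collections import Counter

def bow_features(tokens, target, window=3):
    """Count words within +/-window of the target, ignoring position."""
    i = tokens.index(target)
    ctx = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
    return Counter(ctx)

def collocation_features(tokens, target, window=2):
    """Position-aware features such as word_-1 / word_+1 (collocations)."""
    i = tokens.index(target)
    feats = {}
    for offset in range(-window, window + 1):
        if offset != 0 and 0 <= i + offset < len(tokens):
            feats[f"word_{offset:+d}"] = tokens[i + offset]
    return feats

tokens = "I opened an account at the bank".lower().split()
print(bow_features(tokens, "bank"))          # Counter({'account': 1, 'at': 1, 'the': 1})
print(collocation_features(tokens, "bank"))  # {'word_-2': 'at', 'word_-1': 'the'}
```

The collocation features preserve order, so a pattern like "blood bank" (word_-1 = "blood") can act as a strong sense clue even when the bag-of-words counts are ambiguous.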
Naïve Bayes: pick the sense that is most probable given the context. Represent the context by a bag of words.
Let f be the test instance feature vector and S the set of all senses of the target word wt:
  intended sense of wt = argmax_{s in S} P(s|f)
Data sparseness makes P(s|f) hard to estimate directly, so apply Bayes' rule:
  intended sense of wt = argmax_{s in S} P(f|s) P(s)

Independence assumption: P(f|s) is approximated by the product over all individual features j, Π_j P(fj|s), where
  P(fj|s) = count(fj, s) / count(s)
  P(si) = count(si, wt) / count(wt)
Systems using naïve Bayes have achieved accuracies in the range of 62–72% with adequate training data.

Decision lists: an ordered list of strong clues/features to the senses of the target.

Yarowsky (1995): a decision list is learned for each target word, bootstrapped from seeds, a very large corpus, and two heuristics:
◦ One sense per discourse
◦ One sense per collocation
A supervised algorithm is used to build the decision list. Corpus: 460M words, mixed texts.

Think of seed features for each sense:
◦ "manufacturing" in the context of plant: the "industrial building" sense
◦ "life" in the context of plant: the "living thing" sense
Compile the first set of training data.

Iterate:
◦ Create a new decision-list classifier: supervised training with the data tagged so far (the training set), using collocations as features for classification.
◦ Apply the new classifier to the remaining data and tag some new instances.
◦ Optional: apply the one-sense-per-discourse rule wherever one sense now dominates a text (co-training).

Stop when:
◦ the error on the training data is less than a threshold, or
◦ no more training data is covered.
Use the final decision list for WSD. Performance was shown to be as good as a supervised algorithm.

Strengths of the method:
◦ The one-sense heuristics.
◦ Automatically generates a huge training corpus.
◦ Bootstrapping: unsupervised use of a supervised algorithm.
Disadvantages:
◦ Each word must be trained separately.
◦ Works well for homonyms only.
◦ Danger of snowballing error with co-training.
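The naïve Bayes decision rule above can be sketched directly from the count formulas. The toy "bank" training data is invented for illustration, and add-one smoothing is my addition to cope with the data-sparseness problem the notes mention (unsmoothed counts give zero probability to unseen context words).

```python
# Minimal naive-Bayes WSD: argmax_s log P(s) + sum_j log P(f_j | s),
# with add-one smoothing (an assumption; the notes use raw counts).
import math
from collections import Counter, defaultdict

def train(instances):
    """instances: list of (context_words, sense). Returns counts and vocab."""
    sense_counts = Counter()
    word_counts = defaultdict(Counter)
    for words, sense in instances:
        sense_counts[sense] += 1
        word_counts[sense].update(words)
    vocab = {w for words, _ in instances for w in words}
    return sense_counts, word_counts, vocab

def predict(context, sense_counts, word_counts, vocab):
    """Return the sense maximizing the smoothed log-posterior."""
    total = sum(sense_counts.values())
    def score(s):
        logp = math.log(sense_counts[s] / total)          # log P(s)
        denom = sum(word_counts[s].values()) + len(vocab)
        for w in context:                                  # sum_j log P(f_j|s)
            logp += math.log((word_counts[s][w] + 1) / denom)
        return logp
    return max(sense_counts, key=score)

data = [  # invented toy instances for the target word "bank"
    (["opened", "account", "money"], "financial"),
    (["deposit", "account", "loan"], "financial"),
    (["river", "muddy", "water"], "shore"),
]
sc, wc, vocab = train(data)
print(predict(["loan", "money"], sc, wc, vocab))   # -> financial
print(predict(["river", "water"], sc, wc, vocab))  # -> shore
```

Working in log space avoids underflow when the product Π_j P(fj|s) runs over many features.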
Unsupervised WSD approaches:
◦ Choose the sense of the target that is closest in meaning to the context of the target word.
◦ E.g., the Lesk algorithm.
Supervised WSD approaches:
◦ Choose the sense of the target whose context is closest to the training-set contexts of that sense.
◦ E.g., the bag-of-words feature approach using decision lists, naïve Bayes, or the Yarowsky (1995) method.

Word matching helps determine similarity, as in the Lesk algorithm, but it is very limited:
◦ What about word pairs that have different word forms yet are close in meaning?
◦ There are hundreds of thousands of such word pairs.

Example: The bench dismissed the case.
bench
◦ a long seat for two or more persons
◦ the persons who sit as judges
◦ a former wave-cut shore of a sea or lake or floodplain of a river
case
◦ a set of circumstances or conditions
◦ a suit or action in law or equity
The intended senses ("judges" and "lawsuit") share no words in their definitions, so word matching alone cannot link them.

Applications that benefit from measures of semantic similarity:
◦ Cognate identification
◦ Coreference resolution
◦ Document clustering
◦ Information retrieval
◦ Multiword expression identification
◦ Paraphrasing and textual entailment
◦ Question answering
◦ Real-word spelling error detection
◦ Relation extraction
◦ Semantic similarity of texts
◦ Speech recognition
◦ Subjectivity determination
◦ Summarization
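The "bench"/"case" example makes the limitation of surface matching concrete: even the two semantically related glosses have an empty word overlap. A small check, using the glosses from the example and an assumed stopword list:

```python
# Demonstration that exact word matching fails when related senses
# share no surface forms (glosses taken from the bench/case example).
STOP = {"a", "an", "the", "of", "or", "who", "as", "in", "two", "more", "for"}

def content_words(text):
    """Lower-case and drop function words."""
    return set(text.lower().split()) - STOP

bench_judges = content_words("the persons who sit as judges")
case_lawsuit = content_words("a suit or action in law or equity")

# The senses are clearly related, yet their glosses share zero words:
print(bench_judges & case_lawsuit)  # -> set()
```

A Lesk-style overlap score of zero here is exactly why the measures of semantic similarity listed above are needed instead of (or on top of) word matching.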