CORNELL CS 674 - Natural Language Processing

CS674 Natural Language Processing
– Last week: word sense disambiguation
– Today
  » SENSEVAL
  » Noisy channel model: pronunciation variation in speech recognition

SENSEVAL-2 (2001)
– Three tasks
  » Lexical sample
  » All-words
  » Translation
– 12 languages
– Lexicon
  » SENSEVAL-1: from the HECTOR corpus
  » SENSEVAL-2: from WordNet 1.7
– 93 systems from 34 teams

Lexical sample task
– Select a sample of words from the lexicon
– Systems must then tag instances of the sample words in short extracts of text
– SENSEVAL-1: 35 words, 41 tasks
  » 700001 John Dos Passos wrote a poem that talked of "the <tag>bitter</> beat look, the scorn on the lip."
  » 700002 The beans almost double in size during roasting. Black beans are over-roasted and will have a <tag>bitter</> flavour, and insufficiently roasted beans are pale and give a colourless, tasteless drink.

Lexical sample task: SENSEVAL-1

  Nouns (-n)   N      Verbs (-v)   N      Adjectives (-a)  N      Indeterminates (-p)  N
  accident     267    amaze        70     brilliant        229    band                 302
  behaviour    279    bet          177    deaf             122    bitter               373
  bet          274    bother       209    floating         47     hurdle               323
  disability   160    bury         201    generous         227    sanction             431
  excess       186    calculate    217    giant            97     shake                356
  float        75     consume      186    modest           270
  giant        118    derive       216    slight           218
  …            …      …            …      …                …      …                    …
  TOTAL        2756   TOTAL        2501   TOTAL            1406   TOTAL                1785

All-words task
– Systems must tag almost all of the content words in a sample of running text
  » sense-tag all predicates, nouns that are heads of noun-phrase arguments to those predicates, and adjectives modifying those nouns
  » ~5,000 running words of text
  » ~2,000 sense-tagged words

Translation task
– A SENSEVAL-2 task, for Japanese only
– Word sense is defined according to translation distinctions
  » if the head word is translated differently in the given expressional context, it is treated as constituting a different sense
– Word sense disambiguation then means selecting the appropriate English word/phrase/sentence equivalent for a Japanese word

SENSEVAL-2 results

SENSEVAL-2 debriefing: where next?
– Supervised ML approaches worked best
  » Looking at the role of feature selection algorithms
– Need a well-motivated sense inventory
  » Inter-annotator agreement went down when moving to WordNet senses
– Need to tie WSD to real applications
  » The translation task was a good initial attempt

SENSEVAL-3 (2004)
– 14 core WSD tasks, including
  » All-words (English, Italian): 5,000-word sample
  » Lexical sample (7 languages)
– Tasks for identifying semantic roles, multilingual annotations, logical form, and subcategorization frame acquisition

English lexical sample task
– Data collected over the Web from Web users
– At least two word senses guaranteed per word
– 60 ambiguous nouns, adjectives, and verbs
– Test data: ½ created by lexicographers, ½ from the web-based corpus
– Senses from WordNet 1.7.1 and Wordsmyth (verbs)
– Sense maps provided for fine-to-coarse sense mapping
– Multi-word expressions filtered out of the data sets

English lexical sample task: results
– 27 teams, 47 systems
– Most-frequent-sense baseline (sketched below): 55.2% (fine-grained), 64.5% (coarse)
– Most systems significantly above baseline, including some unsupervised systems
– Best system: 72.9% (fine-grained), 79.3% (coarse)
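To make the baseline concrete, here is a minimal sketch of a most-frequent-sense baseline for a lexical sample task. The data format, sense labels, and train/test instances below are invented for illustration; real SENSEVAL data pairs each instance with a WordNet 1.7.1 sense key.

```python
from collections import Counter

# Toy sense-tagged instances: (target_word, sense_label) pairs.
# Labels are invented placeholders, not SENSEVAL sense keys.
train = [("bank", "bank%river"), ("bank", "bank%finance"),
         ("bank", "bank%finance"), ("bitter", "bitter%taste"),
         ("bitter", "bitter%taste"), ("bitter", "bitter%resentful")]
test = [("bank", "bank%finance"), ("bank", "bank%river"),
        ("bitter", "bitter%taste")]

# For each target word, pick the sense seen most often in training.
mfs = {}
for word in {w for w, _ in train}:
    senses = Counter(s for w, s in train if w == word)
    mfs[word] = senses.most_common(1)[0][0]

# Score the baseline: predict the most frequent sense for every test instance.
correct = sum(1 for word, gold in test if mfs[word] == gold)
print(f"MFS baseline accuracy: {correct / len(test):.1%}")  # 66.7% on this toy data
```

The 55.2% and 64.5% figures on the slide are this same computation scored against the fine-grained and coarse sense inventories, respectively.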
The pronunciation subproblem

  [spooky music] [music stops]
  Head Knight of Ni: Ni!
  Knights of Ni: Ni! Ni! Ni! Ni! Ni!
  Arthur: Who are you?
  Head Knight: We are the Knights Who Say… 'Ni'! …We are the keepers of the sacred words: 'Ni', 'Peng', and 'Neee-wom'!

The pronunciation subproblem
– Given a series of phones, compute the most probable word that generated them
– Simplifications
  » Assume the correct string of phones is given (a speech recognizer relies on probabilistic estimators for each phone, so it is never entirely sure about the identification of any particular phone)
  » Assume word boundaries are given: "I [ni]…"
– [ni] → neat, the, need, new, knee, to, and you
– Based on the (transcribed) Switchboard corpus
– Contextually induced pronunciation variation

Probabilistic transduction: surface representation → lexical representation
– a string of symbols representing the pronunciation of a word in context → a string of symbols representing the dictionary pronunciation
  » [er] → her, were, are, their, your
  » exacerbated by pronunciation variation: "the" pronounced as THEE or THUH; some aspects of this variation are systematic
– a sequence of letters in a misspelled word → the sequence of letters in the correctly spelled word
  » acress → actress, cress, acres

Noisy channel model
– The channel introduces noise that makes it hard to recognize the true word
– Goal: build a model of the channel so that we can figure out how it modified the true word… so that we can recover it

Decoding algorithm
– A special case of Bayesian inference: Bayesian classification
  » Given an observation, determine which of a set of classes it belongs to
  » Observation: a string of phones
  » Class: a word in the language

Pronunciation subproblem
– Given a string of phones O (e.g. [ni]), determine which word from the lexicon corresponds to it
  » Consider all words in the vocabulary V
  » Select the single word w for which P(w | O) is highest:

    \hat{w} = \arg\max_{w \in V} P(w \mid O)

Bayesian approach
– Use Bayes' rule,

    P(x \mid y) = \frac{P(y \mid x)\, P(x)}{P(y)}

  to transform P(w | O) into a product of two probabilities, each of which is easier to compute than P(w | O):

    \hat{w} = \arg\max_{w \in V} \frac{P(O \mid w)\, P(w)}{P(O)} = \arg\max_{w \in V} \underbrace{P(O \mid w)}_{\text{likelihood}}\; \underbrace{P(w)}_{\text{prior}}

– The denominator P(O) is the same for every candidate word, so it can be dropped from the argmax

Computing the prior
– Use the relative frequency of the word in a large corpus (Brown corpus and Switchboard Treebank):

    w      freq(w)    P(w)
    knee   61         .000024
    the    114,834    .046
    neat   338        .00013
    need   1,417      .00056
    new    2,625      .001

Probabilistic rules for generating pronunciation likelihoods
– Take the rules of pronunciation (see chapter 4 of J&M) and associate them with probabilities
  » e.g. the nasal assimilation rule
– Compute the probabilities from a large labeled corpus (like the transcribed portion of Switchboard)
– Run the rules over the lexicon to generate the different possible surface forms, each with its own probability (see the rule-expansion sketch below)

Sample rules that account for [ni]

Final results
– "new" comes out as the most likely word (see the decoder sketch below)
– Turns out to be wrong – "I …
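Putting the pieces together, here is a minimal sketch of the decoder for the observation O = [ni], assuming the priors from the table above. The per-word likelihoods P([ni] | w) are illustrative placeholders chosen so the example reproduces the slide's outcome; the real values come from the probabilistic pronunciation rules trained on Switchboard.

```python
# Noisy-channel decoding for the observation O = [ni]:
# pick the word w maximizing P(O | w) * P(w).

# Priors P(w): relative corpus frequencies, from the table above.
prior = {"knee": 0.000024, "the": 0.046, "neat": 0.00013,
         "need": 0.00056, "new": 0.001}

# Likelihoods P([ni] | w): how probable it is that each word
# surfaces as [ni]. These values are illustrative placeholders,
# not the numbers derived from the Switchboard-trained rules.
likelihood = {"knee": 1.0, "the": 0.0, "neat": 0.52,
              "need": 0.11, "new": 0.36}

# argmax_w P(O | w) * P(w); P(O) is constant across words, so it is ignored.
scores = {w: likelihood[w] * prior[w] for w in prior}
best = max(scores, key=scores.get)
print(best, scores[best])  # -> new 0.00036 with these placeholder numbers
```

With these placeholder numbers the decoder reproduces the slide's result: "new" scores highest, driven by its relatively large prior, even though (as the truncated final bullet notes) that answer turns out to be wrong in context.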
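The "run the rules over the lexicon" step can be sketched the same way. The mini-lexicon, the single rule, and its probability below are invented for illustration and stand in for the J&M chapter 4 rule set with Switchboard-estimated probabilities.

```python
# Sketch: expand dictionary pronunciations into possible surface forms,
# each with a probability, by applying a probabilistic pronunciation rule.
# The rule and its probability (0.4) are invented for illustration.

# Mini-lexicon: word -> dictionary (lexical) pronunciation as a phone list.
lexicon = {"need": ["n", "iy", "d"], "new": ["n", "uw"], "knee": ["n", "iy"]}

def surface_forms(phones):
    """Yield (surface_form, probability) pairs under one toy rule:
    word-final [d] deletes with probability 0.4."""
    if phones[-1] == "d":
        yield phones[:-1], 0.4   # rule applies: final [d] deleted
        yield phones, 0.6        # rule does not apply
    else:
        yield phones, 1.0        # rule not applicable

for word, pron in lexicon.items():
    for form, p in surface_forms(pron):
        print(word, "->", " ".join(form), f"P={p}")
# "need" yields both [n iy d] and the reduced [n iy] = [ni], which is
# how a dictionary form comes to match the observed phone string.
```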

