
CORNELL CS 674 - Natural Language Processing


CS674 Natural Language Processing

• Last classes
  – Noisy channel model
  – N-gram models
• Today
  – Part-of-speech tagging
    • introduction

Part of speech tagging

“There are 10 parts of speech, and they are all troublesome.”
-Mark Twain

• POS tags are also known as word classes, morphological classes, or lexical tags.
• Typically much larger than Twain’s 10:
  – Penn Treebank: 45
  – Brown corpus: 87
  – C7 tagset: 146

Part of speech tagging

• Assign the correct part of speech (word class) to each word/token in a document

“The/DT planet/NN Jupiter/NNP and/CC its/PRP moons/NNS are/VBP in/IN effect/NN a/DT mini-solar/JJ system/NN ,/, and/CC Jupiter/NNP itself/PRP is/VBZ often/RB called/VBN a/DT star/NN that/IN never/RB caught/VBN fire/NN ./.”

• Needed as an initial processing step for a number of language technology applications
  – Answer extraction in QA
  – Base step in identifying syntactic phrases for IR systems
  – Critical for word-sense disambiguation (WordNet apps)
  – Information extraction
  – …

Why is p-o-s tagging hard?

• Ambiguity
  – He will race/VB the car.
  – When will the race/NOUN end?
  – The boat floated/VBD down the river.
  – The boat floated/VBN down the river sank.
• Average of ~2 parts of speech for each word
• The number of tags used by different systems varies a lot. Some systems use < 20 tags, while others use > 400.

Hard for Humans

• particle vs. preposition
  – He talked over the deal.
  – He talked over the telephone.
• past tense vs. past participle
  – The horse walked past the barn.
  – The horse walked past the barn fell.
• noun vs. adjective?
  – The executive decision.
• noun vs. present participle
  – Fishing can be fun.

From Ralph Grishman, NYU

To obtain gold standards for evaluation, annotators rely on a set of tagging guidelines.

Penn Treebank Tagset

Among easiest of NLP problems

• State-of-the-art methods achieve ~97% accuracy.
• Simple heuristics can go a long way.
  – ~90% accuracy just by choosing the most frequent tag for a word (MLE)
  – To improve reliability: need to use some of the local context.
• But defining the rules for special cases can be time-consuming, difficult, and prone to errors and omissions

Approaches

1. rule-based: involve a large database of hand-written disambiguation rules, e.g. that specify that an ambiguous word is a noun rather than a verb if it follows a determiner.
2. probabilistic: resolve tagging ambiguities by using a training corpus to compute the probability of a given word having a given tag in a given context.
  – HMM tagger, Maximum Likelihood Tagger
3. hybrid corpus-/rule-based: E.g. transformation-based tagger (Brill tagger); learns symbolic rules based on a corpus.
4. ensemble methods: combine the results of
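The most-frequent-tag heuristic and the determiner rule mentioned in the slides can be illustrated with a small sketch. This is not the course's implementation: the three-sentence training sample, the function names, and the choice of NN as the default tag for unseen words are all assumptions made for illustration. The sketch first tags each word with the tag it carried most often in training, then applies one hand-written rule that uses local context: a word tagged as a verb becomes a noun if it follows a determiner and was seen as a noun in training.

```python
from collections import Counter, defaultdict

# Tiny hand-made training sample (hypothetical; a real tagger would use a
# large annotated corpus such as the Penn Treebank).
TRAIN = [
    [("He", "PRP"), ("will", "MD"), ("race", "VB"), ("the", "DT"), ("car", "NN"), (".", ".")],
    [("They", "PRP"), ("will", "MD"), ("race", "VB"), ("the", "DT"), ("boat", "NN"), (".", ".")],
    [("When", "WRB"), ("will", "MD"), ("the", "DT"), ("race", "NN"), ("end", "VB"), ("?", ".")],
]


def train_most_frequent_tag(tagged_sentences):
    """Count how often each (lowercased) word occurs with each tag."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return counts


def baseline_tag(words, counts, default_tag="NN"):
    """Give every word its most frequent training tag.

    Unseen words get default_tag (an assumption of this sketch).
    """
    tags = []
    for word in words:
        seen = counts.get(word.lower())
        tags.append(seen.most_common(1)[0][0] if seen else default_tag)
    return tags


def apply_determiner_rule(words, tags, counts):
    """Retag a verb as NN when it follows a determiner and was seen as a noun."""
    fixed = list(tags)
    for i in range(1, len(words)):
        seen = counts.get(words[i].lower(), {})
        if fixed[i - 1] == "DT" and fixed[i].startswith("VB") and "NN" in seen:
            fixed[i] = "NN"
    return fixed


counts = train_most_frequent_tag(TRAIN)
words = "When will the race end ?".split()

base = baseline_tag(words, counts)
print(list(zip(words, base)))    # "race" is tagged VB, its most frequent tag

fixed = apply_determiner_rule(words, base, counts)
print(list(zip(words, fixed)))   # "race" is retagged NN because it follows "the"
```

On this sample the baseline mislabels "race" in "the race" as a verb, and the single context rule corrects it. Each such rule has to be written and checked by hand, which is the cost the slides attribute to the rule-based approach.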
