CORNELL CS 674 - Natural Language Processing - D2451547


School name Cornell University

Course CS 674 - Advanced Language Technologies

Pages 2

This preview shows page 1 out of 2 pages.


CS674 Natural Language Processing

• Last classes
  – Noisy channel model
  – N-gram models
• Today
  – Part-of-speech tagging
    • introduction

Part of speech tagging

“There are 10 parts of speech, and they are all troublesome.”
-Mark Twain

• POS tags are also known as word classes, morphological classes, or lexical tags.
• Typically much larger than Twain’s 10:
  – Penn Treebank: 45
  – Brown corpus: 87
  – C7 tagset: 146

Part of speech tagging

• Assign the correct part of speech (word class) to each word/token in a document

  “The/DT planet/NN Jupiter/NNP and/CC its/PRP moons/NNS are/VBP in/IN effect/NN a/DT mini-solar/JJ system/NN ,/, and/CC Jupiter/NNP itself/PRP is/VBZ often/RB called/VBN a/DT star/NN that/IN never/RB caught/VBN fire/NN ./.”

• Needed as an initial processing step for a number of language technology applications
  – Answer extraction in QA
  – Base step in identifying syntactic phrases for IR systems
  – Critical for word-sense disambiguation (WordNet apps)
  – Information extraction
  – …

Why is p-o-s tagging hard?

• Ambiguity
  – He will race/VB the car.
  – When will the race/NOUN end?
  – The boat floated/VBD down the river.
  – The boat floated/VBN down the river sank.
• Average of ~2 parts of speech for each word
• The number of tags used by different systems varies a lot. Some systems use < 20 tags, while others use > 400.

Hard for Humans

• particle vs. preposition
  – He talked over the deal.
  – He talked over the telephone.
• past tense vs. past participle
  – The horse walked past the barn.
  – The horse walked past the barn fell.
• noun vs. adjective?
  – The executive decision.
• noun vs. present participle
  – Fishing can be fun.

From Ralph Grishman, NYU

To obtain gold standards for evaluation, annotators rely on a set of tagging guidelines.

Penn Treebank Tagset

Among easiest of NLP problems

• State-of-the-art methods achieve ~97% accuracy.
• Simple heuristics can go a long way.
  – ~90% accuracy just by choosing the most frequent tag for a word (MLE)
  – To improve reliability: need to use some of the local context.
• But defining the rules for special cases can be time-consuming, difficult, and prone to errors and omissions

Approaches

1. rule-based: involve a large database of hand-written disambiguation rules, e.g. that specify that an ambiguous word is a noun rather than a verb if it follows a determiner.
2. probabilistic: resolve tagging ambiguities by using a training corpus to compute the probability of a given word having a given tag in a given context.
   - HMM tagger, Maximum Likelihood Tagger
3. hybrid corpus-/rule-based: E.g. transformation-based tagger (Brill tagger); learns symbolic rules based on a corpus.
4. ensemble methods: combine the results of
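
The most-frequent-tag baseline and the determiner rule mentioned in the slides can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and does not come from the lecture: the three-sentence training sample `TRAIN`, the function names, the default tag `NN` for unknown words, and the use of Penn Treebank tags (`NN` rather than the slide's `NOUN`). It is a toy and not a real tagger.

```python
from collections import Counter, defaultdict

# Tiny hand-made training sample (hypothetical; a real tagger would be
# trained on a large annotated corpus such as the Penn Treebank).
TRAIN = [
    [("He", "PRP"), ("will", "MD"), ("race", "VB"), ("the", "DT"), ("car", "NN"), (".", ".")],
    [("When", "WRB"), ("will", "MD"), ("the", "DT"), ("race", "NN"), ("end", "VB"), ("?", ".")],
    [("They", "PRP"), ("race", "VB"), ("every", "DT"), ("day", "NN"), (".", ".")],
]


def train_most_frequent_tag(sentences):
    """Map each lowercased word to the tag it carries most often in training."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag_baseline(words, lexicon, default="NN"):
    """Tag every word with its most frequent tag; unknown words get `default`."""
    return [(w, lexicon.get(w.lower(), default)) for w in words]


def tag_with_determiner_rule(words, lexicon, default="NN"):
    """Baseline plus one hand-written rule using local context:
    a word tagged VB directly after a determiner is retagged NN."""
    fixed = []
    for i, (word, tag) in enumerate(tag_baseline(words, lexicon, default)):
        if tag == "VB" and i > 0 and fixed[i - 1][1] == "DT":
            tag = "NN"
        fixed.append((word, tag))
    return fixed


lexicon = train_most_frequent_tag(TRAIN)
sentence = ["When", "will", "the", "race", "end", "?"]
print(tag_baseline(sentence, lexicon))
print(tag_with_determiner_rule(sentence, lexicon))
```

In this toy sample "race" occurs twice as a verb and once as a noun, so the baseline tags it VB even in "When will the race end?"; the single determiner rule corrects that case. A hand-written rule like this covers only one pattern, which hints at why the slides call rule writing time-consuming and prone to omissions.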
