CS 294-5: Statistical Natural Language Processing
POS Tagging II
Lecture 8: 9/26/05

Recap: POS Ambiguity
  Words are syntactically ambiguous:
    Fed   raises   interest   rates   0.5   percent
    NNP   NNS      NN         NNS     CD    NN
    VBN   VBZ      VBP        VBZ
    VBD            VB
  Two sources of information:
    Clues from the input (current word, next word, capitalization, suffixes, word shape)
    Clues from adjacent hidden labels (connectivity)
    Which of these could HMMs capture?
  Remember: POS sequence models will be the basis of information extraction methods later

Recap: Accuracies
  Roadmap of (known / unknown) accuracies:
    Most freq tag:   ~90% / ~50%
    Trigram HMM:     ~95% / ~55%
    Maxent P(t|w):   93.7% / 82.6%
    TnT (HMM++):     96.2% / 86.0%
    Maxent tagger:   96.9% / 86.9%
    Cyclic tagger:   97.2% / 89.0%
    Upper bound:     ~98%
  Most errors on unknown words

Recap: Errors
  Common errors [from Toutanova & Manning 00]
    NN/JJ     NN
    official  knowledge

    VBD   RP/IN  DT   NN
    made  up     the  story

    RB        VBD/VBN  NNS
    recently  sold     shares

Better Features
  Can do surprisingly well just looking at a word by itself:
    Word              the: the → DT
    Lowercased word   Importantly: importantly → RB
    Prefixes          unfathomable: un- → JJ
    Suffixes          Importantly: -ly → RB
    Capitalization    Meridian: CAP → NNP
    Word shapes       35-year: d-x → JJ
  Then build a maxent (or whatever) model to predict tag
    Maxent P(t|w): 93.7% / 82.6%

Sequence-Free Tagging?
  What about looking at a word and its environment, but no sequence information?
    Add in previous / next word     the __
    Previous / next word shapes     X __ X
    Occurrence pattern features     [X: x X occurs]
    Crude entity detection          __ ….. (Inc.|Co.)
    Phrasal verb in sentence?       put …… __
    Conjunctions of these things
  All features except sequence: 96.6% / 86.8%
  Uses lots of features: > 200K
  Why isn’t this the standard approach?

Maxent Taggers
  One step up: also condition on previous tags
  Train up P(t_i | w, t_{i-1}, t_{i-2}) as a normal maxent problem, then use to score sequences
  This is referred to as a maxent tagger [Ratnaparkhi96]
  Beam search effective! (Why?)
  What’s the advantage of beam size 1?

Feature Templates
  We’ve been sloppy:
    Features: <w0=future, t0=JJ>
    Feature templates: <w0, t0>
  In maxent taggers:
    Can now add edge feature templates:
      <t-1, t0>
      <t-2, t-1, t0>
    Also, mixed feature templates:
      <t-1, w0, t0>

Decoding
  Decoding maxent taggers:
    Just like decoding HMMs
    Viterbi, beam search, posterior decoding
  Viterbi algorithm (HMMs):
  Viterbi algorithm (Maxent):

TBL Tagger
  [Brill 95] presents a transformation-based tagger
  Label the training set with most frequent tags
    DT   MD   VBD  VBD     .
    The  can  was  rusted  .
  Add transformation rules which reduce training mistakes
    MD → NN : DT __
    VBD → VBN : VBD __ .
  Stop when no transformation does sufficient good
  Does this remind anyone of anything?
  Probably the most widely used tagger (esp. outside NLP)
  … but not the most accurate: 96.6% / 82.0%

TBL Tagger II
  What gets learned? [from Brill 95]

EngCG Tagger
  English constraint grammar tagger [Tapanainen and Voutilainen 94]
  Something else you should know about
  Hand-written and knowledge driven
  “Don’t guess if you know” (general point about modeling more structure!)
  Tag set doesn’t make all of the hard distinctions that the standard tag set makes (e.g.
    JJ/NN)
  They get stellar accuracies: 98.5% on their tag set
  Linguistic representation matters…
  … but it’s easier to win when you make up the rules

CRF Taggers
  Newer, higher-powered discriminative sequence models
    CRFs (also voted perceptrons, M3Ns)
    Do not decompose training into independent local regions
    Can be deathly slow to train: they require repeated inference on the training set
  Differences tend not to be too important for POS tagging
  However: one issue worth knowing about in local models
    “Label bias” and other explaining away effects
    Maxent taggers’ local scores can be near one without having both good “transitions” and “emissions”
    This means that often evidence doesn’t flow properly
    Why isn’t this a big deal for POS tagging?

Domain Effects
  Accuracies degrade outside of domain
    Up to triple error rate
    Usually make the most errors on the things you care about in the domain (e.g. protein names)
  Open questions
    How to effectively exploit unlabeled data from a new domain (what could we gain?)
    How to best incorporate domain lexica in a principled way (e.g. UMLS specialist lexicon, ontologies)

Unsupervised Tagging?
  AKA part-of-speech induction
  Task:
    Raw sentences in
    Tagged sentences out
  Obvious thing to do:
    Start with a (mostly) uniform HMM
    Run EM
    Inspect results

EM for HMMs: Quantities
  Remember from last time:
  Can calculate in O(s^2 n) time (why?)

EM for HMMs: Process
  From these quantities, we can re-estimate transitions:
  And emissions:
  If you don’t get these formulas immediately, just think about hard EM instead, where we re-estimate from the Viterbi sequences

Merialdo: Setup
  Some (discouraging) experiments [Merialdo 94]
  Setup:
    You know the set of allowable tags for each word
    Fix k training examples to their true labels
      Set P(w|t) on these examples
      Set P(t|t-1,t-2) on these examples
    Re-estimate with EM for n iterations
  Note: we know allowed tags but not frequencies

Merialdo: Results

So How to Fix It?
  Lots of progress in learning parts-of-speech
    Distributional word clustering methods
    Morphology-driven models
    Contrastive estimation
  Other ideas!
  Stay
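
The Decoding slide notes that maxent taggers are decoded just like HMMs. As a rough illustration, here is a minimal Viterbi sketch in Python for a bigram HMM. The function name `viterbi`, the table layout, and every probability in the toy model are assumptions made up for this example, not values from the lecture. For a maxent tagger, the transition-plus-emission score at each step would be replaced by the local score log P(t_i | w, t_{i-1}); only one previous tag is used here for brevity.

```python
import math


def viterbi(words, tags, log_start, log_trans, log_emit):
    """Return the highest-scoring tag sequence for `words` under a bigram HMM.

    log_start[t]    : log P(t | START)
    log_trans[p][t] : log P(t | p)
    log_emit[t][w]  : log P(w | t); words missing from a row score -inf
    """
    neg_inf = float("-inf")
    # best[i][t] = score of the best tag prefix that ends in tag t at position i
    best = [{t: log_start[t] + log_emit[t].get(words[0], neg_inf) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            emit = log_emit[t].get(words[i], neg_inf)
            # Pick the previous tag that maximizes the score of reaching t.
            prev = max(tags, key=lambda p: best[i - 1][p] + log_trans[p][t])
            best[i][t] = best[i - 1][prev] + log_trans[prev][t] + emit
            back[i][t] = prev
    # Follow back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]


def logs(table):
    """Convert a dict of positive probabilities to log probabilities."""
    return {key: math.log(p) for key, p in table.items()}


# Toy model: all numbers below are invented for illustration only.
TAGS = ["DT", "NN", "MD"]
START = {"DT": 0.8, "NN": 0.1, "MD": 0.1}
TRANS = {
    "DT": {"DT": 0.05, "NN": 0.9, "MD": 0.05},
    "NN": {"DT": 0.1, "NN": 0.3, "MD": 0.6},
    "MD": {"DT": 0.4, "NN": 0.3, "MD": 0.3},
}
EMIT = {
    "DT": {"the": 1.0},
    "NN": {"can": 0.2, "fish": 0.8},
    "MD": {"can": 1.0},
}

LOG_START = logs(START)
LOG_TRANS = {p: logs(row) for p, row in TRANS.items()}
LOG_EMIT = {t: logs(row) for t, row in EMIT.items()}

best_tags = viterbi(["the", "can"], TAGS, LOG_START, LOG_TRANS, LOG_EMIT)
print(best_tags)  # ['DT', 'NN']
```

In this toy model "can" is more likely as MD in isolation, but the DT → NN transition outweighs that, so the sequence model prefers NN after "the". The run time is O(s^2 n) for s tags and n words, since each position considers every pair of adjacent tags.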
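
The "EM for HMMs" slides refer to quantities computed by forward-backward and to re-estimating transitions from them. The following is a small Python sketch of one such E-step and a transition M-step for a bigram HMM, under stated assumptions: the names `forward_backward` and `reestimate_transitions`, the two anonymous tags A and B, and all starting probabilities are hypothetical and chosen only to illustrate the "(mostly) uniform HMM, then run EM" recipe. Probabilities are not rescaled, so this version is suitable only for short sentences.

```python
def forward_backward(words, tags, start, trans, emit):
    """Posterior tag marginals and expected transition counts for one sentence."""
    n = len(words)
    # alpha[i][t] = P(w_0..w_i, t_i = t)
    alpha = [{t: start[t] * emit[t].get(words[0], 0.0) for t in tags}]
    for i in range(1, n):
        alpha.append({
            t: emit[t].get(words[i], 0.0)
            * sum(alpha[i - 1][p] * trans[p][t] for p in tags)
            for t in tags
        })
    # beta[i][t] = P(w_{i+1}..w_{n-1} | t_i = t)
    beta = [{} for _ in range(n)]
    beta[n - 1] = {t: 1.0 for t in tags}
    for i in range(n - 2, -1, -1):
        beta[i] = {
            p: sum(trans[p][t] * emit[t].get(words[i + 1], 0.0) * beta[i + 1][t]
                   for t in tags)
            for p in tags
        }
    z = sum(alpha[n - 1][t] for t in tags)  # P(sentence)
    # gamma[i][t] = P(t_i = t | sentence)
    gamma = [{t: alpha[i][t] * beta[i][t] / z for t in tags} for i in range(n)]
    # xi[(p, t)] = expected number of p -> t transitions in the sentence
    xi = {(p, t): 0.0 for p in tags for t in tags}
    for i in range(n - 1):
        for p in tags:
            for t in tags:
                xi[(p, t)] += (alpha[i][p] * trans[p][t]
                               * emit[t].get(words[i + 1], 0.0)
                               * beta[i + 1][t] / z)
    return gamma, xi


def reestimate_transitions(xi, tags):
    """M-step for transitions: normalize expected counts per previous tag."""
    new_trans = {}
    for p in tags:
        total = sum(xi[(p, t)] for t in tags)
        new_trans[p] = {
            t: (xi[(p, t)] / total if total > 0 else 1.0 / len(tags))
            for t in tags
        }
    return new_trans


# Toy, nearly uniform starting point; every number here is invented.
EM_TAGS = ["A", "B"]
EM_START = {"A": 0.5, "B": 0.5}
EM_TRANS = {"A": {"A": 0.4, "B": 0.6}, "B": {"A": 0.5, "B": 0.5}}
EM_EMIT = {
    "A": {"the": 0.5, "can": 0.3, "rusted": 0.2},
    "B": {"the": 0.2, "can": 0.3, "rusted": 0.5},
}
SENT = ["the", "can", "rusted"]

gamma, xi = forward_backward(SENT, EM_TAGS, EM_START, EM_TRANS, EM_EMIT)
new_trans = reestimate_transitions(xi, EM_TAGS)
```

Each of the n positions sums over all s × s tag pairs, which is one way to see the O(s^2 n) cost mentioned on the Quantities slide. Hard EM, as the slide suggests, would replace the expected counts in `xi` with counts read off the single Viterbi sequence.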