Berkeley COMPSCI 294 - POS Tagging and HMMs

CS 294-5: Statistical Natural Language Processing
POS Tagging and HMMs
Lecture 7: 9/21/05

Parts-of-Speech

- Syntactic classes of words
- Useful distinctions vary from language to language
- Tagsets vary from corpus to corpus [See M+S p. 142]

Some tags from the Penn tagset

Tag    Description                                    Examples
CC     conjunction, coordinating                      and both but either or
CD     numeral, cardinal                              mid-1890 nine-thirty 0.5 one
DT     determiner                                     a all an every no that the
EX     existential there                              there
FW     foreign word                                   gemeinschaft hund ich jeux
IN     preposition or conjunction, subordinating      among whether out on by if
JJ     adjective or numeral, ordinal                  third ill-mannered regrettable
JJR    adjective, comparative                         braver cheaper taller
JJS    adjective, superlative                         bravest cheapest tallest
MD     modal auxiliary                                can may might will would
NN     noun, common, singular or mass                 cabbage thermostat investment subhumanity
NNP    noun, proper, singular                         Motown Cougar Yvette Liverpool
NNPS   noun, proper, plural                           Americans Materials States
NNS    noun, common, plural                           undergraduates bric-a-brac averages
POS    genitive marker                                ' 's
PRP    pronoun, personal                              hers himself it we them
PRP$   pronoun, possessive                            her his mine my our ours their thy your
RB     adverb                                         occasionally maddeningly adventurously
RBR    adverb, comparative                            further gloomier heavier less-perfectly
RBS    adverb, superlative                            best biggest nearest worst
RP     particle                                       aboard away back by on open through
TO     "to" as preposition or infinitive marker       to
UH     interjection                                   huh howdy uh whammo shucks heck
VB     verb, base form                                ask bring fire see take
VBD    verb, past tense                               pleaded swiped registered saw
VBG    verb, present participle or gerund             stirring focusing approaching erasing
VBN    verb, past participle                          dilapidated imitated reunified unsettled
VBP    verb, present tense, not 3rd person singular   twist appear comprise mold postpone
VBZ    verb, present tense, 3rd person singular       bases reconstructs marks uses
WDT    WH-determiner                                  that what whatever which whichever
WP     WH-pronoun                                     that what whatever which who whom
WP$    WH-pronoun, possessive                         whose
WRB    Wh-adverb                                      however whenever where why

Part-of-Speech Ambiguity

Example (possible tags for each word):

    Fed   raises   interest   rates   0.5   percent
    NNP   NNS      NN         NNS     CD    NN
    VBN   VBZ      VBP        VBZ
    VBD            VB

- Two basic sources of constraint:
  - Grammatical environment
  - Identity of the current word
- Many more possible features … but we won't be able to use them until next class

Why POS Tagging?

- Useful in and of itself:
  - Text-to-speech: record, lead
  - Lemmatization: saw[v] → see, saw[n] → saw
  - Quick-and-dirty NP-chunk detection: grep {JJ | NN}* {NN | NNS}
- Useful as a pre-processing step for parsing:
  - Less tag ambiguity means fewer parses
  - However, some tag choices are better decided by parsers!

    DT  NN      IN NN        VBD     NNS   VBD
    The average of interbank offered rates plummeted …

    DT  NNP     NN     VBD VBN   RP NN   NNS
    The Georgia branch had taken on loan commitments …

HMMs

- We want a generative model over tag sequences t and observations w, using states s
- Assumptions:
  - The tag sequence is generated by an order-n Markov model
  - This corresponds to a 1st-order model over tag n-grams
  - Words are chosen independently, conditioned only on the tag
- These are totally broken assumptions: why?

For a trigram tag model:

    P(T, W) = ∏_i P(t_i | t_{i-1}, t_{i-2}) P(w_i | t_i)

Equivalently, with tag-pair states s_i = <t_{i-1}, t_i> and start state s_0 = <♦, ♦>:

    P(T, W) = ∏_i P(s_i | s_{i-1}) P(w_i | s_i)

[Figure: state chain s_0 → s_1 → s_2 → … → s_n with states <♦, t_1>, <t_1, t_2>, …, <t_{n-1}, t_n>, emitting words w_1 … w_n]

Parameter Estimation

Need two multinomials:

- Transitions: P(t_i | t_{i-1}, t_{i-2})
- Emissions: P(w_i | t_i)

Can get these off a collection of tagged sentences.

Practical Issues with Estimation

- Use standard smoothing methods to estimate transition scores, e.g. linear interpolation:

    P(t_i | t_{i-1}, t_{i-2}) = λ_2 P̂(t_i | t_{i-1}, t_{i-2}) + λ_1 P̂(t_i | t_{i-1})

- Emissions are trickier:
  - Words we've never seen before
  - Words which occur with tags we've never seen
  - One option: break out the Good-Turing smoothing
  - Issue: words aren't black boxes:

        343,127.23   11-year   Minteria   reintroducible

  - Another option: decompose words into features and use a maxent model along with Bayes' rule:

    P(w | t) = P_MAXENT(t | w) P(w) / P(t)

Disambiguation

- Given these two multinomials, we can score any word / tag sequence pair
- In principle, we're done: list all possible tag sequences, score each one, pick the best one (the Viterbi state sequence)

    Fed   raises   interest   rates   0.5   percent   .
    NNP   VBZ      NN         NNS     CD    NN        .

    P(NNP | <♦,♦>) P(Fed | NNP) P(VBZ | <NNP,♦>) P(raises | VBZ) P(NN | VBZ, NNP) …

    NNP VBZ NN NNS CD NN    logP = -23
    NNP NNS NN NNS CD NN    logP = -29
    NNP VBZ VB NNS CD NN    logP = -27

    State sequence: <♦,♦> → <♦,NNP> → <NNP,VBZ> → <VBZ,NN> → <NN,NNS> → <NNS,CD> → <CD,NN> → <STOP>

Finding the Best Trajectory

- Too many trajectories (state sequences) to list
- Option 1: Beam Search
  - A beam is a set of partial hypotheses
  - Start with just the single empty trajectory
  - At each derivation step:
    - Consider all continuations of previous hypotheses
    - Discard most; keep the top k, or those within a factor of the best (or some combination)
- Beam search works relatively well in practice
- … but sometimes you want the optimal answer
- … and you need optimal answers to validate your beam search

[Figure: beam of partial hypotheses: <> → Fed:NNP, Fed:VBN, Fed:VBD → Fed:NNP raises:NNS, Fed:NNP raises:VBZ, Fed:VBN raises:NNS, Fed:VBN raises:VBZ]

The Path Trellis

- Represent paths as a trellis over states
- Each arc (s1:i → s2:i+1) is weighted with the combined cost of:
  - Transitioning from s1 to s2 (which involves some unique tag t)
  - Emitting word i given t
- Each state path (trajectory):
  - Corresponds to a derivation of the word and tag sequence pair
  - Corresponds to a unique sequence of part-of-speech tags
  - Has a probability given …
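The quick-and-dirty NP-chunk pattern {JJ | NN}* {NN | NNS} mentioned under "Why POS Tagging?" can be sketched as a token-level scan over a tag sequence (the slide suggests grep over tags; this scan is an equivalent sketch, and the function name `find_np_chunks` is our own):

```python
def find_np_chunks(tags):
    """Token-level scan for the pattern {JJ|NN}* {NN|NNS}: a run of
    JJ/NN modifiers ending in a NN or NNS head. Returns (start, end)
    token spans. A sketch; real chunkers use trained models."""
    spans, i = [], 0
    while i < len(tags):
        if tags[i] in ("JJ", "NN", "NNS"):
            start = i
            # Consume the {JJ|NN}* modifier run.
            while i < len(tags) and tags[i] in ("JJ", "NN"):
                i += 1
            # Optionally consume a plural NNS head.
            if i < len(tags) and tags[i] == "NNS":
                i += 1
            # A valid chunk must end in NN or NNS.
            if tags[i - 1] in ("NN", "NNS"):
                spans.append((start, i))
        else:
            i += 1
    return spans
```

For "The average rate plummeted …" tagged DT JJ NN VBD, this would pick out the span covering "average rate".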
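The Parameter Estimation step above ("can get these off a collection of tagged sentences") amounts to relative-frequency counting. A minimal unsmoothed sketch, using "<>" as our own choice of start-padding symbol for the ♦ boundary marker:

```python
from collections import Counter

def estimate_hmm(tagged_sentences):
    """Maximum-likelihood estimates for the two multinomials:
    trigram transitions P(t_i | t_{i-2}, t_{i-1}) and emissions
    P(w_i | t_i). Each sentence is a list of (word, tag) pairs;
    '<>' pads the start of each sentence. Unsmoothed sketch."""
    trans = Counter()      # counts of (t_{i-2}, t_{i-1}, t_i)
    context = Counter()    # counts of (t_{i-2}, t_{i-1})
    emit = Counter()       # counts of (tag, word)
    tag_count = Counter()  # counts of each tag
    for sent in tagged_sentences:
        prev2, prev1 = "<>", "<>"
        for word, tag in sent:
            trans[(prev2, prev1, tag)] += 1
            context[(prev2, prev1)] += 1
            emit[(tag, word)] += 1
            tag_count[tag] += 1
            prev2, prev1 = prev1, tag
    # Normalize counts into conditional probabilities.
    p_trans = {k: v / context[k[:2]] for k, v in trans.items()}
    p_emit = {k: v / tag_count[k[0]] for k, v in emit.items()}
    return p_trans, p_emit
```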
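The interpolated transition estimate from "Practical Issues with Estimation" can be sketched as below. Note the slide's formula has only the λ_2 and λ_1 terms; the unigram term here is our own addition so unseen contexts never score exactly zero, and the λ values are illustrative (in practice they are tuned, e.g. on held-out data):

```python
def interpolated_transition(t, prev1, prev2, trigram, bigram, unigram,
                            lam2=0.6, lam1=0.3):
    """Linearly interpolated transition score, as on the slide:
    P(t_i|t_{i-1},t_{i-2}) = lam2*P^(t_i|t_{i-1},t_{i-2}) + lam1*P^(t_i|t_{i-1}),
    plus a unigram back-off term (our addition, not on the slide).
    The probability tables are dicts from tag tuples to MLE estimates."""
    p3 = trigram.get((prev2, prev1, t), 0.0)  # trigram estimate
    p2 = bigram.get((prev1, t), 0.0)          # bigram estimate
    p1 = unigram.get(t, 0.0)                  # unigram back-off
    return lam2 * p3 + lam1 * p2 + (1.0 - lam2 - lam1) * p1
```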
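The Disambiguation slide's "in principle, we're done" approach (score every tag sequence, pick the best) can be sketched directly. The toy probability tables in the usage are hypothetical; the scoring follows P(T,W) = ∏_i P(t_i | t_{i-1}, t_{i-2}) P(w_i | t_i) in log space:

```python
import math
from itertools import product

def score(words, tags, p_trans, p_emit):
    """Log-probability of one (word, tag) sequence pair under
    P(T,W) = prod_i P(t_i | t_{i-2}, t_{i-1}) P(w_i | t_i),
    with '<>' padding the start context."""
    logp, prev2, prev1 = 0.0, "<>", "<>"
    for w, t in zip(words, tags):
        pt = p_trans.get((prev2, prev1, t), 0.0)
        pe = p_emit.get((t, w), 0.0)
        if pt == 0.0 or pe == 0.0:
            return float("-inf")  # impossible sequence
        logp += math.log(pt) + math.log(pe)
        prev2, prev1 = prev1, t
    return logp

def best_by_enumeration(words, candidates, p_trans, p_emit):
    """Brute force: score every tag sequence in the cross product of
    per-word candidate tags, keep the best. Exponential in sentence
    length -- hence the need for beam search or the trellis."""
    return max(product(*candidates),
               key=lambda tags: score(words, tags, p_trans, p_emit))
```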
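The beam search described under "Finding the Best Trajectory" can be sketched as follows: start from the single empty trajectory, extend every hypothesis with each candidate tag, and keep the top k. This sketch does the keep-top-k variant only (no within-a-factor-of-the-best pruning):

```python
import math

def beam_search(words, candidates, p_trans, p_emit, k=2):
    """Beam search over partial taggings. Each hypothesis is a
    (logprob, tags) pair; at each word we consider all continuations
    of previous hypotheses and discard all but the top k."""
    beam = [(0.0, ())]  # the single empty trajectory
    for i, w in enumerate(words):
        extended = []
        for logp, tags in beam:
            prev2 = tags[-2] if len(tags) >= 2 else "<>"
            prev1 = tags[-1] if len(tags) >= 1 else "<>"
            for t in candidates[i]:
                pt = p_trans.get((prev2, prev1, t), 0.0)
                pe = p_emit.get((t, w), 0.0)
                if pt > 0.0 and pe > 0.0:
                    extended.append((logp + math.log(pt) + math.log(pe),
                                     tags + (t,)))
        beam = sorted(extended, reverse=True)[:k]  # keep top k
    return beam[0][1] if beam else None
```

Because the beam may discard the prefix of the true best sequence, the result is not guaranteed optimal, which is exactly why the slide says optimal answers are needed to validate it.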
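The path trellis over tag-pair states <t_{i-1}, t_i> supports an exact search (the Viterbi state sequence named on the Disambiguation slide). The preview cuts off before the algorithm itself, so this is a sketch of the standard dynamic program over that trellis, not the lecture's own code; it keeps, per state, only the best-scoring path into it:

```python
import math

def viterbi(words, tagset, p_trans, p_emit):
    """Exact best trajectory through the trellis of tag-pair states
    <t_{i-1}, t_i>. Each arc from <t_{i-2}, t_{i-1}> to <t_{i-1}, t_i>
    costs log P(t_i | t_{i-2}, t_{i-1}) + log P(w_i | t_i). Sketch:
    no <STOP> transition, zero-probability arcs simply skipped."""
    best = {("<>", "<>"): (0.0, [])}  # state -> (logprob, best tags so far)
    for w in words:
        nxt = {}
        for (p2, p1), (logp, tags) in best.items():
            for t in tagset:
                pt = p_trans.get((p2, p1, t), 0.0)
                pe = p_emit.get((t, w), 0.0)
                if pt == 0.0 or pe == 0.0:
                    continue
                cand = logp + math.log(pt) + math.log(pe)
                state = (p1, t)
                # Keep only the best path into each trellis state.
                if state not in nxt or cand > nxt[state][0]:
                    nxt[state] = (cand, tags + [t])
        best = nxt
    if not best:
        return None
    return max(best.values(), key=lambda v: v[0])[1]
```

Unlike beam search, this never discards a state that could start the optimal suffix, so it returns the true argmax, at cost polynomial in the number of states rather than exponential in sentence length.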

