CS224n Winter 2011

Part of Speech Tagging with Discriminatively Re-ranked Hidden Markov Models

Brian Highfill
Stanford University, Department of Computer Science

Abstract - The task of part of speech tagging has been approached in various ways. Originally carried out with hand-crafted disambiguation rules, the majority of tagging is now accomplished with statistical machine learning methods. Two commonly applied statistical methods are hidden Markov models (HMMs) and an extension of Markov chains combined with a maximum entropy classifier, called maximum entropy Markov models (MEMMs). This paper explores POS tagging by combining a standard HMM tagger with a maximum entropy classifier designed to re-rank the best tag sequences produced by the HMM. Tested on the Brown corpus (Francis, 1979; Francis and Kucera, 1982), the discriminatively re-ranked tagger achieved an accuracy of 91.8%, slightly less than that of the HMM tagger alone.

I. INTRODUCTION

A common task in natural language processing is part of speech (POS) tagging. Automatic POS tagging is used, for instance, to aid speech synthesis systems with pronunciation, and to support information retrieval (IR), parsing, and word sense disambiguation (Jurafsky and Martin, 2009). Statistical POS tagging commonly relies on a corpus of sentences, in a particular language, which has already been tagged with part of speech information. There has also been research on unsupervised methods for when a pre-tagged corpus is not available (Jurafsky and Martin, 2009). The machine learning methods used for supervised POS tagging exploit the statistics of word context to learn models of how parts of speech are associated with the words in a sentence. Once a model has been learned, some form of inference is required to select the best sequence of tags for a particular sentence; this is also known as decoding.

One common statistical approach learns a hidden Markov model of POS assignment. HMMs are generative models which treat part of speech tags as unseen "hidden" states and the words of a sentence as observations. The model assumes the tags form a Markov sequence, such that each tag transitions to the next tag in a sentence with a probability dependent only on the previous tag. At each step of the generative process, a given tag "produces" a word from a probability distribution dependent only on that tag. The goal of HMM POS tagging is first to learn these model parameters from a pre-tagged corpus. When such a tagged "gold standard" training set is available, the transition and observation probabilities can be estimated directly by counting. When used on its own, an HMM POS tagger applies the Viterbi algorithm to find the optimal sequence of tags for a given sentence.
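As a rough illustration of the decoding step just described, the sketch below implements Viterbi decoding for such an HMM in Python. It is not the author's code: the dictionary-based representation of the initial, transition, and emission probabilities, the log-space scoring, and the small floor value for unseen events are all assumptions made for the example.

import math

def viterbi(words, tags, init_p, trans_p, emit_p):
    """Illustrative Viterbi decoder: finds the most probable tag sequence
    under an HMM given initial, transition, and emission probabilities
    (assumed to be plain dicts of probabilities); log-space scoring is
    used to avoid numerical underflow on long sentences."""
    # best[i][t] = log-probability of the best tag sequence for words[:i+1] ending in tag t
    best = [{} for _ in words]
    back = [{} for _ in words]

    # initialization with the first word
    for t in tags:
        best[0][t] = math.log(init_p.get(t, 1e-12)) + math.log(emit_p[t].get(words[0], 1e-12))
        back[0][t] = None

    # recursion over the rest of the sentence
    for i in range(1, len(words)):
        for t in tags:
            score, prev = max(
                (best[i - 1][p] + math.log(trans_p[p].get(t, 1e-12)), p) for p in tags
            )
            best[i][t] = score + math.log(emit_p[t].get(words[i], 1e-12))
            back[i][t] = prev

    # termination and backtrace
    last = max(best[-1], key=best[-1].get)
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

A beam search of the kind used later in the paper can be obtained from the same recursion by keeping only a fixed number of highest-scoring partial sequences at each position instead of all of them.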
Maximum entropy classification is another machine learning method used for POS tagging. MaxEnt models are discriminative and can be used to directly model the posterior probability of a POS tag given a word and its context in the sentence. The MaxEnt method is based on a number of features (in many cases, binary features) which indicate properties of the context that are useful for making tag decisions. It is common to apply an iterative supervised learning algorithm such as expectation-maximization (EM) to estimate the model parameters (Jurafsky and Martin, 2009). Once a MaxEnt model is learned, a slightly modified version of the Viterbi algorithm can be applied to a Markov chain of tags to select the best sequence of tags. Using a Markov chain of tags in this way together with a MaxEnt model for POS tagging is called maximum entropy Markov modeling (MEMM).

II. METHODS

A. Part of Speech Tagging

Given a sequence (sentence) of n words w_1, ..., w_n, we seek the tag sequence t_1, ..., t_n of length n which has the largest posterior probability:

\hat{t}_{1:n} = \arg\max_{t_{1:n}} P(t_{1:n} \mid w_{1:n})

Using a hidden Markov model, or a MaxEnt model, we can estimate this posterior. For our experiment, we built an HMM and a MaxEnt model separately, then used a beam search with the HMM to produce the (approximately) top k most likely tag sequences. Finally, we used our MaxEnt model to score each of these sequences and picked the top choice as the tagging for the sentence.

B. Hidden Markov Models

Using Bayes' rule, the posterior above can be rewritten as

P(t_{1:n} \mid w_{1:n}) \propto P(w_{1:n} \mid t_{1:n}) \, P(t_{1:n}),

that is, as the product of a likelihood and a prior, respectively. Furthermore, making the (Markov) assumptions that part of speech tags transition from one to another depending only on the previous tag, and that observation (word) probabilities depend only on the current tag, the above simplifies to

P(t_{1:n} \mid w_{1:n}) \approx \prod_{i=1}^{n} P(w_i \mid t_i) \, P(t_i \mid t_{i-1}).

Each of these parameters can be learned independently by counting:

P(t_i \mid t_{i-1}) = \alpha \, \big( C(t_{i-1}, t_i) + L \big)

and

P(w_i \mid t_i) = \beta \, \big( C(t_i, w_i) + L \big),

where C(\cdot) denotes a count over the training corpus. The small constant L added to each of the maximum likelihood counts provides smoothing (for our experiment, we used L = 0.001). Using a value of L much smaller than one (Laplace smoothing) results in better smoothing, because less probability mass is "stolen" from the non-zero counts. The constants \alpha and \beta in front of each equation are chosen so that each distribution sums to one over the unconditioned variable; in practice, this is done by simply dividing each distribution by its post-smoothed sum.

Additionally, every HMM is also parameterized by a set of initial state (tag) probabilities. These were computed from counts of the first tag in each of the training sentences:

\pi(t) = P(t_1 = t) = \frac{C_{\text{init}}(t)}{\sum_{t'} C_{\text{init}}(t')},

where C_{\text{init}}(t) is the number of training sentences whose first tag is t.

Dealing with words unseen in the training set is an issue that we also face. To model them, we estimated the probability of an unknown word as the fraction of words in the training sentences observed only once. Then, for each training tag, we modified the expression for P(w_i \mid t_i) above by adding an unknown-word token with the conditional probability given above. Finally, each distribution is renormalized to a proper distribution.

C. MaxEnt Model

The MaxEnt model
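To make the counting and smoothing procedure of Section II.B above concrete, the following Python sketch estimates the initial, transition, and emission distributions from a tagged corpus. It is only an illustration under simplifying assumptions: the corpus format, the <UNK> token, and the exact unknown-word renormalization are choices made for the example rather than details taken from the paper.

from collections import Counter, defaultdict

L = 0.001          # smoothing constant from Section II.B
UNK = "<UNK>"      # illustrative token for unknown words

def train_hmm(tagged_sentences):
    """Estimate initial, transition, and emission distributions by counting,
    with the small constant L added to every count before normalizing.
    tagged_sentences is assumed to be a list of [(word, tag), ...] lists."""
    init_c = Counter()
    trans_c = defaultdict(Counter)
    emit_c = defaultdict(Counter)
    word_c = Counter()

    for sent in tagged_sentences:
        init_c[sent[0][1]] += 1
        for i, (word, tag) in enumerate(sent):
            emit_c[tag][word] += 1
            word_c[word] += 1
            if i > 0:
                trans_c[sent[i - 1][1]][tag] += 1

    tags = set(emit_c)
    # Unknown-word probability: fraction of training words observed only once.
    p_unk = sum(1 for w, c in word_c.items() if c == 1) / sum(word_c.values())

    def normalize(counter, keys):
        # add L to every count, then divide by the post-smoothed sum
        total = sum(counter[k] + L for k in keys)
        return {k: (counter[k] + L) / total for k in keys}

    init_p = normalize(init_c, tags)
    trans_p = {t: normalize(trans_c[t], tags) for t in tags}
    emit_p = {}
    for t in tags:
        vocab = set(emit_c[t]) | {UNK}
        dist = normalize(emit_c[t], vocab)
        dist[UNK] = p_unk                  # attach the unknown-word estimate to this tag
        z = sum(dist.values())             # renormalize to a proper distribution
        emit_p[t] = {w: p / z for w, p in dist.items()}
    return init_p, trans_p, emit_p

The returned dictionaries are in the format expected by the viterbi sketch shown earlier, so the two fragments together form a minimal end-to-end HMM tagger for experimentation.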

