CS224N Final Project: Language Model Methods and Metrics

Gary Luu, Stanford University, [email protected]
Fortune, Stanford University, [email protected]

7, 2008

Abstract

We first present skip n-grams interpolated with various other n-grams and measure their ability to improve the performance of language models in terms of perplexity. We then present a language model that relies on content words, i.e. words which are uncommon. We then present a bag generation algorithm and use bag generation as a metric for our language models. Finally, we present a language model that uses clustering by part-of-speech tags.

1 Introduction

In this paper, we examine various methods of improving language models, and we also discuss a possible metric. All language models were trained and tested on the Europarl corpus. All interpolated models have their weights $\lambda_i$ calibrated by discretizing the sample space over $[0, 1]$ and iterating over the discrete weight values. The typical discretization step was 0.05.

Most of our methods are motivated by a tutorial presented by Goodman, titled The State of the Art in Language Modeling [4].

2 Skip N-grams

One of the limitations of n-grams is that as n grows, the memory requirements grow rapidly due to dimensionality, and, also because of dimensionality, our training data becomes more sparse relative to the number of possible n-grams. One way to extend the influence of words further away from our query word, without the excessive memory requirement, is to use skip n-grams. Skip n-grams are also useful in many different permutations, for example, a 3-dimensional skip n-gram that uses $w_{i-1}$ and $w_{i-3}$ as the history, i.e. $P(w_i \mid w_{i-1} w_{i-3})$.

2.1 Method

Our method was to take the unsmoothed skip n-gram $P(w_i \mid w_{i-2})$ and see how performance would improve when we interpolated it with a bigram as we increased the training set size, and also to compare it to a standard trigram model. That way we could see the utility of using the skip n-gram as a replacement for the trigram.

[Figure 1: Skip n-gram performance relative to bigram and trigram baselines. Learning curves of test set perplexity versus training set size (in thousands of sentences) for a Witten-Bell bigram, a Witten-Bell bigram interpolated with $P(w_i \mid w_{i-2})$, and a Witten-Bell trigram.]

2.2 Learning Curve

Here we show the value of skip n-grams as the training set size grows, relative to the baseline bigram and trigram models, which use Witten-Bell smoothing. See Figure 1.

2.2.1 Discussion

As one can see in the data, using a skip n-gram with a bigram model did cause perplexity scores to decrease similarly to the bigram alone, with a slight constant difference. The trigram model showed a much steeper performance improvement with more data. It is likely that the increased training data is more beneficial for higher-dimensional models than for lower-dimensional models, which do not suffer as much from sparsity on the smaller data sets as the higher-dimensional models do.
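To make the interpolation and the discretized weight search concrete, the following is a minimal sketch in Python. It is an illustration under assumptions, not our actual implementation: the function names (train_counts, interp_prob, best_lambda), the <s>/</s> padding tokens, and the crude probability floor for unseen events are all invented for the example, and both components here are unsmoothed MLE estimates (in the experiments the bigram component was Witten-Bell smoothed, while the skip bigram was not).

```python
import math
from collections import defaultdict

def train_counts(sentences):
    """Collect history and joint counts for a bigram P(w_i | w_{i-1})
    and a skip bigram P(w_i | w_{i-2})."""
    hist1, hist2 = defaultdict(int), defaultdict(int)
    bi, skip = defaultdict(int), defaultdict(int)
    for sent in sentences:
        words = ["<s>", "<s>"] + sent + ["</s>"]
        for i in range(2, len(words)):
            hist1[words[i - 1]] += 1
            hist2[words[i - 2]] += 1
            bi[(words[i - 1], words[i])] += 1
            skip[(words[i - 2], words[i])] += 1
    return hist1, hist2, bi, skip

def interp_prob(w, w1, w2, counts, lam):
    """lam * P(w | w1) + (1 - lam) * P(w | w2), both as unsmoothed MLE estimates."""
    hist1, hist2, bi, skip = counts
    p_bi = bi[(w1, w)] / hist1[w1] if hist1[w1] else 0.0
    p_skip = skip[(w2, w)] / hist2[w2] if hist2[w2] else 0.0
    return lam * p_bi + (1.0 - lam) * p_skip

def best_lambda(dev_sentences, counts, step=0.05):
    """Grid-search the interpolation weight over [0, 1] in increments of `step`
    and return the value that minimizes dev-set perplexity."""
    best_ppl, best_lam = float("inf"), 0.0
    for k in range(int(round(1.0 / step)) + 1):
        lam = k * step
        log_prob, n_tokens = 0.0, 0
        for sent in dev_sentences:
            words = ["<s>", "<s>"] + sent + ["</s>"]
            for i in range(2, len(words)):
                p = interp_prob(words[i], words[i - 1], words[i - 2], counts, lam)
                log_prob += math.log(p if p > 0 else 1e-10)  # crude floor for unseen events
                n_tokens += 1
        ppl = math.exp(-log_prob / n_tokens)
        if ppl < best_ppl:
            best_ppl, best_lam = ppl, lam
    return best_lam
```

Usage would be along the lines of counts = train_counts(train_sents) followed by lam = best_lambda(dev_sents, counts), where train_sents and dev_sents are lists of tokenized sentences. The same discretized search extends to mixtures with more than two components by iterating over each weight.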
3 Content Word Model

Oftentimes, n-grams perform poorly because their history is limited. Because of this limited history, phrases can often be rearranged and still be scored similarly by the n-gram. In essence, we attempted to capture the idea of sentence context using a "content" word model, where a "content" word is one which is uncommon, since the most common words are articles and prepositions. However, since prepositions and articles are required for grammatically correct and fluent text, a content word model could not strictly be used by itself.

We created the content word model as follows. Let $S$ be a sentence. Let $C$ be the last content word found in the sentence, where $C$ is the special token $\epsilon$ if we have not seen any content words. Then our language model is

$$P(S) = \prod_i \left[ \lambda_1 P(w_i \mid w_{i-n+1}^{i-1}) + \lambda_2 P(w_i \mid C) \right]$$

where $P(w_i \mid w_{i-n+1}^{i-1})$ can be replaced by any arbitrary n-gram-like language model, and $\lambda_1 + \lambda_2 = 1$. $P(w_i \mid C)$ is computed from the joint bigram distribution of content words and all words like so:

$$P(w_i \mid C) = \frac{P(w_i, C)}{\sum_w P(w, C)}$$

where the joint distribution is calculated as a simple MLE estimate: the count of the pair divided by the total count of (content word, word) pairs. Note that a content word $C \neq w_i$ when computing $P(w_i \mid C)$ unless $C$ occurs more than once in the sentence. Also note that $C$ is not a fixed distance away from $w_i$. We did not perform smoothing on this model. $C$ is always the most recent content word in the history.

3.1 Content Word Definition

We examined using different sets of words as the non-content words. Our method was to find a list of the most commonly used words in the English language and look for a cutoff where we could improve the perplexity of the language model. We started with an initial set of the 250 most common English words, in order, obtained online through an internet search [7].

3.2 Method

We interpolated the content word model with various other models and plotted the performance improvement. We reduced the common-word list down to the 25 most common words, as larger lists had no effect. We tested this content model with Kneser-Ney and Good-Turing smoothing. See Figure 2.

[Figure 2: Performance of various language models interpolated with the content model. Test set perplexity with and without the content model for a Good-Turing bigram, a Kneser-Ney bigram, and a Good-Turing trigram.]

3.3 Discussion

The content word model provided little to no benefit with these language models. In the case of the trigram model, there was absolutely no benefit at all. This is likely because the common-word removal scheme behaves somewhat like trigram and skip n-gram models, but adds little in terms of grammatical fluency. It would be worth looking into building a composite bigram/content-word model $P(w_i \mid w_{i-1}, C)$ in future work.
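As an illustration of how the mixture above could be computed, here is a minimal sketch, assuming a small hard-coded common-word set (the experiments used the 25 most common English words, which are not reproduced here), an ngram_prob(w, history) callback standing in for the arbitrary n-gram component, and example weights $\lambda_1 = 0.7$, $\lambda_2 = 0.3$; none of these specifics come from the paper.

```python
from collections import defaultdict

# Illustrative stand-in for the common-word list; the actual cutoff was the
# 25 most common English words.
COMMON_WORDS = {"the", "of", "and", "a", "to", "in", "is", "it", "you", "that"}

def is_content(word):
    """A 'content' word is any word not on the common-word list."""
    return word.lower() not in COMMON_WORDS

def train_content_bigram(sentences):
    """Unsmoothed MLE counts of (most recent content word C, current word w_i)."""
    joint, context_total = defaultdict(int), defaultdict(int)
    for sent in sentences:
        c = "<eps>"                       # epsilon token: no content word seen yet
        for w in sent:
            joint[(c, w)] += 1            # count w against the PREVIOUS content word
            context_total[c] += 1
            if is_content(w):
                c = w                     # C is always the most recent content word
    return joint, context_total

def sentence_prob(sent, ngram_prob, joint, context_total, lam1=0.7, lam2=0.3):
    """P(S) = prod_i [ lam1 * P(w_i | history) + lam2 * P(w_i | C) ], unsmoothed."""
    prob, c, history = 1.0, "<eps>", []
    for w in sent:
        p_content = joint[(c, w)] / context_total[c] if context_total[c] else 0.0
        prob *= lam1 * ngram_prob(w, history) + lam2 * p_content
        history.append(w)
        if is_content(w):
            c = w
    return prob
```

Because the context c is updated only after a word is counted, a content word is scored against the previous content word rather than itself, matching the note above that $C \neq w_i$ unless the word repeats.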
4 Bag Generation Based Metrics

We developed a metric based on bag generation, as discussed in exercise 4.9 of Jurafsky and Martin [1]. More specifically, the bag generation task uses a language model and searches for the most probable sequence of words given a bag (a set) of words. This task is NP-complete. Informally, it is very similar to machine translation, without the constraints that make translation solvable in polynomial time; arbitrary word ordering is known to be NP-complete [3]. Therefore, we resorted to greedy algorithms in order to run this metric.

4.1 Greedy Algorithm Description

The algorithm can essentially be described as random-restart greedy hill-climbing. We initialize the ...
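Since the greedy search itself can be stated compactly, here is a minimal sketch of random-restart greedy hill-climbing for bag generation. The sentence_logprob callback (any function scoring a candidate word order under a language model), the pairwise-swap neighborhood, and the restart count are illustrative assumptions, not the exact procedure used in the project.

```python
import random

def bag_generation(bag, sentence_logprob, restarts=10, seed=0):
    """Random-restart greedy hill-climbing: start from a random ordering of the
    bag, greedily apply improving pairwise swaps until none helps, and keep the
    best ordering found across all restarts."""
    rng = random.Random(seed)
    best_order, best_score = None, float("-inf")
    for _ in range(restarts):
        order = list(bag)
        rng.shuffle(order)                 # random restart: a fresh random ordering
        score = sentence_logprob(order)
        improved = True
        while improved:                    # hill-climb until a local optimum
            improved = False
            for i in range(len(order)):
                for j in range(i + 1, len(order)):
                    order[i], order[j] = order[j], order[i]
                    new_score = sentence_logprob(order)
                    if new_score > score:
                        score, improved = new_score, True
                    else:
                        order[i], order[j] = order[j], order[i]   # undo the swap
        if score > best_score:
            best_order, best_score = list(order), score
    return best_order, best_score
```

Because the task is NP-complete, this search only finds a local optimum; multiple restarts reduce, but do not eliminate, the chance of missing the most probable ordering.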
