MSU CSE 842 - LECTURE NOTES
IEEE TRANSACTIONS ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL. ASSP-35, NO. 3, MARCH 1987, pp. 400-401

Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer

SLAVA M. KATZ

Abstract--The description of a novel type of m-gram language model is given. The model offers, via a nonlinear recursive procedure, a computation- and space-efficient solution to the problem of estimating probabilities from sparse data. This solution compares favorably to other proposed methods. While the method has been developed for and successfully implemented in the IBM Real Time Speech Recognizers, its generality makes it applicable in other areas where the problem of estimating probabilities from sparse data arises.

Sparseness of data is an inherent property of any real text, and it is a problem one always encounters while collecting frequency statistics on words and word sequences (m-grams) from a text of finite size. This means that even for a very large data collection, the maximum likelihood estimation method does not allow us to adequately estimate probabilities of rare but nevertheless possible word sequences: many sequences occur only once ("singletons"); many more do not occur at all. The inadequacy of the maximum likelihood estimator and the necessity to estimate the probabilities of m-grams which did not occur in the text constitute the essence of the problem.

The main idea of the proposed solution is to reduce the unreliable probability estimates given by the observed frequencies and to redistribute the "freed" probability "mass" among the m-grams which never occurred in the text. The reduction is achieved by replacing the maximum likelihood estimates for m-grams having low counts with renormalized Turing's estimates [1], and the redistribution is done via the recursive utilization of lower-level conditional distributions. We found Turing's method attractive because of its simplicity and its characterization as the optimal empirical Bayes' estimator of a multinomial probability. Robbins [2] introduces the empirical Bayes' methodology, and Nadas [3] gives various derivations of Turing's formula.

Let N be the sample text size and let n_r be the number of words (m-grams) which occurred in the text exactly r times, so that

    N = \sum_r r n_r.    (1)

Turing's estimate P_T for the probability of a word (m-gram) which occurred in the sample r times is

    P_T = r^* / N,    (2)

where

    r^* = (r + 1) n_{r+1} / n_r.    (3)

We call the procedure of replacing a count r with a modified count r' "discounting" and the ratio r'/r a discount coefficient d_r. When r' = r^*, we have Turing's discounting.

Manuscript received May 1, 1986; revised September 2, 1986. The author is with the Continuous Speech Recognition Group, Department of Computer Sciences, IBM T. J. Watson Research Center, Yorktown Heights, NY 10598. IEEE Log Number 8611734.
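As a concrete illustration of the counting machinery above, the following minimal Python sketch computes the counts-of-counts n_r and the Turing discount coefficients d_r = r*/r from a table of observed m-gram counts. The helper names and the toy text are assumptions made for these notes, not part of the paper.

    from collections import Counter

    def counts_of_counts(ngram_counts):
        """n_r: how many distinct m-grams occurred exactly r times (used in Eq. (1))."""
        return Counter(ngram_counts.values())

    def turing_discounts(ngram_counts):
        """Return {r: d_r} with d_r = r*/r and r* = (r + 1) n_{r+1} / n_r (Eqs. (2)-(3))."""
        n = counts_of_counts(ngram_counts)
        d = {}
        for r in sorted(n):
            if n.get(r + 1, 0) > 0:          # r* is only defined here when n_{r+1} > 0
                r_star = (r + 1) * n[r + 1] / n[r]
                d[r] = r_star / r
        return d

    # Toy example: unigram ("1-gram") counts from a tiny text
    counts = Counter("to be or not to be that is the question".split())
    print(counts_of_counts(counts))   # Counter({1: 6, 2: 2})
    print(turing_discounts(counts))   # {1: 0.666...}: singleton counts are discounted to 2/3

In the paper the same counts-of-counts are collected over m-grams for each level m; the toy example uses unigrams only to keep the numbers small.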
Let us denote the m-gram w_1, ..., w_m as w_1^m and the number of times it occurred in the sample text as c(w_1^m). Then the maximum likelihood estimate is

    P_{ML}(w_1^m) = c(w_1^m) / N,    (4)

and Turing's estimate is

    P_T(w_1^m) = c^*(w_1^m) / N,    (5)

where

    c^*(x) = (c(x) + 1) n_{c(x)+1} / n_{c(x)}.

It follows from (1)-(3) that the total probability estimate, using (5), for the set of words (m-grams) that actually occurred in the sample is

    \sum_{w_1^m : c(w_1^m) > 0} P_T(w_1^m) = 1 - n_1 / N.    (6)

This, in turn, leads to the estimate for the probability of observing some previously unseen m-gram as the fraction n_1/N of "singletons" in the text:

    \sum_{w_1^m : c(w_1^m) = 0} P(w_1^m) = n_1 / N.    (7)

On the other hand,

    \sum_{w_1^m : c(w_1^m) > 0} P_{ML}(w_1^m) = 1,    (8)

so that

    n_1 / N = \sum_{w_1^m : c(w_1^m) > 0} [ P_{ML}(w_1^m) - P_T(w_1^m) ],    (9)

where, for an m-gram with count c = c(w_1^m),

    \delta_c = P_{ML}(w_1^m) - P_T(w_1^m) = (1 - d_c) c / N.    (10)

Thus,

    n_1 / N = \sum_{w_1^m : c(w_1^m) > 0} \delta_{c(w_1^m)}.    (11)

Our method is based on the interpretation of \delta_c in (11) as a "contribution" of an m-gram w_1^m with count c(w_1^m) to the probability of "unseen" m-grams. We go further with this interpretation for the estimation of conditional probabilities P(w_m | w_1^{m-1}). Assuming that the (m-1)-gram w_1^{m-1} has been observed in the sample text, we introduce an entity \tilde{\delta}_{c(w_1^m)}, analogous to \delta_c given by (10):

    \tilde{\delta}_{c(w_1^m)} = (1 - d_{c(w_1^m)}) c(w_1^m) / c(w_1^{m-1}).    (12)

We now define our estimator P_s(w_m | w_1^{m-1}) inductively as follows. Assume that the lower-level conditional estimator P_s(w_m | w_2^{m-1}) has been defined. Then, when c(w_1^{m-1}) > 0,

    \tilde{P}(w_m | w_1^{m-1}) = d_{c(w_1^m)} c(w_1^m) / c(w_1^{m-1})    (13)

gives the conditional probability estimate for the words w_m observed in the sample text after w_1^{m-1} (c(w_1^m) > 0). It is convenient to define a function \beta by

    \beta(w_1^{m-1}) = 1 - \sum_{w_m : c(w_1^m) > 0} \tilde{P}(w_m | w_1^{m-1}).    (14)

This gives an estimate of the sum of the conditional probabilities of all words w_m which never followed w_1^{m-1} (c(w_1^m) = 0). We distribute this probability "mass" \beta, defined by (14), among the w_m for which c(w_1^m) = 0, using the previously defined (by induction) lower-level conditional distribution P_s(w_m | w_2^{m-1}):

    P_s(w_m | w_1^{m-1}) = \alpha P_s(w_m | w_2^{m-1}),    (15)

where

    \alpha = \alpha(w_1^{m-1}) = [ 1 - \sum_{w_m : c(w_1^m) > 0} \tilde{P}(w_m | w_1^{m-1}) ] / [ 1 - \sum_{w_m : c(w_1^m) > 0} P_s(w_m | w_2^{m-1}) ]    (16)

is a normalizing constant. When c(w_1^{m-1}) = 0, we define

    P_s(w_m | w_1^{m-1}) = P_s(w_m | w_2^{m-1}).    (17)

Complementing the \tilde{P} and \beta definitions, given by (13) and (14), for the case c(w_1^{m-1}) = 0 with

    \tilde{P}(w_m | w_1^{m-1}) = 0    and    \beta(w_1^{m-1}) = 1,

we finally combine the estimates (13), (15), and (17) in the following recursive expression for the conditional probability distribution:

    P_s(w_m | w_1^{m-1}) = \tilde{P}(w_m | w_1^{m-1}) + \theta(\tilde{P}(w_m | w_1^{m-1})) \alpha(w_1^{m-1}) P_s(w_m | w_2^{m-1}),    (18)

where

    \theta(x) = 1 for x = 0,  and  \theta(x) = 0 otherwise.    (19)

We now propose a somewhat modified version of the distribution given by (18). We shall leave intact the estimate n_1/N for the probability of all unseen m-grams, and we shall not discount high counts c > k, considering them reliable. To achieve this, we redefine

    d_r = 1    for r > k,    (20)

and we correspondingly adjust Turing's original discount coefficients d_r for r <= k so that the equation expressing the balance between the "contributions" and the probability of unseen m-grams is satisfied:

    \sum_{r=1}^{k} n_r (1 - d_r) r / N = n_1 / N.    (21)

Equation (21) is analogous to (9). We obtain the adjusted d_r by looking for a solution of (21) of the form

    1 - d_r = \mu (1 - r^*/r),    1 <= r <= k,

where \mu is a constant. The unique solution is given by

    d_r = [ r^*/r - (k + 1) n_{k+1} / n_1 ] / [ 1 - (k + 1) n_{k+1} / n_1 ],    for 1 <= r <= k.
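To make the modified discounting scheme concrete, here is a sketch (in the same hypothetical Python style as above) of the adjusted coefficients: counts above k are left undiscounted as in (20), and for 1 <= r <= k the closed-form solution of the balance equation (21) is applied. The function name, and the assumption that n_1, ..., n_{k+1} are all nonzero, are choices of this sketch rather than the paper's.

    def adjusted_discounts(n, k):
        """Adjusted discount coefficients d_r for 1 <= r <= k (Eqs. (20)-(21)).

        n -- dict mapping r to n_r (counts of counts); assumes n_r > 0 for r = 1..k+1
        k -- counts larger than k are considered reliable and get d_r = 1
        """
        common = (k + 1) * n[k + 1] / n[1]        # the term (k + 1) n_{k+1} / n_1
        d = {}
        for r in range(1, k + 1):
            r_star = (r + 1) * n[r + 1] / n[r]    # Turing's modified count, Eq. (3)
            d[r] = (r_star / r - common) / (1 - common)
        return d

Coefficients for r > k are simply omitted from the dictionary; callers can treat a missing entry as d_r = 1.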

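Finally, a compact sketch of how the recursive estimator (13)-(18) could be assembled for the bigram case (m = 2), backing off to a plain maximum-likelihood unigram distribution at the lowest level. All names, data structures, and the toy usage are illustrative assumptions; the paper's IBM implementation is not shown in this preview.

    from collections import Counter

    def katz_bigram_estimator(tokens, discounts):
        """Return p(w, prev), a sketch of P_s(w | prev) following Eqs. (13)-(18).

        tokens    -- list of words (training text)
        discounts -- {r: d_r}; a missing entry is treated as d_r = 1 (Eq. (20))
        """
        unigrams = Counter(tokens)
        bigrams = Counter(zip(tokens, tokens[1:]))
        total = sum(unigrams.values())

        def p_lower(w):
            # Lowest-level distribution: ML unigram (the paper recurses one level further).
            return unigrams.get(w, 0) / total

        def p_katz(w, prev):
            if unigrams.get(prev, 0) == 0:
                return p_lower(w)                        # Eq. (17): unseen history, back off directly
            c_hist = unigrams[prev]
            c_bi = bigrams.get((prev, w), 0)
            if c_bi > 0:
                return discounts.get(c_bi, 1.0) * c_bi / c_hist   # Eq. (13): discounted estimate
            seen = [w2 for (w1, w2) in bigrams if w1 == prev]
            beta = 1.0 - sum(discounts.get(bigrams[(prev, w2)], 1.0) * bigrams[(prev, w2)] / c_hist
                             for w2 in seen)             # Eq. (14): freed conditional mass
            denom = 1.0 - sum(p_lower(w2) for w2 in seen)
            alpha = beta / denom if denom > 0 else 0.0   # Eq. (16): normalizing constant
            return alpha * p_lower(w)                    # Eq. (15): redistribute via lower level

        return p_katz

    # Toy usage: discount singleton bigrams by d_1 = 0.5
    p = katz_bigram_estimator("to be or not to be".split(), {1: 0.5})
    print(p("or", "be"))    # seen bigram:   0.5 * 1/2 = 0.25
    print(p("not", "be"))   # unseen bigram: alpha * P_lower("not") = 0.9 * 1/6 = 0.15

A fuller implementation would apply the same construction recursively at the unigram level (reserving the n_1/N mass for entirely unseen words) and would precompute beta and alpha per history rather than recomputing them on each query.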
