IEEE TRANSACTIONS ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL. ASSP-35, NO. 3, MARCH 1987

Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer

SLAVA M. KATZ

Abstract: The description of a novel type of m-gram language model is given. The model offers, via a nonlinear recursive procedure, a computation- and space-efficient solution to the problem of estimating probabilities from sparse data. This solution compares favorably to other proposed methods. While the method has been developed for and successfully implemented in the IBM Real Time Speech Recognizers, its generality makes it applicable in other areas where the problem of estimating probabilities from sparse data arises.

(Manuscript received May 1, 1986; revised September 2, 1986. The author is with the Continuous Speech Recognition Group, Department of Computer Sciences, IBM T. J. Watson Research Center, Yorktown Heights, NY 10598. IEEE Log Number 8611734.)

Sparseness of data is an inherent property of any real text, and it is a problem one always encounters while collecting frequency statistics on words and word sequences (m-grams) from a text of finite size. This means that even for a very large data collection, the maximum likelihood estimation method does not allow us to estimate adequately the probabilities of rare but nevertheless possible word sequences: many sequences occur only once ("singletons"), and many more do not occur at all. The inadequacy of the maximum likelihood estimator and the necessity of estimating the probabilities of m-grams which did not occur in the text constitute the essence of the problem.

The main idea of the proposed solution is to reduce the unreliable probability estimates given by the observed frequencies and to redistribute the "freed" probability "mass" among the m-grams which never occurred in the text. The reduction is achieved by replacing the maximum likelihood estimates for m-grams having low counts with renormalized Turing's estimates [1], and the redistribution is done via the recursive utilization of lower level conditional distributions. We found Turing's method attractive because of its simplicity and its characterization as the optimal empirical Bayes' estimator of a multinomial probability. Robbins in [2] introduces the empirical Bayes' methodology, and Nadas in [3] gives various derivations of Turing's formula.

Let $N$ be the sample text size and let $n_r$ be the number of words (m-grams) which occurred in the text exactly $r$ times, so that

$$N = \sum_r r\,n_r. \tag{1}$$

Turing's estimate $P_T$ for the probability of a word (m-gram) which occurred in the sample $r$ times is

$$P_T = \frac{r^*}{N} \tag{2}$$

where

$$r^* = (r + 1)\,\frac{n_{r+1}}{n_r}. \tag{3}$$

We call the procedure of replacing a count $r$ with a modified count $r'$ "discounting," and the ratio $r'/r$ a discount coefficient $d_r$. When $r' = r^*$, we have Turing's discounting.

Let us denote the m-gram $w_1, \ldots, w_m$ as $w_1^m$ and the number of times it occurred in the sample text as $c(w_1^m)$. Then the maximum likelihood estimate is

$$P_{ML}(w_1^m) = \frac{c(w_1^m)}{N} \tag{4}$$

and Turing's estimate is

$$P_T(w_1^m) = \frac{c^*(w_1^m)}{N} \tag{5}$$

where

$$c^*(x) = (c(x) + 1)\,\frac{n_{c(x)+1}}{n_{c(x)}}.$$
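To make the count-of-counts bookkeeping behind (1)-(3) and (5) concrete, the following is a minimal sketch in Python; it is not part of the original paper, and the function name is hypothetical:

    from collections import Counter

    def turing_estimates(counts):
        # counts: dict mapping each observed word (m-gram) to its sample count.
        N = sum(counts.values())                    # N = sum_r r * n_r, eq. (1)
        n = Counter(counts.values())                # n[r] = n_r, the count-of-counts
        p = {}
        for w, r in counts.items():
            if n[r + 1] > 0:
                r_star = (r + 1) * n[r + 1] / n[r]  # r* = (r+1) n_{r+1}/n_r, eq. (3)
            else:
                r_star = r                          # no higher count observed: leave r undiscounted
            p[w] = r_star / N                       # P_T = r*/N, eqs. (2) and (5)
        return p

For instance, if $n_1 = 3$ and $n_2 = 1$, a singleton's count is discounted from $1$ to $r^* = 2 \cdot (1/3) \approx 0.67$.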
It follows from (1)-(3) that the total probability estimate, using (5), for the set of words (m-grams) that actually occurred in the sample is

$$\sum_{w_1^m:\,c(w_1^m) > 0} P_T(w_1^m) = 1 - \frac{n_1}{N}. \tag{6}$$

This, in turn, leads to the estimate for the probability of observing some previously unseen m-gram as the fraction $n_1/N$ of "singletons" in the text:

$$\sum_{w_1^m:\,c(w_1^m) = 0} P_T(w_1^m) = \frac{n_1}{N}. \tag{7}$$

On the other hand, grouping the observed m-grams by their count values,

$$\sum_{w_1^m:\,c(w_1^m) > 0} P_T(w_1^m) = 1 - \frac{1}{N} \sum_{r \ge 1} (1 - d_r)\,r\,n_r, \tag{8}$$

so that, comparing (6) and (8),

$$n_1 = \sum_{r \ge 1} (1 - d_r)\,r\,n_r. \tag{9}$$

Introducing the quantity

$$\delta_c = (1 - d_c)\,\frac{c}{N}, \tag{10}$$

we can thus rewrite (9) as

$$\frac{n_1}{N} = \sum_{w_1^m:\,c(w_1^m) > 0} \delta_{c(w_1^m)}. \tag{11}$$

Our method is based on the interpretation of $\delta_c$ in (11) as the "contribution" of an m-gram $w_1^m$ with count $c(w_1^m)$ to the probability of "unseen" m-grams. We go further with this interpretation for the estimation of conditional probabilities $P(w_m \mid w_1^{m-1})$. Assuming that the $(m-1)$-gram $w_1^{m-1}$ has been observed in the sample text, we introduce a quantity $\tilde\delta_{c(w_1^m)}$, analogous to the $\delta_c$ given by (10):

$$\tilde\delta_{c(w_1^m)} = (1 - d_{c(w_1^m)})\,\frac{c(w_1^m)}{c(w_1^{m-1})}. \tag{12}$$

We now define our estimator $P_s(w_m \mid w_1^{m-1})$ inductively as follows. Assume that the lower level conditional estimator $P_s(w_m \mid w_2^{m-1})$ has been defined. Then, when $c(w_1^{m-1}) > 0$,

$$\tilde P(w_m \mid w_1^{m-1}) = d_{c(w_1^m)}\,\frac{c(w_1^m)}{c(w_1^{m-1})} \tag{13}$$

gives the conditional probability estimate for the words $w_m$ observed in the sample text after $w_1^{m-1}$ ($c(w_1^m) > 0$). It is convenient to define a function $\beta$ by

$$\beta(w_1^{m-1}) = 1 - \sum_{w_m:\,c(w_1^m) > 0} \tilde P(w_m \mid w_1^{m-1}). \tag{14}$$

This gives an estimate of the sum of the conditional probabilities of all words $w_m$ which never followed $w_1^{m-1}$ ($c(w_1^m) = 0$). We distribute this probability "mass" $\beta$, defined by (14), among the $w_m$ for which $c(w_1^m) = 0$, using the previously defined (by induction) lower level conditional distribution $P_s(w_m \mid w_2^{m-1})$:

$$P_s(w_m \mid w_1^{m-1}) = \alpha\,P_s(w_m \mid w_2^{m-1}) \tag{15}$$

where

$$\alpha = \alpha(w_1^{m-1}) = \frac{1 - \sum_{w_m:\,c(w_1^m) > 0} \tilde P(w_m \mid w_1^{m-1})}{1 - \sum_{w_m:\,c(w_1^m) > 0} P_s(w_m \mid w_2^{m-1})} \tag{16}$$

is a normalizing constant. When $c(w_1^{m-1}) = 0$, we define

$$P_s(w_m \mid w_1^{m-1}) = P_s(w_m \mid w_2^{m-1}). \tag{17}$$

Complementing the $\tilde P$ and $\beta$ definitions, given by (13) and (14), for the case when $c(w_1^{m-1}) = 0$, with

$$\tilde P(w_m \mid w_1^{m-1}) = 0 \quad \text{and} \quad \beta(w_1^{m-1}) = 1,$$

we finally combine the estimates (13), (15), and (17) in the following recursive expression for the conditional probability distribution:

$$P_s(w_m \mid w_1^{m-1}) = \tilde P(w_m \mid w_1^{m-1}) + \theta(\tilde P(w_m \mid w_1^{m-1}))\,\alpha(w_1^{m-1})\,P_s(w_m \mid w_2^{m-1}) \tag{18}$$

where

$$\theta(x) = \begin{cases} 1, & x = 0 \\ 0, & \text{otherwise}. \end{cases} \tag{19}$$

We now propose a somewhat modified version of the distribution given by (18). We shall leave intact the estimate $n_1/N$ for the probability of all unseen m-grams, and we shall not discount the high values of counts $c > k$, considering them reliable. To achieve this, we redefine

$$d_r = 1, \quad \text{for } r > k, \tag{20}$$

and we correspondingly adjust Turing's original discount coefficients $d_r$ for $r \le k$ so that the equation expressing the balance between the "contributions" and the probability of unseen m-grams,

$$\sum_{r=1}^{k} (1 - d_r)\,r\,n_r = n_1, \tag{21}$$

is satisfied. Equation (21) is analogous to (9). We obtain an adjustment of the $d_r$ by looking for a solution of (21) of the form

$$1 - d_r = \mu \left( 1 - \frac{r^*}{r} \right), \quad 1 \le r \le k,$$

where $\mu$ is a constant. The unique solution is given by

$$d_r = \frac{\dfrac{r^*}{r} - \dfrac{(k+1)\,n_{k+1}}{n_1}}{1 - \dfrac{(k+1)\,n_{k+1}}{n_1}}, \quad \text{for } 1 \le r \le k. \tag{22}$$
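To illustrate how the adjusted coefficients of (20)-(22) would be computed in practice, here is a small Python sketch; the helper name is hypothetical and not from the paper, and it assumes the count-of-counts $n_r$ have already been collected and a cutoff $k$ is supplied by the user:

    def katz_discounts(n, k):
        # n: count-of-counts, n[r] = n_r = number of m-grams occurring exactly r times.
        # Returns the adjusted discount coefficients d_r of eq. (22) for 1 <= r <= k;
        # by eq. (20), d_r = 1 for all r > k.
        mu = (k + 1) * n.get(k + 1, 0) / n[1]          # the constant (k+1) n_{k+1} / n_1
        d = {}
        for r in range(1, k + 1):
            r_star = (r + 1) * n.get(r + 1, 0) / n[r]  # Turing's adjusted count, eq. (3)
            d[r] = (r_star / r - mu) / (1 - mu)        # eq. (22)
        return d

For example, with $n_1 = 100$, $n_2 = 30$, $n_3 = 12$, and $k = 2$, one gets $\mu = 3 \cdot 12/100 = 0.36$ and $d_1 = (0.6 - 0.36)/(1 - 0.36) = 0.375$, so each singleton contributes $(1 - d_1)/N = 0.625/N$ to the unseen mass.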
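Finally, a compact sketch of the full recursion (13)-(19), again in Python. This is an illustrative rendering under simplifying assumptions rather than the implementation used in the IBM recognizers: the function name is hypothetical, a single discount table d is shared across all m-gram orders, and the induction is grounded in the plain unigram relative frequency.

    def p_katz(w, context, counts, d):
        # P_s(w | context), eq. (18). `counts` maps word tuples of every order
        # to their sample counts c(.); `d` maps a count r to its discount
        # coefficient d_r (missing keys mean d_r = 1, i.e., r > k, eq. (20)).
        if not context:
            # Base of the induction: unigram relative frequency (a simplifying
            # assumption; the paper only requires some lower level estimator).
            N = sum(c for gram, c in counts.items() if len(gram) == 1)
            return counts.get((w,), 0) / N
        if counts.get(context, 0) == 0:
            return p_katz(w, context[1:], counts, d)          # eq. (17)
        c_ctx = counts[context]
        c_full = counts.get(context + (w,), 0)
        if c_full > 0:
            return d.get(c_full, 1.0) * c_full / c_ctx        # eq. (13)
        # Words that did follow the context, needed for beta (14) and alpha (16).
        seen = [g[-1] for g in counts
                if len(g) == len(context) + 1 and g[:-1] == context]
        beta = 1.0 - sum(d.get(counts[context + (u,)], 1.0)
                         * counts[context + (u,)] / c_ctx for u in seen)
        denom = 1.0 - sum(p_katz(u, context[1:], counts, d) for u in seen)
        alpha = beta / denom if denom > 0.0 else 0.0          # eq. (16)
        return alpha * p_katz(w, context[1:], counts, d)      # eq. (15)

For instance, with counts = {("a",): 3, ("b",): 2, ("c",): 1, ("a", "b"): 2, ("a", "c"): 1, ("b", "a"): 2} and d = {1: 0.5, 2: 0.75}, the call p_katz("c", ("b",), counts, d) backs off through the unseen bigram ("b", "c") and returns (0.25/0.5) * (1/6), approximately 0.083.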