An Estimate of an Upper Bound for the Entropy of English

Peter F. Brown*, Vincent J. Della Pietra*, Robert L. Mercer*, Stephen A. Della Pietra*, Jennifer C. Lai*
IBM T.J. Watson Research Center

* P.O. Box 704, Yorktown Heights, NY 10598

© 1992 Association for Computational Linguistics

Abstract

We present an estimate of an upper bound of 1.75 bits for the entropy of characters in printed English, obtained by constructing a word trigram model and then computing the cross-entropy between this model and a balanced sample of English text. We suggest the well-known and widely available Brown Corpus of printed English as a standard against which to measure progress in language modeling and offer our bound as the first of what we hope will be a series of steadily decreasing bounds.

1. Introduction

We present an estimate of an upper bound for the entropy of characters in printed English. The estimate is the cross-entropy of the 5.96 million character Brown Corpus (Kucera and Francis 1967) as measured by a word trigram language model that we constructed from 583 million words of training text. We obtain an upper bound of 1.75 bits per character.

Since Shannon's 1951 paper, there have been a number of estimates of the entropy of English. Cover and King (1978) list an extensive bibliography. Our approach differs from previous work in that:

1. We use a much larger sample of English text; previous estimates were based on samples of at most a few hundred letters.
2. We use a language model to approximate the probabilities of character strings; previous estimates employed human subjects from whom probabilities were elicited through various clever experiments.
3. We predict all printable ASCII characters.

2. Method

Our estimate for the entropy bound is based upon the well-known fact that the cross-entropy of a stochastic process as measured by a model is an upper bound on the entropy of the process. In this section, we briefly review the relevant notions.

2.1 Entropy, Cross-Entropy, and Text Compression

Suppose X = {..., X_{-2}, X_{-1}, X_0, X_1, X_2, ...} is a stationary stochastic process over a finite alphabet. Let P denote the probability distribution of X and let E_P denote expectations with respect to P. The entropy of X is defined by

    H(X) \equiv H(P) \equiv -E_P \log P(X_0 \mid X_{-1}, X_{-2}, \ldots).    (1)

If the base of the logarithm is 2, then the entropy is measured in bits. It can be shown that H(P) can also be expressed as

    H(P) = \lim_{n \to \infty} -E_P \log P(X_0 \mid X_{-1}, X_{-2}, \ldots, X_{-n}) = \lim_{n \to \infty} -\frac{1}{n} E_P \log P(X_1 X_2 \cdots X_n).    (2)

If the process is ergodic, then the Shannon-McMillan-Breiman theorem (Algoet and Cover 1988) states that, almost surely,

    H(P) = \lim_{n \to \infty} -\frac{1}{n} \log P(X_1 X_2 \cdots X_n).    (3)

Thus, for an ergodic process, an estimate of H(P) can be obtained from a knowledge of P on a sufficiently long sample drawn randomly according to P.

When P is not known, an upper bound to H(P) can still be obtained from an approximation to P. Suppose that the stationary stochastic process M is a model for P. The cross-entropy of P as measured by M is defined by

    H(P, M) \equiv -E_P \log M(X_0 \mid X_{-1}, X_{-2}, \ldots).    (4)

Under suitable regularity conditions, it can be shown that

    H(P, M) = \lim_{n \to \infty} -E_P \log M(X_0 \mid X_{-1}, X_{-2}, \ldots, X_{-n}) = \lim_{n \to \infty} -\frac{1}{n} E_P \log M(X_1 X_2 \cdots X_n).    (5)

If P is ergodic, then it can be shown that, almost surely for P,

    H(P, M) = \lim_{n \to \infty} -\frac{1}{n} \log M(X_1 X_2 \cdots X_n).    (6)

The cross-entropy H(P, M) is relevant to us since it is an upper bound on the entropy H(P). That is, for any model M,

    H(P) \le H(P, M).    (7)
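To make Equation (7) concrete, here is a minimal Python sketch (not from the paper) for a memoryless Bernoulli source, the special case in which the conditional probabilities of Equations (1) and (4) reduce to marginals: the cross-entropy of P as measured by any model M is never below the entropy of P, and the two coincide only when M equals P.

```python
import math

def entropy(p):
    """Entropy in bits of a Bernoulli(p) source: H(P) = -sum_x P(x) log2 P(x)."""
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)

def cross_entropy(p, m):
    """Cross-entropy in bits of Bernoulli(p) as measured by the model Bernoulli(m):
    H(P, M) = -E_P log2 M(X) = -(p log2 m + (1 - p) log2 (1 - m))."""
    return -(p * math.log2(m) + (1 - p) * math.log2(1 - m))

p = 0.2  # the true source emits a 1 with probability 0.2
for m in (0.5, 0.3, 0.2):  # three candidate models; the last matches the source exactly
    print(f"m = {m:.1f}:  H(P) = {entropy(p):.3f}  <=  H(P, M) = {cross_entropy(p, m):.3f} bits")
```

This inequality is what licenses reading the 1.75 bits per character figure as an upper bound: any inaccuracy in the word trigram model can only inflate the measured cross-entropy, never push it below the true entropy.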
The difference between H(P, M) and H(P) is a measure of the inaccuracy of the model M. More accurate models yield better upper bounds on the entropy. Combining Equations (6) and (7), we see that, almost surely for P,

    H(P) \le \lim_{n \to \infty} -\frac{1}{n} \log M(X_1 X_2 \cdots X_n).    (8)

Entropy and cross-entropy can be understood from the perspective of text compression. It is well known that for any uniquely decodable coding scheme (Cover and Thomas 1991),

    E_P \, l(X_1 X_2 \cdots X_n) \ge -E_P \log P(X_1 X_2 \cdots X_n),    (9)

where l(X_1 X_2 \cdots X_n) is the number of bits in the encoding of the string X_1 X_2 \cdots X_n. Combining Equations (2) and (9), we see that H(P) is a lower bound on the average number of bits per symbol required to encode a long string of text drawn from P:

    H(P) \le \lim_{n \to \infty} \frac{1}{n} E_P \, l(X_1 X_2 \cdots X_n).    (10)

On the other hand, an arithmetic coding scheme (Bell, Cleary, and Witten 1990) using model M will encode the sequence x_1 x_2 \cdots x_n in

    l_M(x_1 x_2 \cdots x_n) = \lceil -\log M(x_1 x_2 \cdots x_n) + 1 \rceil    (11)

bits, where \lceil r \rceil denotes the smallest integer not less than r. Combining Equations (6) and (11), we see that H(P, M) is the number of bits per symbol achieved by using model M to encode a long string of text drawn from P:

    H(P, M) = \lim_{n \to \infty} \frac{1}{n} l_M(X_1 X_2 \cdots X_n).    (12)

2.2 The Entropy Bound

We view printed English as a stochastic process over the alphabet of 95 printable ASCII characters. This alphabet includes, for example, all uppercase and lowercase letters, all digits, the blank, and all punctuation characters. Using Equation (8), we can estimate an upper bound on the entropy of characters in English as follows (a toy sketch of this procedure is given below):

1. Construct a language model M over finite strings of characters.
2. Collect a reasonably long test sample of English text.
3. Then

    H(\text{English}) \le -\frac{1}{n} \log M(\text{test sample}),    (13)

where n is the number of characters in the sample.

We emphasize that for this paradigm to be reasonable, the language model M must be constructed without knowledge of the test sample. Without this proscription, one might, for example, construct a model that assigns probability one to the test sample and zero to any other character string of the same length. Even quite subtle use of knowledge of the test sample can have a profound effect on the cross-entropy. For example, the cross-entropy would be noticeably lower had we restricted ourselves to characters that appear in the test sample rather than to all printable ASCII characters, and it would be lower still had we used the actual vocabulary of the test sample. But these values could not be trumpeted as upper bounds to the entropy of English, since Equation (13) would no longer be valid.

3. The Language Model

In this section, we describe our language model. The model is very simple: it captures the structure of English only through token trigram frequencies.
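The toy sketch referred to in Section 2.2 follows. It is a minimal Python rendering of Equation (13), not the authors' method: a character bigram model with add-one smoothing stands in for the word trigram model, and tiny strings stand in for the 583 million words of training text and the Brown Corpus test sample. The function names, the smoothing scheme, and the toy data are all assumptions made for this illustration.

```python
import math
from collections import Counter

PRINTABLE = [chr(c) for c in range(32, 127)]  # the 95 printable ASCII characters

def train_bigram(text):
    """Count character bigrams and contexts so M(c | prev) can be estimated
    with add-one smoothing over the printable-ASCII alphabet."""
    pair_counts = Counter(zip(text, text[1:]))
    context_counts = Counter(text[:-1])
    return pair_counts, context_counts

def bits_per_character(model, test):
    """Estimate the right-hand side of Equation (13): -(1/n) log2 M(test sample),
    charged per predicted character (the first character is not charged here)."""
    pair_counts, context_counts = model
    total_bits = 0.0
    for prev, cur in zip(test, test[1:]):
        # Add-one smoothed conditional probability over the 95 printable characters.
        prob = (pair_counts[(prev, cur)] + 1) / (context_counts[prev] + len(PRINTABLE))
        total_bits -= math.log2(prob)
    return total_bits / (len(test) - 1)

# Toy stand-ins: the paper trains on 583 million words and tests on the
# 5.96-million-character Brown Corpus; a few short sentences suffice here
# to show the bookkeeping.
training_text = "the cat sat on the mat. the dog sat on the log. "
test_sample = "the cat sat on the log."
model = train_bigram(training_text)
print(f"upper-bound estimate: {bits_per_character(model, test_sample):.2f} bits per character")
```

The crucial discipline from Section 2.2 is preserved: the model is built from the training string alone and never sees the test sample, so the printed figure is a legitimate, if very loose, upper bound in the sense of Equation (13).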

