A STATISTICAL APPROACH TO MACHINE TRANSLATION Peter F. Brown, John Cocke, Stephen A. Della Pietra, Vincent J. Della Pietra, Fredrick Jelinek, John D. Lafferty, Robert L. Mercer, and Paul S. Roossin IBM Thomas J. Watson Research Center Yorktown Heights, NY In this paper, we present a statistical approach to machine translation. We describe the application of our approach to translation from French to English and give preliminary results. 1 INTRODUCTION The field of machine translation is almost as old as the modern digital computer. In 1949 Warren Weaver sug- gested that the problem be attacked with statistical meth- ods and ideas from information theory, an area which he, Claude Shannon, and others were developing at the time (Weaver 1949). Although researchers quickly abandoned this approach, advancing numerous theoretical objections, we believe that the true obstacles lay in the relative impo- tence of the available computers and the dearth of machine- readable text from which to gather the statistics vital to such an attack. Today, computers are five orders of magni- tude faster than they were in 1950 and have hundreds of millions of bytes of storage. Large, machine-readable cor- pora are readily available. Statistical methods have proven their value in automatic speech recognition (Bahl et al. 1983) and have recently been applied to lexicography (Sinclair 1985) and to natural language processing (Baker 1979; Ferguson 1980; Garside et al. 1987; Sampson 1986; Sharman et al. 1988). We feel that it is time to give them a chance in machine translation. The job of a translator is to render in one language the meaning expressed by a passage of text in another lan- guage. This task is not always straightforward. For exam- ple, the translation of a word may depend on words quite far from it. Some English translators of Proust's seven volume work A la Recherche du Temps Perdu have striven to make the first word of the first volume the same as the last word of the last volume because the French original begins and ends with the same word (Bernstein 1988). Thus, in its most highly developed form, translation in- volves a careful study of the original text and may even encompass a detailed analysis of the author's life and circumstances. We, of course, do not hope to reach these pinnacles of the translator's art. In this paper, we consider only the translation of individ- ual sentences. Usually, there are many acceptable transla- tions of a particular sentence, the choice among them being largely a matter of taste. We take the view that every sentence in one language is a possible translation of any sentence in the other. We assign to every pair of sentences (S, T) a probability, Pr(TIS), to be interpreted as the probability that a translator will produce T in the target language when presented with S in the source language. We expect Pr(TIS) to be very small for pairs like (Le matin je me brosse les dents lPresident Lincoln was a good lawyer) and relatively large for pairs like (Le president Lincoln btait un bon avocat l President Lincoln was a good lawyer). We view the problem of machine translation then as follows. Given a sentence T in the target language, we seek the sentence S from which the translator produced T. We know that our chance of error is minimized by choosing that sentence S that is most probable given T. Thus, we wish to choose S so as to maximize Pr(SI T). Using Bayes' theorem, we can write Pr (S) Pr (TIS) Pr (SI T) = Pr (T) The denominator on the right of this equation does not depend on S, and so it suffices to choose the S that maxi- mizes the product Pr(S)Pr(TIS). Call the first factor in this product the language model probability of S and the second factor the translation probability of T given S. Although the interaction of these two factors can be quite profound, it may help the reader to think of the translation probability as suggesting words from the source language that might have produced the words that we observe in the target sentence and to think of the language model proba- bility as suggesting an order in which to place these source words. Thus, as illustrated in Figure 1, a statistical translation system requires a method for computing language model probabilities, a method for computing translation probabil- ities, and, finally, a method for searching among possible source sentences S for the one that gives the greatest value for Pr(S)Pr( TIS). In the remainder of this paper we describe a simple version of such a system that we have implemented. In the Computational Linguistics Volume 16, Number 2, June 1990 79Peter F. Brown et al. A Statistical Approach to Machine Translation Source Language Model S t TranslatiOnModel T Pr(S) x Pr(TIS) = Pr(S,T) A Source Language Model and a Translation Model furnish a probability distribution over source-target sentence pairs (S,T). The joint probability Pz (S, T) of the pair (S, T) is the product of the probability Pr (S) computed by the language model and the conditional probability Pr (T I S) computed by the translation model. The parameters of these models are estimated automatically f~om a large database of source-target sentence pairs using a statistical aigoritlim which optimizes, in an appropriate sense, the fit between the models and the data. T ~ Decoder = argmaxPr(S I T ) = argmaxPr (S,T) s s A Decoder performs the actual translation. Given a sentence T in the target language, the decoder chooses a viable translation by selecting that sentence in the source langnage for which the probability Pr (S [ T) is maximum. Figure 1 A Statistical Machine Translation System. next section we describe our language model for Pr(S), and in Section 3 we describe our translation model for Pr(T[S). In Section 4 we describe our search procedure. In Section 5 we explain how we estimate the parameters of our models from a large database of translated text. In Section 6 we describe the results of two experiments we performed using these models. Finally, in Section 7 we conclude with a discussion of some improvements that we intend to imple- ment. 2 THE LANGUAGE MODEL Given a word string, sis 2 ... s n, we can, without loss of generality, write Pr (&s2... s,) = Pr (Sl) Pr (s2[sl)... Pr
View Full Document