Contents
1. INTRODUCTION
2. MATRIX MODEL
   2.1 Full matrices on GPUs
   2.2 Sparse matrices on multicore processors
3. MAP-REDUCE MODEL
4. EVALUATION
   4.1 Technical notes
5. DISCUSSION AND FUTURE WORK
6. REFERENCES

Inclusion of large input corpora in Statistical Machine Translation

Bipin Suresh
Stanford University
[email protected]

ABSTRACT
In recent years, the availability of large, parallel, bilingual corpora has gone untapped by the statistical machine-learning community. The crux of the problem lies in the inherent linearity of the traditional machine-translation algorithms, which impedes easy inclusion of new, large input corpora. However, it has been speculated [1] that there exists a log-linear relationship between training-corpus size and performance on machine-translation tasks. In this paper, we cast the IBM Model-1 algorithm into three formats that allow it to use large input corpora, and then verify the performance-gain claims. The models we introduce provide a scalable architecture for incorporating further corpora easily and cheaply, which we then show translates into higher performance in machine-translation tasks.

Keywords
Natural language processing, Statistical machine translation, IBM Model-1

1. INTRODUCTION
IBM's Model-1 [2] is a word-alignment machine-translation model that makes use of parallel bilingual corpora. It assumes that a source sentence S of length l is translated into a target sentence T of length m by choosing, for every position j in the target sentence (1 ≤ j ≤ m), a word e_aj in the source sentence (which is augmented with a special NULL word) according to some alignment a, and then translating the word e_aj into a word f_j with the help of a translation model t(f, e).

Model-1 makes a few simplifying assumptions: first, it assumes that every possible length of the target sentence (below some arbitrary upper bound) has uniform probability ε; next, that all possible choices of generating words in the source sentence are equally likely; and finally, that the translation probability t(f, e) depends only on the source-language word:

    P(T | S) = ε / (l + 1)^m · Π_{j=1..m} Σ_{i=0..l} t(f_j | e_i)

The above equation, due to Brown et al. [2], provides an estimate of the probability of a source sentence S translating into a target sentence T. Since Model-1 makes the further simplifying assumption that each target word can be generated by only a single source word (including the NULL word), we can work out the best alignment vector a as follows:

    a_j = argmax_{0 ≤ i ≤ l} t(f_j | e_i),   for every 1 ≤ j ≤ m

The parameters of Model-1 are typically estimated using an Expectation-Maximization (EM) algorithm [3]. A straightforward implementation of the algorithm has a serious problem, though: since it has to iterate over every alignment of every single sentence pair, it quickly becomes computationally infeasible. A common method to overcome this shortcoming is explained in detail in [3]. Figure 1 summarizes the algorithm currently in common usage.

Model-1 has many shortcomings: (1) each target-sentence word can be generated by only one source word; (2) the position of any word in the target sentence is independent of the position of the corresponding word in the source sentence; and (3) the translation of one target word does not take into account how many other words the corresponding source word has already been translated to. Several improvements have been proposed [4] to counter these issues. In their original paper [2], Brown et al. also propose higher-order models (IBM Model-2, Model-3, Model-4 and Model-5) which attempt to attack these problems systematically.
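The pseudo-code of Figure 1 is not reproduced in this preview. Purely as a rough sketch of the commonly used formulation referred to above, in which expected counts are accumulated per source-target word pair instead of enumerating all alignments, the following Python code implements one plain (non-parallel) version of Model-1 EM training; the function name, the corpus layout, and the explicit 'NULL' token are illustrative assumptions rather than the authors' code.

from collections import defaultdict

def train_model1(corpus, n_iter=10):
    """EM training sketch for IBM Model-1.

    corpus: list of (source_sentence, target_sentence) pairs, each a list of tokens.
    Returns a dictionary t[(f, e)] approximating t(f, e), the probability of
    target word f given source word e (with a NULL source word added).
    """
    # Initialize t(f, e) uniformly over the target vocabulary.
    target_vocab = {f for _, tgt in corpus for f in tgt}
    uniform = 1.0 / len(target_vocab)
    t = defaultdict(lambda: uniform)

    for _ in range(n_iter):
        count = defaultdict(float)   # expected counts c(f, e)
        total = defaultdict(float)   # expected counts c(e)
        # E-step: instead of enumerating alignments, sum over source positions
        # for each target word, which keeps each sentence pair O(l * m).
        for src, tgt in corpus:
            src = ['NULL'] + src
            for f in tgt:
                z = sum(t[(f, e)] for e in src)
                for e in src:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        # M-step: re-normalize the expected counts to obtain the new t(f, e).
        t = defaultdict(float, {(f, e): c / total[e] for (f, e), c in count.items()})
    return t

# Toy usage (hypothetical data):
# corpus = [(['das', 'Haus'], ['the', 'house']), (['das', 'Buch'], ['the', 'book'])]
# t = train_model1(corpus)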
However, there is also a practical consideration that often goes unnoticed: Model-1 is essentially a linear algorithm, looping over every sentence pair of every document and, for each sentence pair, looping over every source-target word pair. This makes incorporating new, large corpora extremely expensive, both in terms of memory and of execution time. In this paper we attempt to remodel Model-1 to make it more parallelizable. Our contributions are as follows: we first cast the model into a matrix form, which enables us to compute scores in parallel using fast, parallelized matrix operations; we then model the matrices as sparse matrices to reduce the memory footprint while maintaining the gains in speed; finally, we cast the model into the Map-Reduce framework to explore parallelization across a distributed architecture of machines. The paper is organized roughly along these contributions.

Figure 1: IBM Model-1 pseudo-code

2. MATRIX MODEL
2.1 Full matrices on GPUs
Recent developments in graphics processing units (GPUs) have enabled large-scale matrix manipulation operations to be computed efficiently in parallel, and multiple parallel-computing architectures have been built to support application programmers. Two prominent approaches are Nvidia's CUDA and OpenCL.

To harness this power of the GPU, we cast the IBM Model-1 algorithm into a series of matrix manipulation operations. First, we compose the entire corpus as a 3-dimensional matrix, with the first dimension being the source words f, the second dimension being the target words e, and the last dimension being the documents. A cell docs(f, e, d) is the count of the number of times word f was translated to word e in document d. Our algorithm to compute the translation probabilities between word f and word e is as follows:

Figure 2: A matrix-operations-only model of IBM's Model-1

In the first two steps, we initialize the translation table between all words f and e with a uniform probability. Then, until convergence, we perform the following matrix operations: in step 4, we make an n-dimensional copy of our translation table, so that in step 5 we can do an element-wise matrix multiplication with the docs matrix. Note that element-wise matrix multiplication is inherently parallelizable, which makes step 5 very fast to compute. Next, in steps 6-8, we normalize the counts for every f. The function bsxfun is a MATLAB® command that applies an element-wise operation between a matrix and a vector, expanding the vector along its singleton dimensions. Again, since each of these operations can be done in parallel, very little time is spent on the actual execution of …
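The contents of Figure 2 are not included in this preview, so the exact matrix steps are not visible. Purely as an illustration of the kind of operations the text describes (uniform initialization, an implicit replication of the translation table, an element-wise multiply with the docs tensor, and a broadcast normalization standing in for bsxfun), here is a small NumPy sketch; the array sizes, the synthetic counts, and the per-f normalization are our own assumptions rather than the authors' algorithm.

import numpy as np

# Hypothetical sizes: F source words, E target words, D documents, with
# synthetic co-occurrence counts standing in for the real docs(f, e, d) tensor.
F, E, D = 200, 300, 20
rng = np.random.default_rng(0)
docs = rng.poisson(0.05, size=(F, E, D)).astype(np.float64)

# Steps 1-2: initialize the translation table t(f, e) uniformly.
t = np.full((F, E), 1.0 / E)

for _ in range(10):                      # "until convergence" in the text
    # Steps 4-5: replicate t along the document dimension and multiply
    # element-wise with docs; NumPy broadcasting makes the copy implicit,
    # and the multiply vectorizes (or maps onto a GPU library) naturally.
    weighted = t[:, :, None] * docs      # shape (F, E, D)

    # Collapse the document dimension to accumulate counts per (f, e) pair.
    counts = weighted.sum(axis=2)        # shape (F, E)

    # Steps 6-8: normalize the counts for every source word f; this broadcast
    # division plays the role of the bsxfun normalization described above
    # (the small constant only guards against empty rows in the synthetic data).
    t = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1e-12)

In NumPy, broadcasting plays the role that bsxfun and the explicit replication in step 4 play in MATLAB, so the n-dimensional copy of the translation table is never materialized in memory.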

