Contents
1. INTRODUCTION
2. MATRIX MODEL
   2.1 Full matrices on GPUs
   2.2 Sparse matrices on multicore processors
3. MAP-REDUCE MODEL
4. EVALUATION
   4.1 Technical notes
5. DISCUSSION AND FUTURE WORK
6. REFERENCES

Inclusion of large input corpora in Statistical Machine Translation

Bipin Suresh
Stanford University
[email protected]

ABSTRACT
In recent years, the availability of large, parallel, bilingual corpora has gone untapped by the statistical machine-learning community. The crux of the problem lies in the inherent linearity of the traditional machine-translation algorithms, which impedes easy inclusion of new, large input corpora. However, it has been speculated [1] that there exists a log-linear relationship between training-corpus size and performance on machine-translation tasks. In this paper, we cast the IBM Model-1 algorithm into three formats that allow it to use large input corpora, and then verify the performance-gain claims. The models we introduce provide a scalable architecture for incorporating further corpora easily and cheaply, which we then show translates into higher performance in machine-translation tasks.

Keywords
Natural language processing, Statistical machine translation, IBM Model-1

1. INTRODUCTION
IBM's Model-1 [2] is a word-alignment machine-translation model that makes use of parallel bilingual corpora. It assumes that a source sentence S of length l is translated into a target sentence T of length m by choosing, for every position j in the target sentence (1 ≤ j ≤ m), a word e_aj in the source sentence (which is augmented with a special NULL word) according to some alignment a, and then translating the word e_aj into a word f_j with the help of a translation model t(f, e).

Model-1 makes a few simplifying assumptions: first, it assumes that every possible length of the target sentence (below some arbitrary upper bound) has uniform probability ε; next, that all possible choices of generating words in the source sentence are equally likely; and finally, that the translation probability t(f, e) depends only on the source-language word:

    P(T | S) = ε / (l + 1)^m · Π_{j=1..m} Σ_{i=0..l} t(f_j | e_i)

The above equation, due to Brown et al. [2], provides an estimate of the probability of a source sentence S translating into a target sentence T. Since Model-1 makes the further simplifying assumption that each target word can be generated by only a single source word (including the NULL word), we can work out the best alignment vector a as follows:

    a_j = argmax_{0 ≤ i ≤ l} t(f_j | e_i),   for every 1 ≤ j ≤ m

The parameters of Model-1 are typically estimated using an Expectation-Maximization (EM) algorithm [3]. A straightforward implementation of the algorithm has a serious problem, though: since it has to iterate over every alignment of every single sentence pair, it quickly becomes computationally infeasible. A common method to overcome this shortcoming is explained in detail in [3]. Figure 1 summarizes the algorithm currently in common usage.

Model-1 has many shortcomings: (1) each target-sentence word can be generated by only one source word; (2) the position of any word in the target sentence is independent of the position of the corresponding word in the source sentence; and (3) the translation of one target word does not take into account how many other words the corresponding source word has already been translated to. Several improvements have been proposed [4] to counter these issues. In their original paper [2], Brown et al. also propose higher-order models (IBM Model-2, Model-3, Model-4 and Model-5) which attempt to attack these problems systematically.
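The pseudo-code of Figure 1 is not reproduced in this preview. Purely as a rough sketch of the commonly used formulation referred to above, in which expected counts are accumulated per source-target word pair instead of enumerating all alignments, the following Python code implements one plain (non-parallel) version of Model-1 EM training; the function name, the corpus layout, and the explicit 'NULL' token are illustrative assumptions rather than the authors' code.

from collections import defaultdict

def train_model1(corpus, n_iter=10):
    """EM training sketch for IBM Model-1.

    corpus: list of (source_sentence, target_sentence) pairs, each a list of tokens.
    Returns a dictionary t[(f, e)] approximating t(f, e), the probability of
    target word f given source word e (with a NULL source word added).
    """
    # Initialize t(f, e) uniformly over the target vocabulary.
    target_vocab = {f for _, tgt in corpus for f in tgt}
    uniform = 1.0 / len(target_vocab)
    t = defaultdict(lambda: uniform)

    for _ in range(n_iter):
        count = defaultdict(float)   # expected counts c(f, e)
        total = defaultdict(float)   # expected counts c(e)
        # E-step: instead of enumerating alignments, sum over source positions
        # for each target word, which keeps each sentence pair O(l * m).
        for src, tgt in corpus:
            src = ['NULL'] + src
            for f in tgt:
                z = sum(t[(f, e)] for e in src)
                for e in src:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        # M-step: re-normalize the expected counts to obtain the new t(f, e).
        t = defaultdict(float, {(f, e): c / total[e] for (f, e), c in count.items()})
    return t

# Toy usage (hypothetical data):
# corpus = [(['das', 'Haus'], ['the', 'house']), (['das', 'Buch'], ['the', 'book'])]
# t = train_model1(corpus)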
However, there is also a practical consideration that often goes unnoticed: Model-1 is essentially a linear algorithm, looping over every sentence pair of every document and, for each sentence pair, looping over every source-target word pair. This makes incorporating new, large corpora extremely expensive, both in terms of memory and of execution time. In this paper we attempt to remodel Model-1 to make it more parallelizable. Our contributions are as follows: we first cast the model into a matrix form, which enables us to compute scores in parallel using fast, parallelized matrix operations; we then model the matrices as sparse matrices to reduce the memory footprint while maintaining the gains in speed; finally, we cast the model into the Map-Reduce framework to explore parallelization across a distributed architecture of machines. The paper is organized roughly along these contributions.

Figure 1: IBM Model-1 pseudo-code

2. MATRIX MODEL
2.1 Full matrices on GPUs
Recent developments in graphics processing units (GPUs) have enabled large-scale matrix manipulation operations to be computed efficiently in parallel, and multiple parallel-computing architectures have been built to support application programmers. Two prominent approaches are Nvidia's CUDA and OpenCL.

To harness this power of the GPU, we cast the IBM Model-1 algorithm into a series of matrix manipulation operations. First, we compose the entire corpus as a 3-dimensional matrix, with the first dimension being the source words f, the second dimension being the target words e, and the last dimension being the documents. A cell docs(f, e, d) is the count of the number of times word f was translated to word e in document d. Our algorithm to compute the translation probabilities between word f and word e is as follows:

Figure 2: A matrix-operations-only model of IBM's Model-1

In the first two steps, we initialize the translation table between all words f and e with a uniform probability. Then, until convergence, we perform the following matrix operations: in step 4, we make an n-dimensional copy of our translation table, so that in step 5 we can do an element-wise matrix multiplication with the docs matrix. Note that element-wise matrix multiplication is inherently parallelizable, which makes step 5 very fast to compute. Next, in steps 6-8, we normalize the counts for every f. The function bsxfun is a MATLAB® command that applies an element-wise operation between a matrix and a vector, expanding the vector along its singleton dimensions. Again, since each of these operations can be done in parallel, very little time is spent on the actual execution of …
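The contents of Figure 2 are not included in this preview, so the exact matrix steps are not visible. Purely as an illustration of the kind of operations the text describes (uniform initialization, an implicit replication of the translation table, an element-wise multiply with the docs tensor, and a broadcast normalization standing in for bsxfun), here is a small NumPy sketch; the array sizes, the synthetic counts, and the per-f normalization are our own assumptions rather than the authors' algorithm.

import numpy as np

# Hypothetical sizes: F source words, E target words, D documents, with
# synthetic co-occurrence counts standing in for the real docs(f, e, d) tensor.
F, E, D = 200, 300, 20
rng = np.random.default_rng(0)
docs = rng.poisson(0.05, size=(F, E, D)).astype(np.float64)

# Steps 1-2: initialize the translation table t(f, e) uniformly.
t = np.full((F, E), 1.0 / E)

for _ in range(10):                      # "until convergence" in the text
    # Steps 4-5: replicate t along the document dimension and multiply
    # element-wise with docs; NumPy broadcasting makes the copy implicit,
    # and the multiply vectorizes (or maps onto a GPU library) naturally.
    weighted = t[:, :, None] * docs      # shape (F, E, D)

    # Collapse the document dimension to accumulate counts per (f, e) pair.
    counts = weighted.sum(axis=2)        # shape (F, E)

    # Steps 6-8: normalize the counts for every source word f; this broadcast
    # division plays the role of the bsxfun normalization described above
    # (the small constant only guards against empty rows in the synthetic data).
    t = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1e-12)

In NumPy, broadcasting plays the role that bsxfun and the explicit replication in step 4 play in MATLAB, so the n-dimensional copy of the translation table is never materialized in memory.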

