Stanford CS 224 - Lecture Notes

Word Alignment in Bilingual Parallel Corpora
Mohith Julapalli, Sameer Dhond
Natural Language Processing Final Project, Spring 2003

Introduction

This project attempts the task of word alignment using parallel text in two languages. Specifically, it tries to build a bilingual dictionary from a set of French and English parallel texts obtained from the Hansard corpus. Such a word alignment model is useful as a component of a larger machine translation system.

Initially, we intended to implement the algorithms of Kay and Roscheisen (1993) and Stanley F. Chen (1997) and compare the results. Both are sentence-alignment algorithms that produce word-alignment information as a byproduct. However, after implementing the method of Kay and Roscheisen, we found that while it performed reasonably well on its intended task of sentence alignment, it did not produce as comprehensive a word-alignment model as we had hoped. Although there were various threshold parameters we could tune to produce more words, the accuracy then degraded considerably. In addition, the algorithm did not scale well to larger data sets, which made it even harder to generate a large bilingual dictionary.

We therefore decided to use an algorithm described by Dan Melamed (1997) designed specifically for aligning words. This method outperformed Kay and Roscheisen both in accuracy (perhaps because it receives aligned sentences as input rather than generating the alignments itself) and in the size of the dictionary it produced. We could successfully run this algorithm on 10,000 sentences in each language, whereas the Kay method was limited to roughly 1,000. Both algorithms make a one-to-one assumption: Kay and Roscheisen at both the sentence and word level, and Melamed at the word level. While this assumption is not completely accurate, it makes the problem tractable and produces decent results. Later, we discuss some of the erroneous translations our algorithms produced as a result of this assumption.
Dictionary Evaluation

We wrote our algorithms to output tab-delimited bilingual dictionaries. Initially, we intended to calculate accuracy automatically using an English-French text dictionary, supplemented with some online web dictionaries (which we saved into text files to avoid overloading web servers). However, this did not work well, because most of the Hansard data is conversational. In addition, words appear in different morphological forms, making it difficult to automate the verification. Instead, we manually checked our dictionaries, randomly sampling a few hundred words when the data became too large.

Algorithm 1: Kay and Roscheisen

In this algorithm, words in the two languages are aligned based on a similarity score, i.e. how similar their distributions across the sentences are.

Data structures:

Alignable Sentence Table (AST)
This is a 2D chart indicating which sentences in one language could possibly correspond to sentences in the other. It is initialized with fixed points at (0, 0) and (m, n), where m and n are the numbers of sentences in the two languages. The possible correspondences are interpolated between these points, with the window becoming wider (on the order of sqrt(N)) toward the middle of the text.

Figure 1. Initial Alignable Sentence Table

Word Alignment Table (WAT)
The WAT is essentially a list of word pairs, each with a score indicating how similar the distributions of the two words over their corresponding sentences are. For our purposes, the final dictionary is pulled from the WAT after the last iteration.

Sentence Alignment Table (SAT)
The SAT contains sentence pairs that are deemed to be aligned. These pairs are a subset of the points in the AST, and are used to regenerate the AST after each iteration. (See the red points in Figures 4 and 5 below.)

Figure 2.
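To make the AST construction concrete, here is a minimal Python sketch of the initialization described above: fixed points at (0, 0) and (m, n), a linearly interpolated diagonal between them, and a candidate window that widens on the order of sqrt(N) toward the middle of the text. The window constant and exact widening schedule are our illustrative assumptions; the report does not specify them.

```python
import math

def initial_ast(m, n):
    """Initial Alignable Sentence Table between fixed points (0, 0) and
    (m, n).  For each sentence i in text A, the candidate sentences in
    text B lie around the interpolated diagonal, inside a window that is
    zero at the fixed points and on the order of sqrt(N) mid-text.
    (Illustrative sketch; the width constant is an assumption.)"""
    ast = {}
    for i in range(m + 1):
        center = i * n / m            # linear interpolation between fixed points
        frac = min(i, m - i) / m      # 0 at the endpoints, 1/2 in the middle
        width = 2 * math.sqrt(max(m, n)) * frac
        lo = max(0, int(center - width))
        hi = min(n, int(center + width))
        ast[i] = range(lo, hi + 1)    # alignable sentence indices in text B
    return ast
```

The same routine can be reused in step 5 of the algorithm below by interpolating between each consecutive pair of SAT entries instead of between the two global fixed points.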
Algorithm flow: the AST yields a WAT via a Cartesian product of all words in all alignable sentences; correspondence sets over the WAT are used to compute a new SAT; and the SAT regenerates the AST by interpolation.

Summary of the algorithm:

1. Initialize the AST (as described above).

2. Compute a WAT from the AST by taking a Cartesian product of the words and calculating their similarity:

   Similarity = 2c / (Na(v) + Nb(w))

   where c is the size of the largest correspondence set (a correspondence set is the set of sentences in which both words appear, with no overlap), and Na(v) and Nb(w) are the frequencies of word v in text A and word w in text B, respectively.

3. Threshold and sort the WAT. We partition the WAT into several segments based on frequency, because we give more weight to words that appear with both high frequency and high similarity.

4. Compute a SAT from the WAT by taking the maximum-probability alignment. We iterate through the WAT in order; for any words that are paired, we pair their corresponding sentences in the SAT. We throw out any WAT pairs that would cause overlap in the sentence alignments. For example, if sentence 1 in English is aligned to sentence 3 in French, then sentence 2 in English cannot align to sentence 1 in French.

   Figure 3. Example of an invalid (overlapping) sentence alignment

5. Use the entries of the SAT as fixed points in the AST, and interpolate between them to produce a new AST. Interpolation is done in the same manner as for the initial AST, except that we now interpolate between each consecutive pair of SAT entries.

6. Iterate steps 2-5 until the SAT converges. The final SAT gives the aligned sentences, and the final WAT gives the aligned words.

Figure 4. AST after 1 iteration (with 12 entries in the SAT).
Figure 5. AST after convergence (with 75 entries in the SAT).

Optimizations

A few optimizations help the algorithm run faster and discard unlikely data:

1.
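Steps 2 and 4 above can be sketched in a few lines of Python. The similarity function is the formula from step 2 verbatim; `build_sat` is a greedy best-first pass that enforces the overlap rule from step 4 (the function names and the representation of WAT entries as pre-resolved (score, sentence-pair) tuples are our simplifying assumptions, not the report's code).

```python
def similarity(c, na_v, nb_w):
    """Step 2's score: 2c / (Na(v) + Nb(w)), where c is the size of the
    largest correspondence set for the word pair (v, w)."""
    return 2.0 * c / (na_v + nb_w)

def crosses(a, b):
    """True if sentence pairs a and b violate the one-to-one, monotone
    assumption (the overlap rule in step 4)."""
    (e1, f1), (e2, f2) = a, b
    if e1 == e2 or f1 == f2:           # each sentence may be used only once
        return a != b
    return (e1 - e2) * (f1 - f2) < 0   # crossing alignments overlap

def build_sat(wat_pairs):
    """Greedy best-first pass over (score, (e_sent, f_sent)) entries,
    keeping a pair only if it does not overlap any already-accepted pair."""
    sat = []
    for _score, pair in sorted(wat_pairs, reverse=True):
        if not any(crosses(pair, p) for p in sat):
            sat.append(pair)
    return sorted(sat)
```

On the Figure 3 example, once (English 1, French 3) is accepted, both (English 2, French 1) and (English 3, French 2) are rejected as crossing alignments.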
After computing the WAT, throw out any pairs whose similarity score is below a certain threshold.

2. After computing the SAT, throw out any pairs that do not have a large number of correspondences.

These two thresholds can be tuned to trade off accuracy against recall: the lower the threshold, the more words and sentences appear in the alignment, but the greater the chance of error. In addition, the thresholds are not constant across iterations; they are slightly lowered each time. We decrease the thresholds in each iteration for the WAT and SAT because early in the

