Appeared in: Proceedings of the Class of 2005 Senior Conference, pages 42–47
Computer Science Department, Swarthmore College

Word Alignment of Parallel Texts

Joshua Berney
Department of Computer Science
Swarthmore College
[email protected]

Perini
Department of Computer Science
Swarthmore College
[email protected]

Abstract

We induced a word-aligned dictionary of English and French using parallel texts. Our texts were the Hansards corpus and a small literary text corpus. We performed phrase alignment by using identical words in both texts as anchor points, and improved the distribution of our anchor points with lexically similar words. We then performed statistical word alignment using φ statistical correlation to locate translation word pairs in the parallel corpora. Our results show that φ correlation works reasonably well when a large number of small parallel phrases are available.

1 Introduction

The goal of this project is to induce a translation dictionary between two similar languages using parallel corpora. The two languages we chose were English and French, as they mostly share the same character set and have significant linguistic similarities. One ready source of large blocks of parallel text in English and French is classic literary works that have been translated. These have the advantages of being in the public domain and of being long documents with consistent word usage and translation style throughout. In addition, the translation of a literary text will leave a large number of words untouched and untranslated, such as characters’ names, locations, etc. We will need these and any French-English cognates in our phrase alignment algorithm.
The disadvantages of using literary works are that the translators, in an attempt to reproduce the style of the original texts, are less likely to produce exact translations, will use less common words, and will repeat words less often.

The literary text we used was “Swann’s Way”, the first volume of Marcel Proust’s Remembrance of Things Past, which is approximately 200,000 words in French and English. We took the text from the Project Gutenberg website [1]. We also ran our system against part of the Hansards corpus [2], which is the proceedings of the Canadian Parliament and is in both English and French. This corpus was appealing because it was already split into sentence alignments and was very large (approximately 1 million words).

We started with phrase and sentence alignments using anchor points, which were words that are identical in both texts. We then increased the number of anchor points we used by finding likely matches using lexical similarity. Armed with a large number of aligned phrases, we then matched likely pairs using the φ statistic correlation method.

[1] Project Gutenberg main site: http://www.gutenberg.org. The specific Proust text can be found at: http://www.gutenberg.org/etext/2650 and http://www.gutenberg.org/etext/7178
[2] http://www.isi.edu/natural-language/download/hansard/1

2 Previous Work

Dmitriy’s (2005) work shares a number of our intentions: his system induces dictionaries for languages with few machine translation resources from parallel texts in linguistically similar languages. He aligns his text on a character-to-character basis rather than by word tokens, and he then performs a number of post-processing steps to improve the matches he receives from the GIZA++ software, a statistical alignment package updated by Franz Josef Och. One of these steps is the use of lexically similar words, determined by edit distance, as ‘seed’ words for the alignment models.

[Figure 1: Output from the original anchor alignment procedure: English vs. French Anchor Position. Axes: English Anchor Index and French Anchor Index.]

Melamed (1999) discusses a much more complicated alignment process for bilingual parallel corpora. He uses cognates and lexically similar words, with lexical similarity determined by the Longest Common Subsequence Ratio method. The alignment is further refined using methods taken from signal noise filtering, as well as several-pass segment alignment and subsection deviations.

Gale and Church (1991) discuss using the φ statistic for determining word correspondences. However, their paper is preliminary and provides only vague numbers. There does not appear to be any follow-up work.

3 Phrase Alignment

The first step in our system is finding equivalent phrases in the source and target languages. This is done to reduce the total number of comparisons that must be made to find translations and to avoid false matches of words that are very far apart from each other in the text. Because the running time of most post-segment-alignment matching algorithms grows much faster than O(n), where n is the number of words in a corpus, our phrase-alignment method can significantly reduce the time required later in the system.

Our algorithm relies on the fact that some words, typically nouns, are lexically identical in the source and target languages. Common examples of such words are places, names, and recently developed concepts. Using these words, we can divide the source and target texts into equivalent phrases. Before beginning the main algorithm, we standardize or eliminate most punctuation. Next, we locate the words that are exactly the same in the target and source languages and record their indices.
We impose a minimum word length to exclude words that are lexically identical but are actually different words, such as the English ’a’ and French ’a’ (the English ’a’ is an indefinite article, whereas ’a’ in French is the third-person singular conjugation of ’avoir’, ’to have’). We also limit the number of occurrences of each word, in the hope of limiting the number of times one word appears very close to itself and hence creates possible confusion over the actual anchor pair matching.

Examining a plot of English vs. French position generated from the above algorithm (Figure 1), we see a relatively clear line running from the origin toward (number of English words, number of French words) and many scattered points throughout the plot. Considering the solid line in the figure and the structure of language, we make the assumption that a linear relation exists between
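
The anchor-candidate step described in Section 3 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name find_anchor_candidates, the parameter names, and the default thresholds min_length=4 and max_occurrences=5 are all assumptions made for illustration, since the text does not state the values actually used. The sketch also assumes both texts have already been tokenized, lower-cased, and stripped of punctuation.

```python
from collections import Counter


def find_anchor_candidates(english_words, french_words,
                           min_length=4, max_occurrences=5):
    """Return {word: (english_indices, french_indices)} for shared words.

    min_length and max_occurrences are hypothetical thresholds chosen
    for illustration; the paper does not report its own values.
    """
    english_counts = Counter(english_words)
    french_counts = Counter(french_words)

    # Keep words spelled identically in both texts that are long enough
    # to rule out accidental matches (English 'a' vs. French 'a') and
    # rare enough that repeated occurrences are unlikely to be confused.
    shared = {
        word
        for word in english_counts.keys() & french_counts.keys()
        if len(word) >= min_length
        and english_counts[word] <= max_occurrences
        and french_counts[word] <= max_occurrences
    }

    # Record the word positions of every surviving anchor candidate.
    anchors = {word: ([], []) for word in shared}
    for index, word in enumerate(english_words):
        if word in shared:
            anchors[word][0].append(index)
    for index, word in enumerate(french_words):
        if word in shared:
            anchors[word][1].append(index)
    return anchors


if __name__ == "__main__":
    english = "swann went to combray and a friend saw swann".split()
    french = "swann alla a combray et un ami a vu swann".split()
    print(find_anchor_candidates(english, french, max_occurrences=2))
```

In this toy input, 'a' appears in both texts but is rejected by the length filter, while 'swann' and 'combray' survive as anchor candidates. Pairing up the recorded English and French positions of each candidate is a separate step that this sketch does not attempt.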
