Example-Based Machine Translation: An Investigation

Steven S. Ngai and Randy B. Gullett
CS224N Final Project
Fri Jun 7 02:02:56 PDT 2002

Contents

Problem
A Proposed Solution
Indexing
Chunk searching and subsuming
Alignment
Implementation
Corpus
Difficulties in alignment
Files Included
Linguistic models and their validity
Design decisions
Testing
Results
Effect of Corpus Preparation
Performance of the Indexer
Performance of the ChunkFinder
Failures and Reasons
Suggestions for Improvement
Responsibilities
References

Problem

Now more than ever, the world looks to computers to perform the task of translation. Spurred on by the information age, more and more computer-enabled sources are pouring an increasing proportion of documents into global forums, including but not limited to the Internet. These forums are becoming sources of information for a growing number of people worldwide, and it is no surprise that everyone wants information in his or her own language. At the same time, accuracy requirements have tightened: bodies like the European Union produce daily proceedings that, by law, must be translated into the languages of all constituent countries so precisely that any translation can be used in a court of law. Not surprisingly, human translators are unable to keep up with this demand.

Among machine translation systems, traditional transformational methods are difficult to construct, as they essentially involve hardcoding the idiosyncrasies of both languages. Through the work of human translators, however, large parallel corpora have become available. It therefore makes sense, if it proves viable, to base translations on these large bodies of text, in order to capture the knowledge contained in preexisting translations.
Our investigation looks into one such method and its successes and failings.

A Proposed Solution

Example-based machine translation (EBMT) is one such response to traditional models of translation. Like statistical MT, it relies on large corpora and largely rejects traditional linguistic notions (although nothing prevents an EBMT system from using those notions to improve its output). EBMT systems are attractive in that they require a minimum of prior knowledge and are therefore quickly adaptable to many language pairs.

The particular EBMT system that we examine works in the following way. Given an extensive corpus of aligned source-language and target-language sentences, and a source-language sentence to translate:

1. It identifies exact substrings of the sentence to be translated within the source-language corpus, thereby returning a series of source-language sentences.
2. It takes the corresponding sentences in the target-language corpus as the translations of those source-language sentences (this should be the case!).
3. Then, for each pair of sentences:
   a. it attempts to align the source- and target-language sentences;
   b. it retrieves the portion of the target-language sentence marked as aligned with the corpus source-language sentence's substring, and returns it as the translation of the input source-language chunk.

The above system is a specialization of generalized EBMT systems. Other specific systems may operate on parse trees or only on entire sentences.

The system requires the following:

1. sentence-aligned source and target corpora;
2. a source-to-target dictionary;
3. a stemmer.

The stemmer is necessary because we will typically find only uninflected forms in dictionaries. While it is consulted in the alignment algorithm, it is not consulted in the matching step; as stated before, those matches must be exact.

In this project we rely on papers published by Ralf D. Brown and by Sergei Nirenburg describing work on the PanGloss translation project. Their two approaches are different, but nevertheless provided a good guideline for our implementation.

Methods (Algorithms)

Indexing

In order to facilitate the search for sentence substrings, we need to create an inverted index into the source-language corpus. To do this we loop through all the words of the corpus, adding the current location (defined by sentence index in the corpus and word index in the sentence) to a hashtable keyed by the appropriate word. To save time in future runs, we write this index to a file.

Chunk searching and subsuming

We keep two lists of chunks: current and completed. Looping through all words in the input sentence:

- See whether the locations for the current word extend any chunks on the current list; if they do, extend those chunks.
- Move to the completed list any chunks that could not be extended, rejecting (throwing away) any that are only one word long.
- Start a new current chunk for each location of the current word.

At the end, move everything to the completed list. Then, to prune, run every chunk against every other:

- If a chunk properly subsumes another, remove the smaller one.
- If two chunks are equal and we have too many of them, remove one.

Alignment

The alignment algorithm proceeds as follows:

1. Stem the words of the specified source sentence.
2. Look up those words in a translation dictionary.
3. Stem the words of the specified target sentence.
4. Try to match the target words with the source words; wherever they match, mark the correspondence table.
5. Prune the table to remove unlikely word correspondences.
6. Take only as much target text as is necessary to cover all the remaining (unpruned) correspondences for the source-language chunk.

Stemming is done using . RANDY YOUR STUFF GOES HERE.

Pruning is done using .

The pruning algorithm relies on the fact that single words are not often violently displaced from their original position.
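The report leaves the pruning method unspecified at this point, but one way such a proximity assumption could be realized is to drop correspondences whose relative position in the target sentence diverges too far from the source word's relative position. The following is a hypothetical sketch only, not the project's actual code; the class name `PositionalPruner` and the `tolerance` parameter are illustrative inventions:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of proximity-based pruning: a correspondence (s, t)
// survives only if the target word's relative position is within `tolerance`
// of the source word's relative position.
public class PositionalPruner {
    /** Each correspondence is {sourceIndex, targetIndex}. */
    public static List<int[]> prune(List<int[]> correspondences,
                                    int sourceLength, int targetLength,
                                    double tolerance) {
        List<int[]> kept = new ArrayList<>();
        for (int[] c : correspondences) {
            double srcPos = (double) c[0] / sourceLength; // relative position in source
            double tgtPos = (double) c[1] / targetLength; // relative position in target
            if (Math.abs(srcPos - tgtPos) <= tolerance) {
                kept.add(c);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // A word near the start of the source aligned to a word at the end of
        // the target is implausible under the proximity assumption: pruned.
        List<int[]> cs = new ArrayList<>();
        cs.add(new int[]{0, 0});   // plausible: start -> start
        cs.add(new int[]{1, 9});   // implausible: near start -> end
        List<int[]> kept = prune(cs, 10, 10, 0.3);
        System.out.println(kept.size());   // prints 1
    }
}
```

In practice the tolerance would have to be tuned per language pair, since the amount of acceptable displacement depends on how similar the two languages' word orders are.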
This assumption holds between English and most of the Romance languages; notable exceptions may (but do not necessarily) include the oft-cited non-SVO languages Korean, Japanese, and Arabic. In addition, the pruning algorithm works best when most word correspondences are 1-to-1.

Implementation

The project is implemented in Java. The corpus was prepared using a small Perl script and command-line tools, and was finalized by hand.

Corpus

We used English-Spanish texts from the Pan American Health Organization as our bilingual corpus. To select files for this purpose, we examined the files and chose those which seemed to be reports,

