Stanford CS 224 - Lecture Notes


TURKALATOR¹
A Suite of Tools for Augmenting English-to-Turkish Statistical Machine Translation

by Gorkem Ozbek [[email protected]]
Siddharth Jonathan [[email protected]]

CS224N: Natural Language Processing
Final Project
Prof. Christopher D. Manning
June 7, 2006

¹ A note on the title: Turks believe, perhaps more than people of other cultures, that one needs first and foremost to master the customs and idiosyncrasies of a culture to truly belong in it. This is a fitting metaphor for our goals in this project: customize the standard pipeline for building statistical MT systems to specific features of Turkish, so that the resulting product not only translates English text but also turkalates it.

1 Introduction

Statistical machine translation (hereafter MT) research has traditionally focused on popular languages of the developed world: English, other major European languages, and some major Asian languages. The reason for this is two-fold. On one hand, agencies funding MT research, guided by practical concerns that may sometimes elude researchers, demand systems tuned for these major languages, and where the funding goes, research goes, as evidenced, for example, by the recent increase in Arabic and Chinese MT research. On the other hand, until recently the resources required for developing statistical MT systems, especially large parallel corpora, have been available only for major languages. Overall, languages for which resources are scarce, whether financial or linguistic, have received little attention from the statistical MT research community. Despite being spoken widely around the world in different forms, Turkish has been one of these languages: there has been minimal research on MT systems where Turkish is either the source or the target language. This project attempts to fill this gap, if only slightly.
We designed and implemented a suite of tools intended to aid the statistical MT creation process by exploiting linguistic features of Turkish and the ways these features differ from their counterparts in English. We then coupled these tools with existing software to produce an English-to-Turkish MT system that we hoped would perform better than a baseline system constructed without regard for the idiosyncrasies of the task at hand.

Turkish differs significantly from English in its linguistic structure. While the differences are widespread across many aspects of the language pair, for the purposes of this project we focused on difficulties stemming from Turkish morphology and morpho-syntax. As an agglutinative language, Turkish has a much richer inflectional morphology than English. On the word level, this means that individual morphemes in Turkish often assume semantic and syntactic functions that, in a well-formed English sentence, are reserved almost exclusively for whole words. This is most apparent in the structure of Turkish verbs, which encode in their morphology, among other things, the semantic role played by the auxiliary verbs that would modify them in English. On the phrasal level, the difficulties with morphology also extend to the morphology-syntax interface.²

The main problem rich morphology causes for the statistical MT framework is one of data scarcity. Both essential components of the translation system, the translation model and the language model, suffer from a low data-to-parameter ratio due to scarcity. A thorough morphological analysis is one way of remedying the scarcity problem (Dejean et al., 2003). However, this approach assumes the existence of a morphological transducer, an abstract machine that breaks down the surface form of a word into its stem and morphemes and reconstructs the surface form from this stem and list of morphemes.
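To make the two directions of a morphological transducer and the scarcity argument concrete, the following is a minimal sketch in Python. It is not the transducer that produced the analyzed corpus, nor any part of the system described here. The lexicon TOY_ANALYSES and the functions analyze, generate and type_counts are hypothetical names introduced purely for illustration, and the hand-written segmentations stand in for a real analyzer. Generation by plain concatenation works only because the toy suffixes are already listed in their vowel-harmonized forms.

```python
# Hypothetical toy lexicon (an assumption for illustration only):
# surface form -> (stem, sequence of suffix morphemes).
TOY_ANALYSES = {
    "ev": ("ev", ()),                               # "house"
    "evler": ("ev", ("ler",)),                      # "houses"
    "evde": ("ev", ("de",)),                        # "in the house"
    "evimiz": ("ev", ("imiz",)),                    # "our house"
    "evlerimizde": ("ev", ("ler", "imiz", "de")),   # "in our houses"
}


def analyze(surface):
    """Analysis direction: surface form -> (stem, morphemes)."""
    return TOY_ANALYSES[surface]


def generate(stem, morphemes):
    """Generation direction: (stem, morphemes) -> surface form.

    Plain concatenation; a real transducer would also apply
    phonological rules such as vowel harmony.
    """
    return stem + "".join(morphemes)


def type_counts(tokens):
    """Return (number of distinct surface forms, number of distinct stems)."""
    surface_types = len(set(tokens))
    stem_types = len({analyze(token)[0] for token in tokens})
    return surface_types, stem_types


if __name__ == "__main__":
    tokens = list(TOY_ANALYSES)
    # Round trip: analysis followed by generation recovers each surface form.
    for token in tokens:
        assert generate(*analyze(token)) == token
    print(type_counts(tokens))
```

In this toy sample, five distinct surface forms share a single stem. A model that treats each surface form as an unrelated word must estimate parameters for all five from the same amount of text, which is the low data-to-parameter ratio described above.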
While we had our data available in morphologically analyzed form, we did not have access to the transducer that produced it. Furthermore, developing a fully functional transducer with appropriate morphology definitions for Turkish and English was impractical due to time and scope limitations. Instead, we chose to approximate a morphological analysis for Turkish and then use this approximate analysis to compensate for the negative effects of data scarcity on building the translation and language models. Similar efforts have been shown to improve at least word alignment for the English-German language pair (Corston-Oliver and Gamon, 2004). Other methods, such as weighted finite-state transducer composition, have also been proposed to aid word alignment for language pairs of varying morphological complexity (Schafer and Drabek, 2005).

² While we address some of the issues involved in Turkish morpho-syntax within the scope of this project (most notably in the phrase extraction algorithm), there are others that we leave mostly unexplored. For example, as a language with free word order, Turkish marks subjects and objects in a sentence through morphemes.

In the next section we discuss in detail the components we developed in accordance with our approach and the design decisions we made to render our system sensitive to the morphological structure of Turkish.

2 System Components

In addition to the corpora we used, our system consists of three processes that we developed specifically for this task: preprocessing (including morphological approximation), word alignment, and phrase extraction and scoring. We used these processes in tandem with existing software to stitch together a machine translator. Below we discuss this construction pipeline, including the language model creation, and comment on our data.

2.1 Corpus

Our parallel corpus of English and Turkish text was graciously made available to us by Prof.
Kemal Oflazer of Sabanci University, Istanbul, Turkey. It consists of approximately 22,000 aligned sentence pairs drawn from a variety of genres: George Orwell’s novel 1984 makes up some of the text, along with tourism guides, court hearings, and others. A morphologically analyzed version of the corpus was also made available. The most important thing to notice about this corpus is its relatively small size. Generally, MT systems are trained on corpora at least an order of magnitude larger than ours, even when the additional scarcity problems introduced by morphology are not a concern. On the other hand, the assumption is that using large corpora frees one from having to deal separately with morphology. Hence, using

