The MILE Corpus for Less Commonly Taught Languages

This preview shows page 1-2 out of 6 pages.



[name] [address1] [address2] [address3] [email]
[name] [address1] [address2] [address3] [email]
[name] [address1] [address2] [address3] [email]

Abstract

This paper describes a small, structured English corpus that is designed for translation into Less Commonly Taught Languages (LCTLs), and a set of re-usable tools for the creation of similar corpora. [1] The corpus is highly structured so that it can support machine learning with only a small amount of data. The corpus systematically explores meanings that are known to affect morphology or syntax in the world’s languages. Each sentence is associated with a feature structure showing the elements of meaning that are represented in the sentence. As part of the REFLEX program, the corpus will be translated into multiple LCTLs, resulting in parallel corpora that can be used for training MT and other language technologies.

[1] (Acknowledge NSF, AVENUE & This work was supported by the United States Central Intelligence Agency.)

1 Introduction

Of the 6,000 living languages in the world, only a handful have the necessary monolingual or bilingual resources to build a working statistical or example-based machine translation system. Currently, there are efforts to build language packs for Less Commonly Taught Languages (LCTLs). Each language pack includes parallel corpora consisting of naturally occurring text translated from English into the LCTL or vice versa.

This paper describes a small corpus that supplements the naturally occurring text with a highly systematic enumeration of meanings that are known to affect morphology and syntax in the world’s languages. The supplemental corpus will enable the exploration of constructions that are sparse or obscured in complex data. The corpus consists of n000 English sentences, totaling n000 words. It will be translated into each of the seven targeted LCTLs each year.
This paper describes the construction of the corpus, including tools and resources that can be used for the construction of similar corpora.

Figure 1: A sampling of sentences from the complete elicitation corpus

2 Building an Elicitation Corpus

The MILE elicitation corpus is made up of a highly structured monolingual set of sentences that will be translated by a language informant. Underlying each sentence is a feature structure that codifies its meaning. Additionally, sentences are accompanied by a context field that provides extra information in order to clarify the meaning of the sentence and to supply language feature information that may be present in the feature structure but is not inherent in the English sentence. For example, politeness information may be needed in order to elicit a consistent and correct translation from the informant.

Each feature structure in our corpus is made of carefully layered sets of feature-value pairs. Each feature and its corresponding set of values are defined in an XML-based document called a feature specification. Not only does this specification define and classify features, it also defines acceptable combinations of features and values (Alvarez et al., 2005). For example, for the main clause of a feature structure to have a copula type, it must first be defined as a copula sentence.

The elicitation corpus itself is made up of feature structures that represent the cross product of a set of features. For example, to explore a verb paradigm we might look at sentences with different combinations of polarity, tense and grammatical aspect features. Furthermore, the combinations might also be multiplied out with different noun phrase features, like person and number, on the actor and undergoer of a feature structure. The polarity feature has two values: positive and negative. Tense has seven values and grammatical aspect has five. Person and number each have three values.
Thus, if we were to take a full cross product of these features, we would end up with 2 × 7 × 5 × 3 × 3 = 630 feature structures. (***Check Numbers; MT Summit Citation)

These thematic feature combinations are outlined in feature maps. The corpus designed for this project has seven maps in total, defining the combinations of features that we deemed to be the most morphosyntactically “active”. That is, these were the sets of features that not only would be most likely to be manifested morphosyntactically in a language, but also those that best define the basic building blocks of a language, for example verb paradigms, simple and multi-clause sentence structures, noun phrase paradigms, causation, or modality. These feature maps account for anywhere from a few hundred feature structures to a few thousand. Larger feature maps can be broken into smaller ones. Our noun phrase paradigm map has several smaller maps, including those defining interrogative pronouns, reflexives, and demonstratives.

    srcsent: Mary was not a doctor.
    context: Translate this as though it were spoken to a peer co-worker;

    ( (actor ((np-function fn-actor)(np-animacy anim-human)
              (np-biological-gender bio-gender-female)
              (np-general-type proper-noun-type)
              (np-identifiability identifiable)(np-specificity specific)…))
      (pred ((np-function fn-predicate-nominal)(np-animacy anim-human)
             (np-biological-gender bio-gender-female)
             (np-general-type common-noun-type)
             (np-specificity specificity-neutral)…))
      (c-v-lexical-aspect state)(c-copula-type copula-role)
      (c-secondary-type secondary-copula)(c-solidarity solidarity-neutral)
      (c-v-grammatical-aspect gram-aspect-neutral)(c-v-absolute-tense past)
      (c-v-phase-aspect phase-aspect-neutral)
      (c-general-type declarative-clause)(c-polarity polarity-negative)
      (c-my-causer-intentionality intentionality-n/a)
      (c-comparison-type comparison-n/a)(c-relative-tense relative-n/a)
      (c-our-boundary boundary-n/a)…)

Figure 2: An abridged feature structure, sentence and context field
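The cross-product arithmetic for the verb paradigm example can be checked with a short sketch. This is an illustration only, not the MILE tooling. The feature names c-polarity, c-v-absolute-tense and c-v-grammatical-aspect are taken from Figure 2, but np-person, np-number and every value name other than polarity-negative are hypothetical placeholders, since the text gives only the number of values for each feature. Following the arithmetic in the text, person and number are counted once rather than separately for actor and undergoer.

```python
from itertools import product

# Value inventories for the verb paradigm example.
# Only the counts (2, 7, 5, 3, 3) come from the paper; the value names
# below are hypothetical placeholders, as are "np-person" and "np-number".
FEATURES = {
    "c-polarity": ["polarity-positive", "polarity-negative"],
    "c-v-absolute-tense": [f"tense-{i}" for i in range(1, 8)],           # 7 values
    "c-v-grammatical-aspect": [f"gram-aspect-{i}" for i in range(1, 6)], # 5 values
    "np-person": [f"person-{i}" for i in range(1, 4)],                   # 3 values
    "np-number": [f"number-{i}" for i in range(1, 4)],                   # 3 values
}


def cross_product(features):
    """Yield one flat feature-value dict per combination of values."""
    names = list(features)
    for combo in product(*(features[name] for name in names)):
        yield dict(zip(names, combo))


feature_structures = list(cross_product(FEATURES))
print(len(feature_structures))  # 630
```

A real feature map would presumably also filter out combinations that the feature specification disallows, so the full cross product is best read as an upper bound on the number of feature structures in a map.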
Now that we have our sets of feature structures

