Data Collection and Analysis of Mapudungun Morphology for Spelling Correction

Christian Monson1, Lori Levin1, Rodolfo Vega1, Ralf Brown1, Ariadna Font Llitjos1, Alon Lavie1, Jaime Carbonell1, Eliseo Cañulef2, Rosendo Huisca2

1 Language Technologies Institute, Carnegie Mellon University, 5000 Forbes Ave., Pittsburgh, PA 15213
2 Instituto de Estudios Indígenas, Universidad de La Frontera, Montevideo 0870, Temuco, Chile

Abstract

This paper describes part of a three-year collaboration between Carnegie Mellon University's Language Technologies Institute, the Programa de Educación Intercultural Bilingüe of the Chilean Ministry of Education, and Universidad de La Frontera (Temuco, Chile). We are currently constructing a spelling checker for Mapudungun, a polysynthetic language spoken by the Mapuche people in Chile and Argentina. The spelling checker will be built in MySpell, the spell checking system used by the open source office suite OpenOffice.
This paper also describes the spoken language corpus that is used as a source of data for developing the spelling checker.

Introduction

This paper describes part of a three-year collaboration between Carnegie Mellon University's Language Technologies Institute, the Programa de Educación Intercultural Bilingüe of the Chilean Ministry of Education, and Universidad de La Frontera (Temuco, Chile). In a previous paper (Levin et al., 2002) we provided an overview of the project. In this paper, we will focus on the preparation of corpora and lexica that will support an on-line lexicon and a spelling corrector for Mapudungun, an indigenous language of Chile.

Our project has scientific and social significance. The scientific novelty of the project is in the application of computational tools (such as morphological analysis, Example-Based Machine Translation, and Transfer Based MT) to a polysynthetic language. We are also working on new techniques for automatically learning transfer rules from word-aligned bilingual data (Carbonell et al., 2002; Probst et al., 2001, 2002a, 2002b, 2003; Lavie et al., in press).

The social significance of the project stems from the Chilean Ministry of Education's commitment to bilingual education in Spanish and Mapudungun for Mapuche children, where computer-based tools are a welcome part of the bilingual education program. Chile's electronic education network project, ENLACES, for example, provides computers and networking to all Chilean schools, including those in rural areas.

Mapudungun

Mapudungun, a polysynthetic language with noun and verb incorporation, is the language of over 900,000 Mapuche people in Chile and Argentina. While the morphology of other parts of speech is relatively simple, Mapudungun has a complex agglutinative suffixal verb morphology: some analyses provide as many as 36 verb suffix slots (Smeets, 1989).
A typical complex verb form occurring in our corpus of spoken Mapudungun consists of five or six morphemes.

A verb begins with a stem and ends with an obligatory morpheme sequence marking, in the case of finite clauses, the person and number of the subject together with the mood of the verb or, in the case of non-finite clauses, adverbialization or nominalization. A number of morphemes may occur between the verb stem and the verb-final morpheme cluster, including aspect, tense, applicative, voice, directional, and object agreement markers. If incorporation occurs, the incorporated noun or verb is placed immediately following the verb stem.

The relative order of the verbal morphemes is usually fixed, and there are only a few simple morphophonemic changes at morpheme boundaries.

Figure 1 contains glosses of a few morphologically complex Mapudungun verbs taken from our bilingual lexicon.

amu  -ke        -yngün
go   -habitual  -3plIndic
They (usually) go

ngütrümtu  -a    -lu
call       -fut  -adverb
While calling (tomorrow), …

nentu    -ñma  -nge   -ymi
extract  -mal  -pass  -2sgIndic
you were extracted (on me)

ngütramka  -me   -a    -fi    -ñ
tell       -loc  -fut  -3obj  -1sgIndic
I will tell her (away)

Figure 1: Examples of Mapudungun verbal morphology taken from our corpus of spoken Mapudungun

Corpora and Lexica

The CMU-Chile project, Avenue-Mapudungun, is planning two tools for the near future: an on-line bilingual lexicon with examples of usage from a corpus of spoken Mapudungun, and a spelling checker for Mapudungun built on MySpell, the spell checking system used by the open source office suite OpenOffice. In support of these tools we are developing a number of corpora and lexica.

The Corpus of Spoken Mapudungun

In the last three years, the Chilean Ministry of Education and CMU's Avenue project have supported the collection of 170 hours of spoken Mapudungun. The recordings (all on the topic of health care) have been transcribed and translated into Spanish at the Instituto de Estudios Indígenas at Universidad de La Frontera.
The corpus covers three dialects of Mapudungun: 120 hours of Nguluche, 30 hours of Lafkenche and 20 hours of Pewenche. A small excerpt from this spoken Mapudungun corpus can be found in [figure reference missing]. The corpus is described in more detail in Levin et al. (2002).

To better understand the nature of this spoken language corpus it is interesting to compare the plots, shown in [figure reference missing], of vocabulary size (types) vs. corpus
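
As a rough illustration of how a vocabulary-size curve of this kind can be computed, the Python sketch below counts distinct word forms (types) as text is read. The text above is cut off before it names the second axis, so the sketch assumes the comparison is against the running word count (tokens); that assumption, the function name vocabulary_growth, and the toy input are ours and are not taken from the paper.

```python
from typing import Iterable, List, Tuple


def vocabulary_growth(tokens: Iterable[str]) -> List[Tuple[int, int]]:
    """Return (tokens_seen, distinct_types_seen) after each token is read."""
    seen = set()
    curve = []
    for count, token in enumerate(tokens, start=1):
        seen.add(token)
        curve.append((count, len(seen)))
    return curve


# Toy input only: segmented verb forms copied from Figure 1 and repeated
# arbitrarily. These are NOT counts from the Mapudungun corpus.
toy_tokens = [
    "amu-ke-yngün",
    "ngütrümtu-a-lu",
    "amu-ke-yngün",
    "nentu-ñma-nge-ymi",
    "ngütramka-me-a-fi-ñ",
    "amu-ke-yngün",
]

for n_tokens, n_types in vocabulary_growth(toy_tokens):
    print(n_tokens, n_types)
```

Plotting the second value against the first for two corpora gives curves that can be compared directly: a repeated form advances the token count but leaves the type count unchanged.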

