SWARTHMORE CS 97 - Developing a Morphological Segmenter for Russian - D2577733


School name Swarthmore College

Course Cs 97- Computer Perception

Pages 6

This preview shows page 1-2 out of 6 pages.


Appeared in: Proceedings of the Class of 2005 Senior Conference, pages 28–33
Computer Science Department, Swarthmore College

Developing a Morphological Segmenter for Russian

America L. Holloway
Swarthmore College
Swarthmore, PA [email protected]

Abstract

This paper presents an algorithm for developing a morphological segmenter for Russian. The segmenter can find multiple prefixes and suffixes for any given word. Therefore it is more suitable for a highly inflected language than a segmenter that is limited to at most one prefix or suffix. The segmenter requires a small hand segmented corpus to bootstrap from, and a larger unsegmented corpus from which to learn. The algorithm uses trigram probabilities, and Witten-Bell smoothing to predict the correct segmentation of a word. A filtering step is also used to weed out bad segmentations.

1 Introduction

Many languages, including Russian and Arabic, have a richer morphology than is found in English. In Russian, not only do verb endings change to reflect person, gender and number, noun endings also change (or in some cases are truncated) to reflect case. For example, the ending a is appended to a masculine noun to form the genitive singular. Furthermore, a word in Russian can often times be decomposed into smaller units, or morphemes, each of which carries its own meaning. These morphemes contribute to, and refine, the meaning of the entire word. For example, the noun predsedatel (predsedatil) means 'representative'. Literally, it can be translated as 'the one' (el) 'who sits' (sidet) 'before' (pred), or more figuratively, 'the one who represents' us. In general, it is not rare for a word to have multiple prefixes and suffixes. Multiple suffixes, in particular, are common. As an example, reflexive verbs will always have two suffixes. The first suffix -s (-cya) indicates it is a reflexive verb, and the second suffix indicates what type of verb it is. These suffixes include -ova (-ova), -at (-at) and -it (-ut).
Thus, to capture the morphology of such a language, it is important that any morphological analyzer be able to recognize multiple prefixes and suffixes.

The algorithm presented in this paper is adapted from the morphological segmenter for Arabic created by (Lee et al., 2003). Many existing morphological analyzers, for example (Goldsmith, 2000), identify only single suffixes. This type of system fails to capture the entire morphology of Russian. Recognizing multiple prefixes and suffixes is especially important for tasks such as aligning corpora, information retrieval and machine translation. This is because one Russian word may correspond to multiple words in a different language. Thus, our goal was to implement a segmenter that (1) could identify multiple affixes and (2) required few resources, in order to create a superior morphological analyzer specifically for Russian.

Our system requires only a small hand-segmented corpus to bootstrap the segmenter, and a larger, unsegmented corpus from which to gain new stems. For any Russian word, all possible segmentations are enumerated and the trigram probability of each is computed. The highest scoring segmentation is chosen as the correct one. The system performance is surprisingly good given the small corpus and simple algorithm.

2 Related Work

As stated, this algorithm draws heavily from (Lee et al., 2003). They present a morphological segmenter for Arabic which identifies multiple prefixes and suffixes and requires only a small hand segmented corpus (110,000 words) and a large unsegmented corpus (155 million words). They also supplement their segmenter with an additional prefix/suffix list. The large unsegmented corpus is used to acquire new stems. They first divide the corpus into partitions. For each word, all possible segmentations are enumerated and the segmentations with the highest probabilities are kept.
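
The enumerate-and-score step described above can be illustrated with a minimal sketch. It is not the paper's implementation: the function and class names, the `vocab_size` parameter, the boundary symbols `<s>` and `</s>`, and the toy Latin-alphabet training data are all assumptions made for illustration, and the interpolated form of Witten-Bell smoothing used here is only one common variant, which may differ from the one the segmenter actually uses.

```python
from collections import Counter, defaultdict


def enumerate_segmentations(word, prefixes, suffixes):
    """Yield every split of `word` into prefix* + stem + suffix* (stem non-empty)."""

    def strip_prefixes(rest):
        yield [], rest  # option: peel off no further prefix
        for p in prefixes:
            if p and rest.startswith(p) and len(rest) > len(p):
                for more, tail in strip_prefixes(rest[len(p):]):
                    yield [p] + more, tail

    def strip_suffixes(rest):
        yield rest, []  # option: peel off no further suffix
        for s in suffixes:
            if s and rest.endswith(s) and len(rest) > len(s):
                for head, more in strip_suffixes(rest[:-len(s)]):
                    yield head, more + [s]

    for pre, tail in strip_prefixes(word):
        for stem, suf in strip_suffixes(tail):
            yield pre + [stem] + suf


class WittenBellTrigram:
    """Trigram model over morphemes with interpolated Witten-Bell smoothing."""

    def __init__(self, segmented_words, vocab_size):
        self.counts = Counter()            # history + (morpheme,) -> count
        self.context = Counter()           # history -> number of continuations seen
        self.followers = defaultdict(set)  # history -> distinct continuations
        self.vocab_size = vocab_size       # assumed size of the morpheme vocabulary
        for morphs in segmented_words:
            seq = ["<s>", "<s>"] + list(morphs) + ["</s>"]
            for i in range(2, len(seq)):
                for n in range(3):  # history lengths 0, 1 and 2
                    hist = tuple(seq[i - n:i])
                    self.counts[hist + (seq[i],)] += 1
                    self.context[hist] += 1
                    self.followers[hist].add(seq[i])

    def prob(self, morph, hist=()):
        hist = tuple(hist)
        # Lower-order estimate; a uniform distribution ends the recursion.
        lower = self.prob(morph, hist[1:]) if hist else 1.0 / self.vocab_size
        c = self.context[hist]
        if c == 0:  # unseen history: fall back entirely to the lower order
            return lower
        t = len(self.followers[hist])
        return (self.counts[hist + (morph,)] + t * lower) / (c + t)

    def score(self, morphs):
        seq = ["<s>", "<s>"] + list(morphs) + ["</s>"]
        p = 1.0
        for i in range(2, len(seq)):
            p *= self.prob(seq[i], seq[i - 2:i])
        return p


def best_segmentation(word, prefixes, suffixes, model):
    return max(enumerate_segmentations(word, prefixes, suffixes), key=model.score)


# Invented toy "hand-segmented corpus" (Latin-alphabet stand-ins, not real data).
train = [["pere", "pis", "a", "t"], ["pis", "a", "t"],
         ["pere", "chit", "a", "t"], ["chit", "a", "t"]]
model = WittenBellTrigram(train, vocab_size=1000)
candidates = list(enumerate_segmentations("perepisat", {"pere"}, {"a", "t"}))
best = best_segmentation("perepisat", {"pere"}, {"a", "t"}, model)
print(len(candidates), best)
```

A real system would work in log space to avoid underflow on long words and would mark prefixes and suffixes distinctly so that the trigram model can tell them apart from stems; both are omitted here to keep the sketch short.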
After each partition, the trigram probabilities are recomputed to take into account the new stems found. Each stem is also subjected to further testing to ensure that it does not contain a prefix or suffix. Stems are added to the list based upon the stem frequency (i.e. the number of times they are seen), the probability that a substring of the stem is a prefix or suffix, and contextual information. With just trigram probabilities alone, (Lee et al., 2003) are able to reduce the error from the baseline performance by half.

(Goldsmith, 2000) uses the notion of minimum description length (MDL) to implement a morphological segmenter, Linguistica, that is quite successful. Linguistica takes only a corpus and returns a list of stems, a list of suffixes and a list of signatures. A signature is a set of suffixes which can appear on the end of a stem. An example signature is the set ( -NULL, -s ). There are many stems, such as apple or cow, that are associated with this signature. A first analysis of the corpus can be as simple as splitting every word after each letter. Other heuristics are then employed to shrink the list of signatures.

Minimum description length is based on the notion that the number of letters in the morphological analysis of a corpus (e.g. a list of stems, suffixes and signatures) will be less than the number of letters in the original corpus. Accordingly, Goldsmith develops a description length to measure the size of the morphological analysis of a corpus. That is, he creates a description length to measure the size of the stem list, suffix list and signature list. After each heuristic is applied, the description length is computed. If the description length has decreased, the analysis is kept. Notably, Linguistica identifies only one suffix per word. For example, if the word breathings occurred in our corpus, the stem would be breathing and the suffix would be -s. Thus breathings would be associated with the signature given above.
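
To make the notions of signature and description length concrete, here is a small sketch. It is an illustration only, not Linguistica's method: `build_signatures` and `letter_cost` are hypothetical helpers, the empty string stands in for the NULL suffix, and the cost is a crude letter count that ignores the cost of the signature list itself, whereas Goldsmith's actual measure is more detailed.

```python
from collections import defaultdict


def build_signatures(analyses):
    """Map each signature (a frozenset of suffixes) to the stems that take it.

    `analyses` is an iterable of (stem, suffix) pairs; "" stands for NULL.
    """
    suffixes_of = defaultdict(set)
    for stem, suffix in analyses:
        suffixes_of[stem].add(suffix)
    signatures = defaultdict(set)
    for stem, sufs in suffixes_of.items():
        signatures[frozenset(sufs)].add(stem)
    return dict(signatures)


def letter_cost(signatures):
    """Crude stand-in for description length: letters needed to list every
    distinct stem and every distinct suffix once."""
    stems = set().union(*signatures.values()) if signatures else set()
    suffixes = set().union(*signatures.keys()) if signatures else set()
    return sum(len(s) for s in stems) + sum(len(s) for s in suffixes)


# Single-suffix analyses of a toy corpus, using the examples from the text.
analyses = [("apple", ""), ("apple", "s"), ("cow", ""), ("cow", "s"),
            ("breathing", ""), ("breathing", "s")]
sigs = build_signatures(analyses)
raw_letters = sum(len(stem + suffix) for stem, suffix in analyses)
print(sigs)
print(letter_cost(sigs), "letters in the analysis vs", raw_letters, "in the corpus")
```

All three stems fall under the single signature ( -NULL, -s ), and the analysis is shorter than the raw word list, which is the situation in which an MDL-style criterion would keep the analysis.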
Recall of 85.9% and precision of 90.4% is achieved for English.

Work using multilingual corpora to aid in morphological analysis has also been performed. (Yarowsky et al., 2001) use a lemmatizer and multilingual corpora to achieve a precision over 98% on a French corpus of 1.2 million words. (Hana et al., 2004) use a Czech-Russian aligned corpus. The system combines information from their own morphological segmenter, the Czech corpus and a part of speech tagger. Instead of detecting multiple prefixes or suffixes, they use the
