
Clustering and Filtering of Initially Selected Schemes




The bottom-up search strategy presented in Chapter 3 is a solid first step toward identifying useful models of productive inflectional paradigms. Figure N provides a look at a range of schemes selected during a typical search run. Each row of Figure N lists a scheme selected while searching over a Spanish newswire corpus of 50,000 types using the stem ratio metric set at 0.25 (see Chapter N). On the far left of Figure N, the Rank column states the ordinal rank at which that row's scheme was selected during the search procedure: the Ø.s scheme was the terminal scheme of ParaMor's 1st upward search path, a.as.o.os the 2nd, the four-suffix scheme containing ido, idos, and ir the 1592nd, etc. The right four columns of Figure N present raw data on the selected schemes, giving the number of c-suffixes in that scheme, the c-suffixes themselves, the number of adherent c-stems of the scheme, and a sample of those c-stems.

Between the rank on the left and the scheme details on the right are columns which categorize the scheme on its success or failure to model a true paradigm of Spanish. A dot appears in the columns marked N, Adj, or Verb if the c-suffixes in a row's scheme reasonably align to suffixes in a paradigm of that part of speech. The verbal paradigm is further broken down by inflection class: -ar, -er, or -ir. A dot appears in the Deriv column if a significant fraction of the c-suffixes of a scheme model derivational suffixes. The remaining four columns classify the correctness of the row's scheme. The Good column is marked if the c-suffixes of a scheme align to true paradigmatic suffixes, while the three error classes, Left, Right, and Chance, are described in detail later in this section.

The schemes of Figure N highlight the successes and shortcomings of the initial search strategy. First, some successes. Many of the schemes selected during the initial search do model true paradigms. The initially selected schemes in Figure N that correctly capture real paradigms, in part or in total, are the 1st, 2nd, 4th, 5th, 12th, 30th, 40th, 127th, 135th, 400th, 1592nd, and 2000th selected schemes. Additionally, three paradigms of Spanish are perfectly identified: both phonologic inflection classes of plural on nouns, Ø.s and Ø.es, and the adjectival cross-product paradigm of gender and number, a.as.o.os. These three paradigms consist of relatively few, relatively common suffixes. Finally, most true inflectional suffixes are modeled by some scheme selected in the initial search. ParaMor's initial search identifies partial paradigms which, between them, contain 92% of all unique surface suffixes found in the regular inflection classes of Spanish nouns, verbs, and adjectives. This coverage jumps to 98% of the unique suffix strings that occurred at least twice in the Spanish newswire corpus.

[Figure 2: Suffixes of schemes selected by the initial search algorithm over a Spanish corpus of 50,000 types. While some selected schemes contain large numbers of correct suffixes, others are incorrect collections of word-final strings. Columns: Rank; Model of (N, Adj, Verb -ar/-er/-ir, Deriv); Error Type (Good, Left, Right, Chance); C-Suffix Count; C-Suffixes; C-Stem Count; C-Stems. The rows shown are the schemes selected at ranks 1, 2, 3, 4, 5, 10, 11, 12, 20, 30, 40, 100, 127, 135, 200, 300, 400, 1000, 1592, 2000, 3000, 4000, and 5000.]

Figure N also uncovers two broad shortcomings of the initial search procedure. First, while most true inflectional suffixes are modeled by some scheme selected in the initial search, no initially selected scheme comprehensively models all the suffixes of the larger Spanish paradigms. The largest schemes that ParaMor selected from the newswire corpus are the 5th and 12th selected schemes in Figure N. Both schemes contain 15 c-suffixes modeling suffixes from the -ar inflection class of the Spanish verbal paradigm. But the three verbal inflection classes of Spanish have more than 35 suffixes apiece. Extrapolating to an agglutinative language like Turkish, the cross-product of several word-final paradigms may have an effective size of hundreds or thousands of suffixes.

Still, among the schemes selected during the initial search, many faithfully describe significant fractions of legitimate paradigms. The 5th, 12th, and 400th selected schemes contain c-suffixes which clearly model suffixes from the -ar inflection class, but each contains c-suffixes modeling only a subset of that inflection class's suffixes. Some inflectional suffixes appear in two or more of these selected schemes, e.g. a, aba, and ada; others appear in only one, e.g. aban and arse in the 5th selected scheme. Looking beyond the schemes listed in Figure N and focusing on one particular c-suffix, 31 schemes contain the c-suffix ados at the end of this Spanish search run, including the 5th, 12th, and 400th selected schemes shown in the figure. The search paths that identified these 31 schemes each originate from a distinct initial c-suffix: an, en, ación, amos, etc.

Just as an overlapping patchwork of schemes covers the -ar inflection class, a separate patchwork covers the -ir inflection class. Schemes modeling portions of the -ir inflection class include …
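The bookkeeping behind the figures quoted above (adherent c-stems of a scheme, the number of schemes sharing a c-suffix, and suffix coverage) can be illustrated with a minimal Python sketch. This is not ParaMor's implementation: the function names, the toy vocabulary, the `gold` suffix list, and the encoding of the null c-suffix as an empty string are all assumptions made for illustration, and the three sample schemes keep only the unaccented c-suffixes of the 5th, 12th, and 400th selected schemes.

```python
from typing import FrozenSet, Iterable, List, Set

NULL = ""  # assumed encoding of the null c-suffix (Ø)


def adherent_stems(vocabulary: Iterable[str], c_suffixes: Iterable[str]) -> Set[str]:
    """Return every c-stem t such that t + f is a vocabulary word for ALL c-suffixes f."""
    vocab = set(vocabulary)
    suffixes = list(c_suffixes)
    first = suffixes[0]
    # Candidate stems: strip the first c-suffix from each word that ends in it.
    candidates = {
        w[: len(w) - len(first)]
        for w in vocab
        if w.endswith(first) and len(w) > len(first)
    }
    return {t for t in candidates if all(t + f in vocab for f in suffixes)}


def schemes_containing(schemes: List[FrozenSet[str]], c_suffix: str) -> List[FrozenSet[str]]:
    """Return the schemes whose c-suffix set includes the given c-suffix."""
    return [s for s in schemes if c_suffix in s]


def suffix_coverage(schemes: List[FrozenSet[str]], true_suffixes: Iterable[str]) -> float:
    """Fraction of the true suffixes that appear in at least one scheme."""
    gold_set = set(true_suffixes)
    covered = set().union(*schemes) & gold_set
    return len(covered) / len(gold_set)


# Toy vocabulary (hypothetical, not the newswire corpus).
vocab = ["junta", "juntas", "junto", "juntos",
         "barata", "baratas", "barato", "oficina", "oficinas"]

adj_stems = adherent_stems(vocab, ["a", "as", "o", "os"])
plural_stems = adherent_stems(vocab, [NULL, "s"])

# Unaccented c-suffixes only, taken from three partial -ar schemes.
scheme_5 = frozenset({"a", "aba", "aban", "ada", "adas", "ado", "ados",
                      "an", "ando", "ar", "aron", "arse"})
scheme_12 = frozenset({"a", "aba", "ada", "adas", "ado", "ados",
                       "an", "ando", "ar", "aron", "e", "en"})
scheme_400 = frozenset({"a", "aciones", "ados", "an", "ar", "ativas"})
schemes = [scheme_5, scheme_12, scheme_400]

# Hypothetical gold list of -ar suffixes, used only to exercise suffix_coverage.
gold = ["a", "aba", "aban", "ada", "ados", "amos"]

print(sorted(adj_stems))
print(sorted(plural_stems))
print(len(schemes_containing(schemes, "ados")))
print(len(schemes_containing(schemes, "aban")))
print(suffix_coverage(schemes, gold))
```

In this toy setting only junt adheres to all four adjectival c-suffixes, because barat lacks the form baratos in the vocabulary. The three sample schemes overlap on ados, while aban occurs in just one of them, mirroring the overlapping patchwork described above.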

