Clustering and Filtering of Initially Selected Schemes

The bottom-up search strategy presented in Chapter 3 is a solid first step toward identifying useful models of productive inflectional paradigms. Figure N provides a look at a range of schemes selected during a typical search run. Each row of Figure N lists a scheme selected while searching over a Spanish newswire corpus of 50,000 types, using the stem ratio metric set at 0.25 (see Chapter N).

On the far left of Figure N, the Rank column states the ordinal rank at which that row’s scheme was selected during the search procedure: the Ø.s scheme was the terminal scheme of ParaMor’s 1st upward search path, a.as.o.os the 2nd, ido.idos.ir.iré the 1592nd, etc. The right four columns of Figure N present raw data on the selected schemes: the number of c-suffixes in the scheme, the c-suffixes themselves, the number of adherent c-stems of the scheme, and a sample of those c-stems.

Between the rank on the left and the scheme details on the right are columns that categorize each scheme by its success or failure in modeling a true paradigm of Spanish. A dot appears in the columns marked N, Adj, or Verb if the c-suffixes in a row’s scheme reasonably align to suffixes in a paradigm of that part of speech. The verbal paradigm is further broken down by inflection class: ar, er, or ir. A dot appears in the Deriv column if a significant fraction of the c-suffixes of a scheme model derivational suffixes. The remaining four columns classify the correctness of the row’s scheme. The Good column is marked if the c-suffixes of a scheme align to true paradigmatic suffixes, while the three error classes, Left, Right, and Chance, are described in detail later in this section.

The schemes of Figure N highlight the successes and shortcomings of the initial search strategy. First, some successes: many of the schemes selected during the initial search do model true paradigms.
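To make the scheme notation in Figure N concrete, here is a minimal Python sketch of how the adherent c-stems of a scheme could be computed from a vocabulary. It assumes that a c-stem adheres to a scheme when it forms a vocabulary word with every c-suffix of that scheme; this section does not spell out that definition, so treat it as an assumption. The function names and the toy vocabulary are hypothetical and are not taken from ParaMor.

```python
from typing import Iterable, List, Set

NULL_SUFFIX = "Ø"  # symbol used in the figure for the empty c-suffix


def parse_scheme(scheme: str) -> List[str]:
    """Split the dotted notation, e.g. 'a.as.o.os', into its c-suffixes."""
    return scheme.split(".")


def adherent_c_stems(vocabulary: Iterable[str], c_suffixes: Iterable[str]) -> Set[str]:
    """Return the c-stems that form a vocabulary word with EVERY c-suffix.

    Assumption: this is the adherence criterion; it is not quoted from the text.
    """
    words = set(vocabulary)
    # Map the null-suffix symbol to the empty string.
    suffixes = ["" if s == NULL_SUFFIX else s for s in c_suffixes]
    if not suffixes:
        return set()

    # Candidate stems: strip the first c-suffix from every word that ends in it.
    first = suffixes[0]
    candidates = {
        w[: len(w) - len(first)]
        for w in words
        if w.endswith(first) and len(w) > len(first)
    }
    # Keep only the candidates that combine with all c-suffixes.
    return {stem for stem in candidates if all(stem + s in words for s in suffixes)}


# Hypothetical toy vocabulary (not the 50,000-type newswire corpus).
TOY_VOCABULARY = [
    "junta", "juntas", "junto", "juntos",
    "próxima", "próximas", "próximo", "próximos",
    "apoyada", "apoyadas",
]

if __name__ == "__main__":
    print(sorted(adherent_c_stems(TOY_VOCABULARY, parse_scheme("a.as.o.os"))))
    print(sorted(adherent_c_stems(TOY_VOCABULARY, parse_scheme("Ø.s"))))
```

In this toy vocabulary, apoyad does not adhere to a.as.o.os because apoyado and apoyados are missing, while apoyada still adheres to Ø.s. The same surface words can therefore support several overlapping schemes, as with the 1st and 2nd rows of the figure.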
Initially selected schemes in Figure N that correctly capture real paradigms, in part or in total, are the 1st, 2nd, 4th, 5th, 12th, 30th, 40th, 127th, 135th, 400th, 1592nd, and 2000th selected schemes. Additionally, three paradigms of Spanish are perfectly identified: both phonologic inflection classes of plural on nouns, Ø.s and Ø.es, and the adjectival cross-product paradigm of gender and number, a.as.o.os. These three paradigms consist of relatively few, relatively common suffixes. Finally, most true inflectional suffixes are modeled by some scheme selected in the initial search. ParaMor’s initial search identifies partial paradigms which, between them, contain 92% of all unique surface suffixes found in the regular inflection classes of Spanish nouns, verbs, and adjectives. This coverage jumps to 98% of unique suffix strings that occurred at least twice in the Spanish newswire corpus.

Column groups in the original figure: Rank; Model of (N, Adj, Verb: ar, er, ir, Deriv); Error Type (Good, Left, Right, Chance); C-Suffix Count; C-Suffixes; C-Stem Count; C-Stems. The dots below are the marks in the Model of and Error Type columns; their column positions did not survive in this text.

Rank | Marks | C-Suffix Count | C-Suffixes | C-Stem Count | C-Stems
1 | ● ● ● | 2 | Ø.s | 5501 | apoyada, barata, hombro, oficina, reo …
2 | ● ● | 4 | a.as.o.os | 892 | apoyad, captad, dirigid, junt, próxim, …
3 | ● ● | 15 | Ø.ba.ban.da.das.do.dos.n.ndo.r.ra.ron.rse.rá.rán | 17 | apoya, disputa, lanza, lleva, toma, …
4 | ● ● ● | 2 | Ø.es | 847 | emir, inseguridad, orador, pu, ramon, …
5 | ● ● | 15 | a.aba.aban.ada.adas.ado.ados.an.ando.ar.aron.arse.ará.arán.ó | 25 | apoy, desarroll, disput, lanz, llev, …
10 | ● ● | 5 | ta.tamente.tas.to.tos | 22 | cier, direc, insóli, modes, sangrien, …
11 | ● ● | 14 | Ø.ba.ción.da.das.do.dos.n.ndo.r.ron.rá.rán.ría | 16 | acepta, celebra, declara, fija, marca, …
12 | ● ● | 15 | a.aba.ada.adas.ado.ados.an.ando.ar.aron.ará.arán.e.en.ó | 21 | apoy, declar, enfrent, llev, tom, …
20 | ● ● ● | 6 | Ø.l.ra.ras.ro.ros | 8 | a, ca, coste, e, ente, gi, o, pu
30 | ● ● ● | 11 | e.en.ida.idas.ido.idos.iendo.ieron.ió.ía.ían | 16 | cumpl, escond, recib, transmit, vend, …
40 | ● ● ● ● | 7 | Ø.es.idad.ización.izado.izar.mente | 8 | actual, final, natural, penal, regular, …
100 | ● | 8 | Ø.a.an.e.ea.en.i.ino | 9 | al, c, ch, d, g, p, s, t, v
127 | ● ● | 9 | e.en.er.erá.ería.ido.ieron.ió.ía | 11 | ced, deb, ofrec, pertenec, suspend, …
135 | ● ● ● | 10 | a.e.en.ida.ido.iendo.iera.ieron.ió.ía | 12 | assist, cumpl, ocurr, permit, reun, un, …
200 | ● ● | 4 | tal.tales.te.tes | 5 | acciden, ambien, continen, le, pos
300 | ● ● | 4 | o.os.ual.uales | 7 | act, cas, concept, event, grad, man, us
400 | ● ● ● | 8 | a.aciones.ación.ados.an.ar.ativas.ó | 10 | administr, conmemor, estim, investig, …
1000 | ● ● | 8 | ce.cen.cer.cerán.cido.cieron.ció.cía | 9 | apare, estable, ofre, pertene, ven, …
1592 | ● ● | 4 | ido.idos.ir.iré | 6 | conclu, cumpl, distribu, exclu, reun, …
2000 | ● ● ● | 4 | e.en.ieron.iesen | 5 | aparec, crec, impid, invad, pud
3000 | ● | 2 | Ø.zano | 3 | li, lo, man
4000 | ● | 2 | icho.io | 4 | b, capr, d, pred
5000 | ● ● | 2 | egará.ega | 3 | desp, entr, ll
… | … | … | … | … | …

Figure 2: Suffixes of schemes selected by the initial search algorithm over a Spanish corpus of 50,000 types. While some selected schemes contain large numbers of correct suffixes, others are incorrect collections of word-final strings.

Figure N also uncovers two broad shortcomings of the initial search procedure. First, while most true inflectional suffixes are modeled by some scheme selected in the initial search, no initially selected scheme comprehensively models all the suffixes of the larger Spanish paradigms. The largest schemes that ParaMor selected from the newswire corpus are the 5th and 12th selected schemes in Figure N. Both schemes contain 15 c-suffixes modeling suffixes from the ar inflection class of the Spanish verbal paradigm. But the three verbal inflection classes of Spanish have more than 35 suffixes apiece. Extrapolating to an agglutinative language like Turkish, the cross-product of several word-final paradigms may have an effective size of hundreds or thousands of

