1CMSC 723 / LING 645: Intro to Computational LinguisticsSeptember 22, 2004: DorrPorter Stemmer,Intro to Probabilistic NLP and N-grams (chap 6.1-6.3)Prof. Bonnie J. DorrDr. Christof MonzTA: Adam LeeComputational Morphology (continued)The Rules and the Lexicon– General versus Specific– Regular versus Irregular– Accuracy, speed, space– The Morphology of a language Approaches– Lexicon only– Lexicon and Rules• Finite-state Automata• Finite-state Transducers– Rules onlyLexicon-Free Morphology:Porter StemmerLexicon-Free FST ApproachBy Martin Porter (1980)http://www.tartarus.org/%7Emartin/PorterStemmer/Cascade of substitutions given specific conditionsGENERALIZATIONS GENERALIZATIONGENERALIZEGENERALGENERPorter Stemmer Definitions– C = string of one or more consonants, where a consonant is anything other than A E I O U or (Y preceded by C)– V = string of one or more vowels– M = Measure, roughly with number of syllables– Words = (C)*(V*C*)M(V)*• M=0 TR, EE, TREE, Y, BY• M=1 TROUBLE, OATS, TREES, IVY• M=2 TROUBLES, PRIVATE, OATEN, ORRERY Conditions– *S - stem ends with S– *v* - stem contains a V– *d - stem ends with double C, e.g., -TT, -SS – *o - stem ends CVC, where second C is not W, X or Y, e.g., -WIL, HOPPorter StemmerStep 1: Plural Nouns and Third Person Singular VerbsSSES SS caresses caressIES I ponies ponities tiSS SS caress caressS cats cat*<S> = ends with <S>*v* = contains a V*d = ends with double C*o = ends with CVC second C is not W, X or YStep 2a: Verbal Past Tense and Progressive Forms(M>0) EED EE feed feed, agreed agreei (*v*) ED plastered plaster, bled bledii (*v*) ING motoring motor, sing singStep 2b: If 2a.i or 2a.ii is successful, CleanupAT ATE conflat(ed) conflateBL BLE troubl(ed) troubleIZ IZE siz(ed) size(*d and not (*L or *S or *Z)) hopp(ing) hop, tann(ed) tan single letter hiss(ing) hiss, fizz(ed) fizz (M=1 and *o) E fail(ing) fail, fil(ing) filePorter StemmerStep 3: Y I(*v*) Y I happy happisky sky*<S> = ends with <S>*v* = contains a V*d = ends with double C*o = ends with CVC second C is not W, X or Y2Porter StemmerStep 4: Derivational Morphology I: Multiple Suffixes(m>0) ATIONAL -> ATE relational -> relate(m>0) TIONAL -> TION conditional -> conditionrational -> rational(m>0) ENCI -> ENCE valenci -> valence(m>0) ANCI -> ANCE hesitanci -> hesitance(m>0) IZER -> IZE digitizer -> digitize(m>0) ABLI -> ABLE conformabli -> conformable(m>0) ALLI -> AL radicalli -> radical(m>0) ENTLI -> ENT differentli -> different(m>0) ELI -> E vileli - > vile(m>0) OUSLI -> OUS analogousli -> analogous(m>0) IZATION -> IZE vietnamization -> vietnamize(m>0) ATION -> ATE predication -> predicate(m>0) ATOR -> ATE operator -> operate(m>0) ALISM -> AL feudalism -> feudal(m>0) IVENESS -> IVE decisiveness -> decisive(m>0) FULNESS -> FUL hopefulness -> hopeful(m>0) OUSNESS -> OUS callousness -> callous(m>0) ALITI -> AL formaliti -> formal(m>0) IVITI -> IVE sensitiviti -> sensitive(m>0) BILITI -> BLE sensibiliti -> sensiblePorter StemmerStep 5: Derivational Morphology II: More Multiple Suffixes(m>0) ICATE -> IC triplicate -> triplic(m>0) ATIVE -> formative -> form(m>0) ALIZE -> AL formalize -> formal(m>0) ICITI -> IC electriciti -> electric(m>0) ICAL -> IC electrical -> electric(m>0) FUL -> hopeful -> hope(m>0) NESS -> goodness -> goodPorter StemmerStep 6: Derivational Morphology III: Single Suffixes(m>1) AL -> revival -> reviv(m>1) ANCE -> allowance -> allow(m>1) ENCE -> inference -> infer(m>1) ER -> airliner -> airlin(m>1) IC -> gyroscopic -> gyroscop(m>1) ABLE -> adjustable -> adjust(m>1) IBLE -> defensible -> defens(m>1) ANT -> irritant -> irrit(m>1) EMENT -> replacement -> replac(m>1) MENT -> adjustment -> adjust(m>1) ENT -> dependent -> depend(m>1 and (*S or *T)) ION -> adoption -> adopt(m>1) OU -> homologou -> homolog(m>1) ISM -> communism -> commun(m>1) ATE -> activate -> activ(m>1) ITI -> angulariti -> angular(m>1) OUS -> homologous -> homolog(m>1) IVE -> effective -> effect(m>1) IZE -> bowdlerize -> bowdler*<S> = ends with <S>*v* = contains a V*d = ends with double C*o = ends with CVC second C is not W, X or YPorter StemmerStep 7a: Cleanup(m>1) E probate probatrate rate(m=1 and not *o) E cease ceasStep 7b: More Cleanup(m > 1 and *d and *L) controll control single letter roll roll*<S> = ends with <S>*v* = contains a V*d = ends with double C*o = ends with CVC second C is not W, X or YPorter Stemmer Errors of Omission– European Europe– analysis analyzes– matrices matrix– noise noisy– explain explanation Errors of Commission– organization organ– doing doe– generalization generic– numerical numerous– university universeFrom Krovetz ‘93Why (not) Statistics for NLP?Pro– Disambiguation– Error Tolerant– LearnableCon– Not always appropriate– Difficult to debug3Weighted Automata/Transducers Speech recognition: storing a pronunciation lexicon Augmentation of FSA: Each arc is associated with a probabilityPronunciation network for “about”Noisy Channel Probability DefinitionsExperiment (trial)– Repeatable procedure with well-defined possible outcomesSample space– Complete set of outcomesEvent– Any subset of outcomes from sample spaceRandom Variable– Uncertain outcome in a trialMore
View Full Document