Unformatted text preview:

1CMSC 723 / LING 645: Intro to Computational LinguisticsSeptember 22, 2004: DorrPorter Stemmer,Intro to Probabilistic NLP and N-grams (chap 6.1-6.3)Prof. Bonnie J. DorrDr. Christof MonzTA: Adam LeeComputational Morphology (continued)The Rules and the Lexicon– General versus Specific– Regular versus Irregular– Accuracy, speed, space– The Morphology of a language Approaches– Lexicon only– Lexicon and Rules• Finite-state Automata• Finite-state Transducers– Rules onlyLexicon-Free Morphology:Porter StemmerLexicon-Free FST ApproachBy Martin Porter (1980)http://www.tartarus.org/%7Emartin/PorterStemmer/Cascade of substitutions given specific conditionsGENERALIZATIONS GENERALIZATIONGENERALIZEGENERALGENERPorter Stemmer Definitions– C = string of one or more consonants, where a consonant is anything other than A E I O U or (Y preceded by C)– V = string of one or more vowels– M = Measure, roughly with number of syllables– Words = (C)*(V*C*)M(V)*• M=0 TR, EE, TREE, Y, BY• M=1 TROUBLE, OATS, TREES, IVY• M=2 TROUBLES, PRIVATE, OATEN, ORRERY Conditions– *S - stem ends with S– *v* - stem contains a V– *d - stem ends with double C, e.g., -TT, -SS – *o - stem ends CVC, where second C is not W, X or Y, e.g., -WIL, HOPPorter StemmerStep 1: Plural Nouns and Third Person Singular VerbsSSES  SS caresses  caressIES  I ponies  ponities  tiSS  SS caress  caressS  cats  cat*<S> = ends with <S>*v* = contains a V*d = ends with double C*o = ends with CVC second C is not W, X or YStep 2a: Verbal Past Tense and Progressive Forms(M>0) EED  EE feed  feed, agreed  agreei (*v*) ED  plastered  plaster, bled  bledii (*v*) ING  motoring  motor, sing  singStep 2b: If 2a.i or 2a.ii is successful, CleanupAT  ATE conflat(ed)  conflateBL  BLE troubl(ed)  troubleIZ  IZE siz(ed)  size(*d and not (*L or *S or *Z)) hopp(ing)  hop, tann(ed)  tan single letter hiss(ing)  hiss, fizz(ed)  fizz (M=1 and *o)  E fail(ing)  fail, fil(ing)  filePorter StemmerStep 3: Y  I(*v*) Y  I happy  happisky  sky*<S> = ends with <S>*v* = contains a V*d = ends with double C*o = ends with CVC second C is not W, X or Y2Porter StemmerStep 4: Derivational Morphology I: Multiple Suffixes(m>0) ATIONAL -> ATE relational -> relate(m>0) TIONAL -> TION conditional -> conditionrational -> rational(m>0) ENCI -> ENCE valenci -> valence(m>0) ANCI -> ANCE hesitanci -> hesitance(m>0) IZER -> IZE digitizer -> digitize(m>0) ABLI -> ABLE conformabli -> conformable(m>0) ALLI -> AL radicalli -> radical(m>0) ENTLI -> ENT differentli -> different(m>0) ELI -> E vileli - > vile(m>0) OUSLI -> OUS analogousli -> analogous(m>0) IZATION -> IZE vietnamization -> vietnamize(m>0) ATION -> ATE predication -> predicate(m>0) ATOR -> ATE operator -> operate(m>0) ALISM -> AL feudalism -> feudal(m>0) IVENESS -> IVE decisiveness -> decisive(m>0) FULNESS -> FUL hopefulness -> hopeful(m>0) OUSNESS -> OUS callousness -> callous(m>0) ALITI -> AL formaliti -> formal(m>0) IVITI -> IVE sensitiviti -> sensitive(m>0) BILITI -> BLE sensibiliti -> sensiblePorter StemmerStep 5: Derivational Morphology II: More Multiple Suffixes(m>0) ICATE -> IC triplicate -> triplic(m>0) ATIVE -> formative -> form(m>0) ALIZE -> AL formalize -> formal(m>0) ICITI -> IC electriciti -> electric(m>0) ICAL -> IC electrical -> electric(m>0) FUL -> hopeful -> hope(m>0) NESS -> goodness -> goodPorter StemmerStep 6: Derivational Morphology III: Single Suffixes(m>1) AL -> revival -> reviv(m>1) ANCE -> allowance -> allow(m>1) ENCE -> inference -> infer(m>1) ER -> airliner -> airlin(m>1) IC -> gyroscopic -> gyroscop(m>1) ABLE -> adjustable -> adjust(m>1) IBLE -> defensible -> defens(m>1) ANT -> irritant -> irrit(m>1) EMENT -> replacement -> replac(m>1) MENT -> adjustment -> adjust(m>1) ENT -> dependent -> depend(m>1 and (*S or *T)) ION -> adoption -> adopt(m>1) OU -> homologou -> homolog(m>1) ISM -> communism -> commun(m>1) ATE -> activate -> activ(m>1) ITI -> angulariti -> angular(m>1) OUS -> homologous -> homolog(m>1) IVE -> effective -> effect(m>1) IZE -> bowdlerize -> bowdler*<S> = ends with <S>*v* = contains a V*d = ends with double C*o = ends with CVC second C is not W, X or YPorter StemmerStep 7a: Cleanup(m>1) E  probate  probatrate  rate(m=1 and not *o) E  cease  ceasStep 7b: More Cleanup(m > 1 and *d and *L) controll  control single letter roll  roll*<S> = ends with <S>*v* = contains a V*d = ends with double C*o = ends with CVC second C is not W, X or YPorter Stemmer Errors of Omission– European Europe– analysis analyzes– matrices matrix– noise noisy– explain explanation Errors of Commission– organization organ– doing doe– generalization generic– numerical numerous– university universeFrom Krovetz ‘93Why (not) Statistics for NLP?Pro– Disambiguation– Error Tolerant– LearnableCon– Not always appropriate– Difficult to debug3Weighted Automata/Transducers Speech recognition: storing a pronunciation lexicon Augmentation of FSA: Each arc is associated with a probabilityPronunciation network for “about”Noisy Channel Probability DefinitionsExperiment (trial)– Repeatable procedure with well-defined possible outcomesSample space– Complete set of outcomesEvent– Any subset of outcomes from sample spaceRandom Variable– Uncertain outcome in a trialMore


View Full Document

UMD CMSC 723 - Intro to Computational Linguistics

Documents in this Course
Lecture 9

Lecture 9

12 pages

Smoothing

Smoothing

15 pages

Load more
Download Intro to Computational Linguistics
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Intro to Computational Linguistics and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Intro to Computational Linguistics 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?