CMSC 723 / LING 645: Intro to Computational LinguisticsComputational Morphology (continued)Lexicon-Free Morphology: Porter StemmerPorter StemmerSlide 5Slide 6Slide 7Slide 8Slide 9Slide 10Slide 11Why (not) Statistics for NLP?Weighted Automata/TransducersPronunciation network for “about”Noisy ChannelProbability DefinitionsMore DefinitionsConditional ProbabilityIndependenceBayes TheoremWhat does this have to do with the Noisy Channel Model?Noisy Channel Applied to Word RecognitionWhat is the most likely word given [ni]?Why N-grams?Next Word Prediction [borrowed from J. Hirschberg]Next Word Prediction (cont)Human Word PredictionClaimWhy would we want to do this?Why is this useful?Handwriting RecognitionReal Word Spelling ErrorsFor Spell CheckersLanguage ModelWord Prediction: Simple vs. SmartChain RuleRelative Frequencies and Conditional ProbabilitiesFor a Word StringMarkov AssumptionCounting Words in CorporaCorporaTraining and TestingTerminologySimple N-GramsUsing N-GramsA Bigram Grammar Fragment from BERPAdditional BERP GrammarComputing Sentence ProbabilityBERP Bigram CountsBERP Bigram Probabilities: Use Unigram CountLearning a Bigram GrammarWhat do we learn about the language?Readings for next timeCMSC 723 / LING 645: Intro to Computational LinguisticsSeptember 22, 2004: DorrPorter Stemmer,Intro to Probabilistic NLP and N-grams (chap 6.1-6.3)Prof. Bonnie J. DorrDr. Christof MonzTA: Adam LeeComputational Morphology (continued)The Rules and the Lexicon–General versus Specific–Regular versus Irregular–Accuracy, speed, space–The Morphology of a language Approaches–Lexicon only–Lexicon and Rules•Finite-state Automata•Finite-state Transducers–Rules onlyLexicon-Free Morphology:Porter StemmerLexicon-Free FST ApproachBy Martin Porter (1980)http://www.tartarus.org/%7Emartin/PorterStemmer/ Cascade of substitutions given specific conditionsGENERALIZATIONS GENERALIZATIONGENERALIZEGENERALGENERPorter StemmerDefinitions–C = string of one or more consonants, where a consonant is anything other than A E I O U or (Y preceded by C)–V = string of one or more vowels–M = Measure, roughly with number of syllables–Words = (C)*(V*C*)M(V)*• M=0 TR, EE, TREE, Y, BY• M=1 TROUBLE, OATS, TREES, IVY• M=2 TROUBLES, PRIVATE, OATEN, ORRERYConditions–*S - stem ends with S–*v* - stem contains a V–*d - stem ends with double C, e.g., -TT, -SS –*o - stem ends CVC, where second C is not W, X or Y, e.g., -WIL, HOPPorter StemmerStep 1: Plural Nouns and Third Person Singular VerbsSSES SS caresses caressIES I ponies poni ties tiSS SS caress caressS cats cat*<S> = ends with <S> *v* = contains a V*d = ends with double C*o = ends with CVC second C is not W, X or YStep 2a: Verbal Past Tense and Progressive Forms(M>0) EED EE feed feed, agreed agreei (*v*) ED plastered plaster, bled bledii (*v*) ING motoring motor, sing singStep 2b: If 2a.i or 2a.ii is successful, Cleanup AT ATE conflat(ed) conflate BL BLE troubl(ed) trouble IZ IZE siz(ed) size(*d and not (*L or *S or *Z)) hopp(ing) hop, tann(ed) tan single letter hiss(ing) hiss, fizz(ed) fizz (M=1 and *o) E fail(ing) fail, fil(ing) filePorter StemmerStep 3: Y I(*v*) Y I happy happi sky sky*<S> = ends with <S> *v* = contains a V*d = ends with double C*o = ends with CVC second C is not W, X or YPorter StemmerStep 4: Derivational Morphology I: Multiple Suffixes (m>0) ATIONAL -> ATE relational -> relate (m>0) TIONAL -> TION conditional -> condition rational -> rational (m>0) ENCI -> ENCE valenci -> valence (m>0) ANCI -> ANCE hesitanci -> hesitance (m>0) IZER -> IZE digitizer -> digitize (m>0) ABLI -> ABLE conformabli -> conformable (m>0) ALLI -> AL radicalli -> radical (m>0) ENTLI -> ENT differentli -> different (m>0) ELI -> E vileli - > vile (m>0) OUSLI -> OUS analogousli -> analogous (m>0) IZATION -> IZE vietnamization -> vietnamize (m>0) ATION -> ATE predication -> predicate (m>0) ATOR -> ATE operator -> operate (m>0) ALISM -> AL feudalism -> feudal (m>0) IVENESS -> IVE decisiveness -> decisive (m>0) FULNESS -> FUL hopefulness -> hopeful (m>0) OUSNESS -> OUS callousness -> callous (m>0) ALITI -> AL formaliti -> formal (m>0) IVITI -> IVE sensitiviti -> sensitive (m>0) BILITI -> BLE sensibiliti -> sensiblePorter StemmerStep 5: Derivational Morphology II: More Multiple Suffixes (m>0) ICATE -> IC triplicate -> triplic (m>0) ATIVE -> formative -> form (m>0) ALIZE -> AL formalize -> formal (m>0) ICITI -> IC electriciti -> electric (m>0) ICAL -> IC electrical -> electric (m>0) FUL -> hopeful -> hope (m>0) NESS -> goodness -> goodPorter StemmerStep 6: Derivational Morphology III: Single Suffixes (m>1) AL -> revival -> reviv (m>1) ANCE -> allowance -> allow (m>1) ENCE -> inference -> infer (m>1) ER -> airliner -> airlin (m>1) IC -> gyroscopic -> gyroscop (m>1) ABLE -> adjustable -> adjust (m>1) IBLE -> defensible -> defens (m>1) ANT -> irritant -> irrit (m>1) EMENT -> replacement -> replac (m>1) MENT -> adjustment -> adjust (m>1) ENT -> dependent -> depend (m>1 and (*S or *T)) ION -> adoption -> adopt (m>1) OU -> homologou -> homolog (m>1) ISM -> communism -> commun (m>1) ATE -> activate -> activ (m>1) ITI -> angulariti -> angular (m>1) OUS -> homologous -> homolog (m>1) IVE -> effective -> effect (m>1)
View Full Document