CMSC 723 LING 645 Intro to Computational Linguistics September 22 2004 Dorr Porter Stemmer Intro to Probabilistic NLP and N grams chap 6 1 6 3 Prof Bonnie J Dorr Dr Christof Monz TA Adam Lee Computational Morphology continued The Rules and the Lexicon General versus Specific Regular versus Irregular Accuracy speed space The Morphology of a language Approaches Lexicon only Lexicon and Rules Finite state Automata Finite state Transducers Rules only Lexicon Free Morphology Porter Stemmer Lexicon Free FST Approach By Martin Porter 1980 http www tartarus org 7Emartin PorterStemmer Cascade of substitutions given specific conditions GENERALIZATIONS GENERALIZATION GENERALIZE GENERAL GENER Porter Stemmer Definitions C string of one or more consonants where a consonant is anything other than A E I O U or Y preceded by C V string of one or more vowels M Measure roughly with number of syllables Words C V C M V M 0 TR EE TREE Y BY M 1 TROUBLE OATS TREES IVY M 2 TROUBLES PRIVATE OATEN ORRERY Conditions S stem ends with S v stem contains a V d stem ends with double C e g TT SS o stem ends CVC where second C is not W X or Y e g WIL HOP S ends with S v contains a V Porter Stemmer d ends with double C o ends with CVC second C is not W X or Y Step 1 Plural Nouns and Third Person Singular Verbs SSES SS IES I caresses ponies ties caress cats SS SS S caress poni ti caress cat Step 2a Verbal Past Tense and Progressive Forms M 0 EED EE feed feed agreed agree i v ED ii v ING plastered motoring plaster bled motor sing sing Step 2b If 2a i or 2a ii is successful Cleanup AT ATE conflat ed conflate BL BLE troubl ed trouble IZ IZE siz ed size d and not L or S or Z single letter M 1 and o E hopp ing hop tann ed tan hiss ing hiss fizz ed fizz fail ing fail fil ing file S ends with S Porter Stemmer v contains a V d ends with double C o ends with CVC second C is not W X or Y Step 3 Y I v Y I happy happi sky sky Porter Stemmer Step 4 Derivational Morphology I Multiple Suffixes m 0 ATIONAL m 0 TIONAL ATE TION m 0 m 0 m 0 m 0 m 0 m 0 m 0 m 0 m 0 m 0 m 0 m 0 m 0 m 0 m 0 m 0 m 0 m 0 ENCE ANCE IZE ABLE AL ENT E OUS IZE ATE ATE AL IVE FUL OUS AL IVE BLE ENCI ANCI IZER ABLI ALLI ENTLI ELI OUSLI IZATION ATION ATOR ALISM IVENESS FULNESS OUSNESS ALITI IVITI BILITI relational conditional rational valenci hesitanci digitizer conformabli radicalli differentli vileli analogousli vietnamization predication operator feudalism decisiveness hopefulness callousness formaliti sensitiviti sensibiliti relate condition rational valence hesitance digitize conformable radical different vile analogous vietnamize predicate operate feudal decisive hopeful callous formal sensitive sensible Porter Stemmer Step 5 Derivational Morphology II More Multiple Suffixes m 0 m 0 m 0 m 0 m 0 m 0 m 0 ICATE ATIVE ALIZE ICITI ICAL FUL NESS IC AL IC IC triplicate formative formalize electriciti electrical hopeful goodness triplic form formal electric electric hope good Porter Stemmer S ends with S v contains a V d ends with double C o ends with CVC second C is not W X or Y Step 6 Derivational Morphology III Single Suffixes m 1 AL m 1 ANCE m 1 ENCE m 1 ER m 1 IC m 1 ABLE m 1 IBLE m 1 ANT m 1 EMENT m 1 MENT m 1 ENT m 1 and S or T ION m 1 OU m 1 ISM m 1 ATE m 1 ITI m 1 OUS m 1 IVE m 1 IZE revival allowance inference airliner gyroscopic adjustable defensible irritant replacement adjustment dependent adoption homologou communism activate angulariti homologous effective bowdlerize reviv allow infer airlin gyroscop adjust defens irrit replac adjust depend adopt homolog commun activ angular homolog effect bowdler Porter Stemmer S ends with S v contains a V d ends with double C o ends with CVC second C is not W X or Y Step 7a Cleanup m 1 E m 1 and not o E probate probat rate rate cease ceas Step 7b More Cleanup m 1 and d and L single letter controll control roll roll Porter Stemmer Errors of Omission European analysis matrices noise explain Europe analyzes matrix noisy explanation Errors of Commission organization doing generalization numerical university From Krovetz 93 organ doe generic numerous universe Why not Statistics for NLP Pro Disambiguation Error Tolerant Learnable Con Not always appropriate Difficult to debug Weighted Automata Transducers Speech recognition storing a pronunciation lexicon Augmentation of FSA Each arc is associated with a probability Pronunciation network for about Noisy Channel Probability Definitions Experiment trial Repeatable procedure with well defined possible outcomes Sample space Complete set of outcomes Event Any subset of outcomes from sample space Random Variable Uncertain outcome in a trial More Definitions Probability How likely is it to get a particular outcome Rate of getting that outcome in all trials Probability of drawing a spade from 52 well shuffled playing cards Distribution Probabilities associated with each outcome a random variable can take Each outcome has probability between 0 and 1 The sum of all outcome probabilities is 1 Conditional Probability What is P A B First what is P A P It is raining 06 Now what about P A B P It is raining It was clear 10 minutes ago 004 P A B P A B P B A A B B Note P A B P A B P B Also P A B P B A Independence What is P A B if A and B are independent P A B P A P B iff A B independent P heads tails P heads P tails 5 5 25 P doctor blue eyes P doctor P blue eyes 01 2 002 What if A B independent P A B P A iff A B independent Also P B A P B iff A B independent Bayes Theorem P A B P B P B A P A Swap the order of dependence Sometimes easier to estimate one kind of dependence than the other What does this have to do with the Noisy Channel Model O H P H P O H Best H Best H argmax P H O argmax P O H P H H H P O likelihood prior Noisy Channel Applied to Word Recognition argmaxw P w O argmaxw P O w P w Simplifying assumptions pronunciation string correct word boundaries known Problem Given n iy what is correct dictionary word What do we need ni knee neat need new What is the most likely word given ni Compute prior P w Word freq w P w new 2625 001 neat 338 00013 need 1417 00056 knee 61 000024 Now compute likelihood P ni w then multiply Word P …
View Full Document
Unlocking...