Unformatted text preview:

CMSC 723 LING 645 Intro to Computational Linguistics Computational Morphology continued The Rules and the Lexicon September 22 2004 Dorr Porter Stemmer Intro to Probabilistic NLP and N grams chap 6 1 6 3 General versus Specific Regular versus Irregular Accuracy speed space The Morphology of a language Approaches Lexicon only Lexicon and Rules Prof Bonnie J Dorr Dr Christof Monz TA Adam Lee Finite state Automata Finite state Transducers Rules only Lexicon Free Morphology Porter Stemmer Porter Stemmer Definitions Lexicon Free FST Approach By Martin Porter 1980 http www tartarus org 7Emartin PorterStemmer Cascade of substitutions given specific conditions C string of one or more consonants where a consonant is anything other than A E I O U or Y preceded by C V string of one or more vowels M Measure roughly with number of syllables Words C V C M V GENERALIZATIONS GENERALIZATION GENERALIZE GENERAL GENER M 0 TR EE TREE Y BY M 1 TROUBLE OATS TREES IVY M 2 TROUBLES PRIVATE OATEN ORRERY Conditions Porter Stemmer S stem ends with S v stem contains a V d stem ends with double C e g TT SS o stem ends CVC where second C is not W X or Y e g WIL HOP S ends with S S ends with S v contains a V v contains a V d ends with double C Porter Stemmer d ends with double C o ends with CVC o ends with CVC second C is not W X or Y second C is not W X or Y Step 1 Plural Nouns and Third Person Singular Verbs SSES IES SS S SS caresses ponies ties caress cats I SS Step 3 Y caress poni ti caress cat v Y I I happy sky happi sky Step 2a Verbal Past Tense and Progressive Forms M 0 EED EE feed feed agreed i v ED plastered ii v ING motoring agree plaster bled motor sing bled sing Step 2b If 2a i or 2a ii is successful Cleanup AT ATE conflat ed BL BLE troubl ed IZ IZE siz ed d and not L or S or Z M 1 and o hopp ing conflate trouble size hop tann ed tan single letter hiss ing hiss fizz ed fizz E fail ing fail fil ing file 1 Porter Stemmer Porter Stemmer Step 4 Derivational Morphology I Multiple Suffixes m 0 ATIONAL m 0 TIONAL ATE TION m 0 m 0 m 0 m 0 m 0 m 0 m 0 m 0 m 0 m 0 m 0 m 0 m 0 m 0 m 0 m 0 m 0 m 0 ENCE ANCE IZE ABLE AL ENT E OUS IZE ATE ATE AL IVE FUL OUS AL IVE BLE ENCI ANCI IZER ABLI ALLI ENTLI ELI OUSLI IZATION ATION ATOR ALISM IVENESS FULNESS OUSNESS ALITI IVITI BILITI relational conditional rational valenci hesitanci digitizer conformabli radicalli differentli vileli analogousli vietnamization predication operator feudalism decisiveness hopefulness callousness formaliti sensitiviti sensibiliti relate condition rational valence hesitance digitize conformable radical different vile analogous vietnamize predicate operate feudal decisive hopeful callous formal sensitive sensible Step 5 Derivational Morphology II More Multiple Suffixes m 0 m 0 m 0 m 0 m 0 m 0 m 0 ICATE ATIVE ALIZE ICITI ICAL FUL NESS IC AL IC IC triplicate formative formalize electriciti electrical hopeful goodness S ends with S Porter Stemmer v contains a V S ends with S Porter Stemmer v contains a V d ends with double C d ends with double C o ends with CVC o ends with CVC second C is not W X or Y Step 6 Derivational Morphology III Single Suffixes m 1 AL m 1 ANCE m 1 ENCE m 1 ER m 1 IC m 1 ABLE m 1 IBLE m 1 ANT m 1 EMENT m 1 MENT m 1 ENT m 1 and S or T ION m 1 OU m 1 ISM m 1 ATE m 1 ITI m 1 OUS m 1 IVE m 1 IZE revival allowance inference airliner gyroscopic adjustable defensible irritant replacement adjustment dependent adoption homologou communism activate angulariti homologous effective bowdlerize Porter Stemmer Errors of Omission European analysis matrices noise explain Europe analyzes matrix noisy explanation Errors of Commission organization doing generalization numerical university organ doe generic numerous universe triplic form formal electric electric hope good reviv allow infer airlin gyroscop adjust defens irrit replac adjust depend adopt homolog commun activ angular homolog effect bowdler second C is not W X or Y Step 7a Cleanup m 1 E m 1 and not o E probate rate cease probat rate ceas Step 7b More Cleanup m 1 and d and L single letter controll control roll roll Why not Statistics for NLP Pro Disambiguation Error Tolerant Learnable Con Not always appropriate Difficult to debug From Krovetz 93 2 Weighted Automata Transducers Pronunciation network for about Speech recognition storing a pronunciation lexicon Augmentation of FSA Each arc is associated with a probability Noisy Channel Probability Definitions Experiment trial Repeatable procedure with well defined possible outcomes Sample space Complete set of outcomes Event Any subset of outcomes from sample space Random Variable Uncertain outcome in a trial More Definitions Conditional Probability Probability What is P A B First what is P A How likely is it to get a particular outcome Rate of getting that outcome in all trials P It is raining 06 Now what about P A B Probability of drawing a spade from 52 well shuffled playing cards Distribution Probabilities associated with each outcome a random variable can take Each outcome has probability between 0 and 1 The sum of all outcome probabilities is 1 P It is raining It was clear 10 minutes ago 004 P A B P A B P B A A B B Note P A B P A B P B Also P A B P B A 3 Independence Bayes Theorem What is P A B if A and B are independent P B A P A B P A P B iff A B independent P heads tails P heads P tails 5 5 25 P doctor blue eyes P doctor P blue eyes 01 2 002 What if A B independent P A B P A iff A B independent Also P B A P B iff A B independent What does this have to do with the Noisy Channel Model P A B P B P A Swap the order of dependence Sometimes easier to estimate one kind of dependence than the other Noisy Channel Applied to Word Recognition argmaxw P w O argmaxw P O w P w Simplifying assumptions O H P H P O H Best H pronunciation string correct word boundaries known Problem Best H argmax P H O argmax P O H P H H H P O likelihood prior Given n iy what is correct dictionary word What do we need ni knee neat need new What is the most likely word given ni Compute prior P w Why N grams Compute likelihood P ni w then multiply Word freq w P w Word P O w P w P O …


View Full Document

UMD CMSC 723 - Intro to Computational Linguistics

Documents in this Course
Lecture 9

Lecture 9

12 pages

Smoothing

Smoothing

15 pages

Load more
Loading Unlocking...
Login

Join to view Intro to Computational Linguistics and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Intro to Computational Linguistics and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?