Unformatted text preview:

Spelling Correction and the Noisy Channel The Spelling Correc on Task Dan Jurafsky Applica ons for spelling correc on Word processing Phones Web search 2 Dan Jurafsky Spelling Tasks Spelling Error Detec on Spelling Error Correc on Autocorrect hte the Suggest a correc on Sugges on lists 3 Dan Jurafsky Types of spelling errors Non word Errors gra e gira e Real word Errors Typographical errors three there Cogni ve Errors homophones piece peace too two 4 Dan Jurafsky Rates of spelling errors 26 Web queries Wang et al 2003 13 Retyping no backspace Whitelaw et al English German 7 Words corrected retyping on phone sized organizer 2 Words uncorrected on organizer Soukore MacKenzie 2003 1 2 Retyping Kane and Wobbrock 2007 Gruden et al 1983 5 Dan Jurafsky Non word spelling errors Non word spelling error detec on Any word not in a dic onary is an error The larger the dic onary the be er Non word spelling error correc on Generate candidates real words that are similar to error Choose the one which is best Shortest weighted edit distance Highest noisy channel probability 6 Dan Jurafsky Real word spelling errors For each word w generate candidate set Find candidate words with similar pronuncia ons Find candidate words with similar spelling Include w in candidate set Choose best candidate Noisy Channel Classi er 7 Spelling Correction and the Noisy Channel The Spelling Correc on Task Spelling Correction and the Noisy Channel The Noisy Channel Model of Spelling Dan Jurafsky Noisy Channel Intui on 10 Dan Jurafsky Noisy Channel We see an observa on x of a misspelled word Find the correct word w w argmax P w x w V P x w P w argmax w V P x argmax P x w P w 11 w V Dan Jurafsky History Noisy channel for spelling proposed around 1990 IBM Mays Eric Fred J Damerau and Robert L Mercer 1991 Context based spelling correc on Informa4on Processing and Management 23 5 517 522 AT T Bell Labs Kernighan Mark D Kenneth W Church and William A Gale 1990 A spelling correc on program based on a noisy channel model Proceedings of COLING 1990 205 210 Dan Jurafsky Non word spelling error example acress 13 Dan Jurafsky Candidate genera on Words with similar spelling Small edit distance to error Words with similar pronuncia on Small edit distance of pronuncia on to error 14 Dan Jurafsky Damerau Levenshtein edit distance Minimal edit distance between two strings where edits are Inser on Dele on Subs tu on Transposi on of two adjacent le ers 15 Dan Jurafsky Words within 1 of acress 16 Error Candidate Correct Error Type Correc on LeRer LeRer acress actress t dele on acress cress a inser on acress caress ca ac transposi on acress access c r subs tu on acress across o e subs tu on acress acres s inser on acress acres s inser on Dan Jurafsky Candidate genera on 80 of errors are within edit distance 1 Almost all errors within edit distance 2 Also allow inser on of space or hyphen thisidea this idea inlaw in law 17 Dan Jurafsky Language Model Use any of the language modeling algorithms we ve learned Unigram bigram trigram Web scale spelling correc on Stupid backo 18 Dan Jurafsky Unigram Prior probability Counts from 404 253 213 words in Corpus of Contemporary English COCA word actress P word 9 321 0000230573 cress 220 0000005442 caress 686 0000016969 access 37 038 0000916207 across 120 844 0002989314 acres 19 Frequency of word 12 874 0000318463 Dan Jurafsky Channel model probability Error model probability Edit probability Kernighan Church Gale 1990 Misspelled word x x1 x2 x3 xm Correct word w w1 w2 w3 wn P x w probability of the edit 20 dele on inser on subs tu on transposi on Dan Jurafsky Compu ng error probability confusion matrix del x y count xy typed as x ins x y count x typed as xy sub x y count x typed as y trans x y count xy typed as yx Inser on and dele on condi oned on previous character 21 Dan Jurafsky Confusion matrix for spelling errors Dan Jurafsky Genera ng the confusion matrix Peter Norvig s list of errors Peter Norvig s list of counts of single edit errors 23 Dan Jurafsky Channel model Kernighan Church Gale 1990 del wi 1 wi if deletion count wi 1 wi ins wi 1 xi if insertion count wi 1 P x w sub xi wi if substitution count w i trans w w count wi wi 1 if transposition i 24 i 1 Dan Jurafsky Channel model for acress Candidate Correct Error x w Correc on LeRer LeRer 25 P x word actress t c ct 000117 cress a a caress ca ac ac ca 00000164 access c r r c 000000209 across o e e o 0000093 acres s es e 0000321 acres s ss s 0000342 00000144 Dan Jurafsky Noisy channel probability for acress Candidate Correct Error x w Correc on LeRer LeRer 26 P x word P word 109 P x w P w 0000231 2 7 actress t c ct 000117 cress a a caress ca ac ac ca 00000164 access c r r c 000000209 0000916 019 across o e e o 0000093 000299 2 8 acres s es e 0000321 0000318 1 0 acres s ss s 0000342 0000318 1 0 00000144 000000544 00078 00000170 0028 Dan Jurafsky Noisy channel probability for acress Candidate Correct Error x w Correc on LeRer LeRer 27 P x word P word 109 P x w P w 0000231 2 7 actress t c ct 000117 cress a a caress ca ac ac ca 00000164 access c r r c 000000209 0000916 019 across o e e o 0000093 000299 2 8 acres s es e 0000321 0000318 1 0 acres s ss s 0000342 0000318 1 0 00000144 000000544 00078 00000170 0028 Dan Jurafsky Using a bigram language model a stellar and versatile acress whose combination of sass and glamour Counts from the Corpus of Contemporary American English with add 1 smoothing P actress versatile 000021 P whose actress 0010 P across versatile 000021 P whose across 000006 P versatile actress whose 000021 0010 210 x10 10 P versatile across whose 000021 000006 1 x10 10 28 Dan Jurafsky Using a bigram language model a stellar and versatile acress whose combination of sass and glamour Counts from the Corpus of Contemporary American English with add 1 smoothing P actress versatile 000021 P whose actress 0010 P across versatile 000021 P whose across 000006 P versatile actress whose 000021 0010 210 x10 10 P versatile across whose 000021 000006 1 x10 10 29 Dan Jurafsky Evalua on Some spelling error test sets 30 Wikipedia s list of common English misspelling Aspell ltered version of that list Birkbeck spelling error corpus Peter Norvig s list of errors includes Wikipedia and Birkbeck for training or tes ng Spelling Correction and the Noisy Channel The Noisy Channel Model of Spelling Spelling Correction and the Noisy Channel Real Word Spelling Correc on Dan Jurafsky Real word spelling errors leaving in …


View Full Document

Stanford CS 124 - Lecture Notes

Download Lecture Notes
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Lecture Notes and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Lecture Notes 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?