Improving Bayesian Spelling CorrectionIntroductionRelated WorkDamerau 1964Pollock and Zamora 1984Kernighan, Church, and Gale 1990Church and Gale 1990Mays, Damerau, and Mercer 1990Kukich 1992Brill and Moore 2000Garfinkel, Fernandez, and Gopal 2003Bayesian InferenceMotivationImplementation DetailsTraining DataLikelihoodPriorLocalityCalculating sorted candidate correction listEvaluationResultsConclusionFuture WorkBibliographyImproving Bayesian Spelling CorrectionJacqueline BodineIntroductionThe problem of correcting a misspelled word to the word originally intended by the typist has interested many researchers. However, despite the research, the best spelling correctors still achieve below 96% accuracy in returning the correct word as the top candidate correction, though higher levels have been achieved for returning the correct word somewhere in a list of candidate corrections. These accuracy rates imply that any reasonable spell checker today must be interactive, that is to say, it must present the user with a list of possible corrections, and leave the final decision as to which word to choose to the user. Many hours of human labor could be saved if the accuracy rate for the top-candidate correction could be improved. With an improved accuracy rate, spell checkers could change from interactive to automatic, that is to say, they could run withouthuman intervention.In this project, I analyzed the different components of the Bayesian inference approach to spelling correction to determine the contribution of each component. I then attempted to improve the performance of the spelling corrector by improving the accuracy of the components. While the model I used as a starting point is not the most sophisticated model, hopefully the analysis and results from this project will still be relevant and provide guidance for researchers developing more sophisticated spelling correctors.Related WorkDamerau 1964Errors in text can be caused by either transcription errors or human misspellings. In an inspection of misspelled words, Damerau found that over 80 percent had only one spelling error, and that these single errors could be separated into four classes: insertion of an extra letter, deletion of a letter, substitution of an incorrect letter for a correct letter, or transposition of two adjacent letters.Using this information, Damerau designed a program to find the correct spelling of a word that has one error. The program detects misspelling by first comparing words of four to six characters in length to a list of high frequency words. If the word is not found in the list, it is searched for in the dictionary optimized by alphabetical order, word length, and character occurrence.If the word does not exist in the dictionary, the program attempts to find the correctly spelled version of the word. All words that differ in length by more than one or differ in character occurrence by more than 2 bit positions are not possible corrections under the one-error assumption, and so they are not checked. The remaining words are checked against the original word using spelling rules that eliminate or swap the positionswith differences and compare the rest of the positions in the word.If the dictionary word is the same length as the misspelled word, and the words differ in only one position, then the dictionary word is accepted because they differ by a single substitution. If the dictionary word and the misspelled word are the same length,and the words differ in only two adjacent positions, then if the characters are interchanged, and if they match, the dictionary word is accepted, because they differ by a single transposition. If the dictionary word is one character longer, then the first difference character in the dictionary word is discarded, the rest of the characters are shifted left and if they match, then the dictionary word is accepted, because they differ bya single deletion. If the dictionary word is one character shorter, then the first difference character in the misspelled word is discarded, the rest of the characters are shifted left, and if the words now match, the dictionary word is accepted because they differ by a single insertion.This algorithm was shown to correctly identify 96.4% of the words that were misspelled by a since spelling error. Of 964 spelling errors, 122 had more than one spelling error, 812 were correctly identified, and 30 were incorrectly identified. Along with this high accuracy rate, a strength of this algorithm is that it is very simple and efficient for detecting if a misspelled word and a dictionary word could be the same with one spelling error.This algorithm brings up two areas of concern. The first concern is that the algorithm scales linearly with the size of the dictionary because each dictionary word is checked against the misspelled word. Words of significantly different length or characters than the misspelled words are not checked in detail, but the length and characters are still checked. The second concern is that if there are multiple words withinone edit of the misspelled word, this algorithm will always select the first word alphabetically that occurs in the dictionary, and will never consider the other words.Pollock and Zamora 1984SPEEDCOP (Spelling Error Detection/Correction Project) is a program designed to automatically correct spelling errors by finding words similar to the misspelled word, reducing the set to words that could be created by a single spelling error, and then sorting the corrections by the probability of the type of error, and then the probability of the corrected word. The main part of the program works by hashing each word of the dictionary to a skeleton key and an omission key, and then creating dictionaries sorted by the keys. When a misspelled word is encountered, the skeleton key and omission key are created. Words that have keys that appear close to the key of the misspelled word are similar to the misspelled word, and are used as the set of possible corrections.The skeleton key is created by taking the first letter of the word, followed by all of the unique consonants in order, followed by the unique vowels in order. The structure of this key rests on the following rationale: (1) the first letter keyed is likely to be correct, (2) consonants carry more information than vowels, (3) the original consonant order is mostly preserved, and (4) the key is not altered by the doubling or undoubling of
View Full Document