SWARTHMORE CS 97 - Grammar Checking using POS Tagging and Rules Matching


Appeared in: Proceedings of the Class of 2005 Senior Conference, pages 14–19
Computer Science Department, Swarthmore College

Grammar Checking using POS Tagging and Rules Matching

Zac Rider
Computer Science Department
Swarthmore College
Swarthmore, PA
[email protected]

Abstract

This paper is an examination of various techniques that could be used for grammar checking and a description of the results that were generated using a simple rules matching system. To generate the rules for this system, two techniques were considered: hand construction and an algorithm that randomly generates large numbers of rules and uses comparison against large corpora to find valid rules. While individual construction of rules proved to be effective for addressing specific errors, the random algorithm proved to be effective for a larger number of grammatical errors.

1 Introduction

There’s something wrong with the sentence "Microsoft company should big improve Word grammar check", but Word 2004 thinks that the only problem is that "company" should be capitalized. Grammar checking is one of the more complicated tasks for word processing, and the more irregular and exception-filled the language, the more difficult the problem becomes. Problems such as a noun-verb mismatch ("one of the mistakes are bad") or adjectives incorrectly used as adverbs ("I can’t read so good") are much easier to find than a somewhat ambiguous mistake such as "The badger was acted upon" (passive voice).

The simplest method of fixing grammatical errors, which was used for the experiments for this project, is the process of rules matching, that is, constructing a rule that applies to a given grammar and then checking that the given input follows, or does not follow, that rule. Using lexigraphically aided finite state machines is another, more complicated method that combines a bootstrapped learning algorithm with parsing and POS tagging (Sofkova Hashemi et al., 2003).
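To make the rules matching idea concrete, the following is a minimal sketch and not the system described in this paper. It assumes the input has already been POS tagged with Penn Treebank style tags, and the two error rules, the names `ERROR_RULES` and `find_errors`, and the hand-tagged example sentences are all hypothetical, chosen only to mirror the example errors above.

```python
from typing import Dict, List, Tuple

# A tagged sentence is a list of (word, POS tag) pairs.
Tagged = List[Tuple[str, str]]

# Hypothetical error rules: each maps a description to a tag sequence
# that is assumed to signal a likely error. These are illustrative only.
ERROR_RULES: Dict[str, List[str]] = {
    "adjective used as adverb": ["VB", "RB", "JJ"],
    "modal followed by adjective": ["MD", "JJ"],
}


def find_errors(sentence: Tagged,
                rules: Dict[str, List[str]] = ERROR_RULES):
    """Return (rule name, start index, matched words) for every place
    where a rule's tag sequence occurs contiguously in the sentence."""
    tags = [tag for _, tag in sentence]
    hits = []
    for name, pattern in rules.items():
        n = len(pattern)
        # Slide a window of the rule's length across the tag sequence.
        for start in range(len(tags) - n + 1):
            if tags[start:start + n] == pattern:
                words = " ".join(w for w, _ in sentence[start:start + n])
                hits.append((name, start, words))
    return hits


# Hand-tagged examples (the tags are assumptions, not tagger output).
bad = [("I", "PRP"), ("ca", "MD"), ("n't", "RB"),
       ("read", "VB"), ("so", "RB"), ("good", "JJ")]
good = [("I", "PRP"), ("ca", "MD"), ("n't", "RB"),
        ("read", "VB"), ("so", "RB"), ("well", "RB")]

print(find_errors(bad))   # flags "read so good"
print(find_errors(good))  # no rule matches
```

Rules this coarse will also flag correct text (for instance, "be so good" matches the first pattern), which is one reason a real rule set needs more context than a bare tag sequence.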
Other methods include syntactic analysis and parse tree analysis (Bender et al., 2004).

One thing that differs among grammar checking systems is whether the system checks for negative or positive grammar. Intuitively, it seems like it might be easier to define the properties that are correct in a grammar, as there are a set number of grammatical configurations that are correct and an infinite number of configurations that are incorrect. The problem is that describing all of the correct configurations for a grammar checker requires that for every check, it must look at every single rule to see if a given example is in the grammar. This process is necessarily slower than a system that uses relatively few rules per check to see if something is not in the grammar. Since speed is not of great concern for the system in this paper, the rules checking could have been implemented either way, but for simplicity, we chose to implement rules that check for specific errors in grammar instead of using a model of correct grammar to find incorrect examples. For a small system, it is easier to describe a few things in English that are grammatically incorrect than every rule that is correct.

2 Related Work

One approach to checking grammar relies on a technique called aligned generation (Bender et al., 2004). However, this process is not used in the everyday sort of grammar checking that might be used in a word processor; rather, it is a complicated process that takes a fair amount of time and is used for generating language learning systems.
The system takes mal-rules and mal-lexical types and entries given by the user and uses feature structure grammar analysis, which is an extensive search of multiple parse trees for errors based on the given rules. The majority of the work in the system is the parsing itself, in which the input sentence is put into every possible configuration, and then those configurations are rated, and an acceptable configuration is chosen. One concern with this method is that the process of creating the parse trees for analysis is potentially time consuming.

Finite state machine analysis has the interesting property of not being a rules based system; rather, it is a bootstrapped learning system that uses regular expressions along with FSMs to attempt to judge the correctness of lexically determined phrases (Sofkova Hashemi et al., 2003). The phrases generated by the system’s lexicon are strings mapped to a tag containing part-of-speech and other feature information. While this method has 92% recall, it only has approximately 45% precision. This could prove cumbersome for a word processor system, as the user could be presented with many cases that the checker flags as errors that are, in fact, correct. However, for the task described in this paper, the recall percentage is acceptable. The random rules generator described in this paper is an approximation of this type of analysis, but the system detailed in this paper has no lexigraphical aids.

The system that this paper attempts to emulate is the Granska rules matching system (Domeij et al., 1999), which makes a point of not using Hidden Markov models and simply using what the author calls error rules to locate errors and helping rules to attempt to determine the best correction, and thus the best fitting rule for a given error. The Granska system has precision and recall of approximately 80% for the problems that it was designed for, namely noun-phrase disagreement and incorrectly split compounds in Swedish.
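As a reminder of what the precision and recall figures quoted in this section measure, here is a small worked example. The counts are invented purely so that the arithmetic lands near the 92% recall and 45% precision cited for the finite state approach; they are not data from any of the cited systems, and the function names are illustrative.

```python
def precision(true_pos: int, false_pos: int) -> float:
    """Fraction of flagged items that are real errors."""
    flagged = true_pos + false_pos
    return true_pos / flagged if flagged else 0.0


def recall(true_pos: int, false_neg: int) -> float:
    """Fraction of real errors that were flagged."""
    actual = true_pos + false_neg
    return true_pos / actual if actual else 0.0


# Hypothetical counts: a text with 100 real errors, of which the checker
# finds 92, while also raising 112 false alarms on correct text.
true_pos, false_neg, false_pos = 92, 8, 112

r = recall(true_pos, false_neg)      # 92 / 100
p = precision(true_pos, false_pos)   # 92 / 204

print(f"recall = {r:.2f}, precision = {p:.2f}")
```

With counts like these, more than half of everything the checker flags is actually correct, which is the inconvenience for word processor users noted above.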
The issue with the Granska system, however, is that while it has good results for these two problems, it turns out that the methods used in Granska do not translate well to all problems in grammar.

3 Parts of a Grammar Checker

A typical grammar checker that might be found in a word processor consists of three different pieces. First, a processor has to be able to separate the input into individual sentences. Then, it needs a part-of-speech (POS) tagger that can accurately label the data that it has. Charniak has an excellent analysis of POS tagging (Charniak et al., 1993) that is used by the makers of the Granska system. The particular POS tagger that is used for this system was taken from the Stanford website http://www-nlp.stanford.edu/links/statnlp.html (Toutanova and Manning, 2000; Toutanova et al., 2003) and works in log linear time. The speed of this system substantially speeds up training
