DOC PREVIEW
SWARTHMORE CS 97 - Part Of Speech Tagging Using A Hybrid System

This preview shows page 1-2 out of 7 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 7 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 7 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 7 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

Appeared in: Proceedings of the Class of 2003 Senior Conference, pages 23–29Computer Science Department, Swarthmore CollegePart Of Speech Tagging Using A Hybrid SystemSean FinneySwarthmore [email protected] AngelilloSwarthmore [email protected] procedure is proposed for tagging part ofspeech using a hybrid system that consists ofa statistical based rule finder and a genetic al-gorithm which decides how to use those rules.This procedure will try to improve upon an al-ready very good method of part of speech tag-ging.1 IntroductionThe tagging of corpora is an important issue that has beenaddressed frequently in computational linguistics for dif-ferent types of tags. Transformation-Based Learning(Brill, 1995) is a successful statistical method of tagging acorpus. It has been useful in part of speech tagging, wordsense disambiguation, and parsing among other things.This paper describes a hybrid method for tagging partof speech. The method employs an implementation ofBrill’s Transformation-Based Learning (TBL) as the sta-tistical part of the system, and a Genetic Algorithm whichtries to modify the transformation ruleset in a way thatwill produce better results.Part of speech taggers are very useful in modern Natu-ral Language Processing, and have potential applicationsin machine translation, speech recognition, and informa-tion retrieval. They are usually used as a first step on ablock of text, producing a tagged corpus that can thenbe worked with. Of course, the better the tagger is, themore accurate overall results will be. While TBL is avery successful part of speech tagger, it does still createsome errors. It might be possible to tag a corpus perfectly,and it might be possible to use TBL, slightly modified, toacheive that goal.During Transformation-Based Learning, transforma-tion rules are learned which will correct errors in an in-correctly tagged corpus. The incorrectly tagged corpus iscompared to the truth in order to learn these rules. Oncecond 1 transform xcond 2 transform ycond 3cond 5cond 3cond 1cond 4cond 2transform xtransform ztransform wtransform xtransform xtransform zFigure 1: A Sample Ruleset for TBLthe rules are learned, they are applied in an order deter-mined by a greedy search. The rules are applied to thecorpus to see how many errors they correct. The rule thatcorrects the most errors is chosen as the next rule in theruleset. This process is inherently shortsighted.The issue here is the greedy search. As rules are ap-plied, the number of errors corrected in the text is alwaysthe largest. However, the transformation rules do tend tocreate errors, even as they are fixing errors. This processassumes that those errors are unimportant, or that theywill be corrected later on in the ruleset. This assumptionis made by TBL, and we are hoping to fix this potentialproblem. With a careful reordering of the rules, it mightbe possible to create a tagged corpus without any errors.2 Related WorkOther workers in the field have produced non-statisticalsystems to solve common NLP problems. There appears23Appeared in: Proceedings of the Class of 2003 Senior Conference, pages 23–29Computer Science Department, Swarthmore Collegeto be a number of methods that produce varying results.The Net-Tagger system (Schmid, 1994) used a neural net-work as its backbone, and was able to perform as well asa trigram-based tagger.A group working on a hybrid neural network tagger forThai (et al., 2000) was able to get very good results on arelatively small corpus. They explain that while it is easyto find corpora for English, languages like Thai are lesslikely to have corpora on the order of 1,000,000 words.The hybrid rule-based and neural network system theyused was able to reach an accuracy of 95.5(22,311 am-biguous word) corpus. It seems that the strengths of rule-based statistical and linguistic methods work to counter-act the inadequacies of a system like a neural network,and vice versa. It seems logical to combine the learningcapabilities of an AI system with the context based rulecreation capabilities of a statistical system.3 Genetic AlgorithmsThe Genetic Algorithm (GA) takes its cue directly frommodern genetic theory. This method entails taking aproblem and representing it as a set of individual chro-mosomes in a population. Each chromosome is a stringof bits. The GA will go through many generations ofchromosomes. One generation consists of a fitness se-lection to decide which chromosomes can reproduce, re-production itself, and finally a chance for post reproduc-tion defects. The GA generates successive populationsof chromosomes by taking two parent chromosomes se-lected by a monte carlo approach and applying the basicgenetic reproductionfunction to them. The chromosomesare ranked by a fitness function specific to the problem.The monte carlo approach assures that the two best chro-mosomes are not always chosen.The reproduction function is crossover,inwhichtwochromosome parents are sliced into two pieces (See Fig-ure 5 in the appendix). The pieces each take one of thepieces from the other parent. The resulting two chromo-some children have part of each of the parents1.After reproduction, there is a chance that mutation willoccur, in which some part of the child chromosome israndomly changed (see Figure 6 in the appendix).When selection, reproduction, and mutation have fin-ished, the generation is completed and the process startsagain. The genetic algorithm is a good way to try manyprobable solutions to a problem. After a few generationsof a genetic search, it is likely that the solution generatedwill already be very good.TextTBL RulesetCompilableChromosomeGAFigure 2: A Sample Flowchart for Our System4 Representing the ProblemGiven that we are trying to produce the optimal set ofrules, and the order in which to use them, our chromo-somes are best represented as a set of rules in order. Thefirst parent from which all successive generations grow iscollected directly from TBL. We let TBL find the rulesthat it finds useful, trim that set of rules to a satisfactorylength, and run that through as the first parent of our GA.One issue that we ran across was how we would repre-sent each TBL rule as a bitstring. A rule in TBL consistsof a series of predicates acting on the words before andafter the word in question, and a trigger that changes thetagged part of speech on the word if all of the predicatesare met. A predicate can be either a matching of a part


View Full Document

SWARTHMORE CS 97 - Part Of Speech Tagging Using A Hybrid System

Documents in this Course
Load more
Download Part Of Speech Tagging Using A Hybrid System
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Part Of Speech Tagging Using A Hybrid System and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Part Of Speech Tagging Using A Hybrid System 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?