Stanford CS 224N - EVOLUTIONARY APPROACH TO NATURAL LANGUAGE GRAMMAR INDUCTION

AN EVOLUTIONARY APPROACH TO NATURAL LANGUAGE GRAMMAR INDUCTION

MARGARET AYCINENA, MYKEL J. KOCHENDERFER, DAVID CARL MULFORD

Abstract. This paper describes an approach for evolving natural language grammars using a genetic algorithm, based on part-of-speech tagged natural language examples. Grammars are represented by the genetic algorithm as a finite string of non-terminals and pre-terminals corresponding to the rules of a context-free grammar in Chomsky Normal Form. Fitness is based on the number of sentences correctly parsed by the grammar from a selection of language examples, inversely related to the number of random sentences parsed by the grammar, and discounted by the length of the string representing the grammar. Our experimentation reveals that this evolutionary approach produces grammars with high precision and recall, although they are dissimilar to grammars designed by humans.

1. Introduction

Grammar induction, also known as grammatical inference [1], describes a process in which a system produces a grammar given a set of corpora. Given the complexity of natural language and the increasingly high availability of large bodies of text, automated development of grammars is and will continue to be an important area of natural language processing and machine learning.

Many previous grammar induction systems have worked on the assumption that grammar induction is a subset of inductive logic programming (ILP), and as such have primarily depended on ILP techniques. Much less work has been done on using evolutionary techniques to evolve natural language grammars.

In one of the few papers applying evolutionary techniques to grammar induction, Keller and Lutz [4] used a genetic algorithm to evolve stochastic context-free grammars for finite languages. In that work, only positive examples of language data were presented.
The fitness function was based on the probability of a particular grammar given the corpus, but the minimum description length principle was also incorporated so that simpler grammars would be preferred. Notably, the genetic algorithm was applied not to evolve a grammar itself, but to evolve probability parameter settings for a pre-chosen covering Chomsky Normal Form grammar. And perhaps most importantly (for this work, at least), the languages of the corpora were formal; e.g. the language of all strings consisting of equal numbers of a's and b's, and palindromes over {a, b}.

This paper presents work that goes significantly beyond what was done in [4]. First, the system actually evolves non-stochastic grammars, rather than simply the probability parameters for a pre-chosen grammar. Second, the system uses both positive and negative examples of language data. Third, the fitness function was based on the ability of a given grammar to parse the data. Finally, the corpora were samples of part-of-speech tagged natural English language.

Date: June 2, 2003.
Key words and phrases: grammar induction, genetic algorithm.

2. Methods

In this section, we describe the evolutionary approach used to induce a grammar for a part-of-speech tagged natural language corpus. Although the approach is based on the idea of a genetic algorithm as originally presented in [3] and later in [2], we made a few significant adaptations that make the process more suitable for grammar induction.

2.1. Genetic Algorithm. Chromosomes represent context-free grammars with variable-length strings of non-terminals and pre-terminals. For example, the string

SABABCBCDCAE

would represent the following CFG:

S → A B
A → B C
B → C D
C → A E

We restrict the set of possible strings such that the left side of each production (i.e. every third character in the string) is a non-terminal.
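This triple-based encoding can be sketched as follows. The decoder below is illustrative, not the paper's implementation; it assumes single-character grammar symbols and a chromosome length that is a multiple of three, and the `Production` struct name is ours:

```cpp
#include <cassert>
#include <string>
#include <vector>

// One CNF production: lhs -> rhs1 rhs2.
struct Production {
    char lhs, rhs1, rhs2;
};

// Decode a chromosome into productions: each consecutive triple of
// characters forms one rule, so every third character (positions
// 0, 3, 6, ...) is a left-hand side and must be a non-terminal.
std::vector<Production> decode(const std::string& chromosome) {
    assert(chromosome.size() % 3 == 0);
    std::vector<Production> rules;
    for (std::size_t i = 0; i < chromosome.size(); i += 3) {
        rules.push_back({chromosome[i], chromosome[i + 1], chromosome[i + 2]});
    }
    return rules;
}
```

Decoding "SABABCBCDCAE" under this scheme yields the four rules S → A B, A → B C, B → C D, C → A E from the example above.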
All of the other characters may be either pre-terminals or non-terminals.

The population is organized on the surface of a torus, represented as a two-dimensional grid with opposite sides connected (similar to [4]). The initial population is generated randomly from a uniform distribution of chromosomes of some specified length.

The genetic algorithm then executes the following select-breed-replace cycle:

(1) Select an individual randomly from the grid.
(2) Breed that individual with its most highly fit neighbor to produce two children.
(3) Replace the weakest parent by the fittest child.

In this paper, we refer to a single iteration through this cycle as a generation, even though only one individual in the population is replaced.

Given two parents, we create two children by first applying cross-over and then probabilistically applying mutation. Cross-over is accomplished by selecting a random production in each parent. Then a random point in these productions is selected and cross-over is performed, swapping the remainder of the strings after the cross-over points. Following cross-over, we iterate over each non-terminal and pre-terminal in the strings and apply mutation according to a specified probability. A mutation is simply the swapping of a non-terminal or pre-terminal with another non-terminal or pre-terminal.

2.2. Parsing. At the core of the fitness function is the ability to parse a sentence. Since our grammar is in Chomsky Normal Form, we can use the Cocke-Kasami-Younger (CKY) algorithm. This algorithm is efficient (O(n^3) performance) for grammars of this type. The algorithm can be easily extended to handle probabilistic context-free grammars.

Because speed is of the utmost importance, we have implemented the parser in C++.
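A minimal CKY recognizer over such a CNF grammar can be sketched as follows. This is an unoptimized illustration, not the paper's tuned parser: it iterates over all rules at every cell rather than using the indexed lookup described below, it treats each pre-terminal (POS tag) as deriving itself, and the `Production` struct and names are assumptions of this sketch:

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// One CNF production: lhs -> rhs1 rhs2.
struct Production {
    char lhs, rhs1, rhs2;
};

// CKY recognition over a sentence given as a string of single-character
// pre-terminal (POS-tag) symbols. Returns true if the start symbol
// derives the whole sentence.
bool parses(const std::vector<Production>& rules,
            const std::string& sentence, char start) {
    const std::size_t n = sentence.size();
    if (n == 0) return false;
    // table[i][len] = set of symbols deriving sentence[i .. i+len-1].
    std::vector<std::vector<std::set<char>>> table(
        n, std::vector<std::set<char>>(n + 1));
    // Base case: each pre-terminal derives its own one-tag span.
    for (std::size_t i = 0; i < n; ++i) table[i][1].insert(sentence[i]);
    // Fill longer spans bottom-up, trying every split point k.
    for (std::size_t len = 2; len <= n; ++len)
        for (std::size_t i = 0; i + len <= n; ++i)
            for (std::size_t k = 1; k < len; ++k)
                for (const Production& p : rules)
                    if (table[i][k].count(p.rhs1) &&
                        table[i + k][len - k].count(p.rhs2))
                        table[i][len].insert(p.lhs);
    return table[0][n].count(start) > 0;
}
```

With the single rule S → A B, the tag sequence "AB" is accepted from start symbol S while "BA" is not, since CKY only combines adjacent spans in the order the rule specifies.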
In order to make the parser as fast as possible, we have used the following implementation strategies:

(1) When filling a square in the well-formed sentence table, we need to know which symbols can produce the two child symbols. A 2D array is used to store grammar rules. Entry (a, b) contains a vector of all parents c that produce a and b (all rules c → a b).
(2) 2D arrays are simulated using a 1D array. If you have a pointer to an entry, you can increment it by 1 to advance to the next column, or increment it by numcolumns to reach the next row. This eliminates the need for multiplication operations to index into the WFST when iterating over split points. Also, the array's size is forced to be a power of 2 so that the initial indexing operation can be done with a bit-shift rather than a multiplication.
(3) STL vectors are used rather than STL sets for each WFST entry. We need to iterate over a table entry much more than we need to insert, so the faster iteration
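The flattened-array strategy in (2) can be illustrated with the sketch below. The names (`round_up_pow2`, `FlatTable`) and the exact layout are assumptions of this sketch, not the paper's code; it only demonstrates the pointer-arithmetic and bit-shift indexing ideas described above:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Round n up to the next power of two so that row indexing can be
// done with a bit-shift instead of a multiplication.
std::size_t round_up_pow2(std::size_t n) {
    std::size_t p = 1;
    while (p < n) p <<= 1;
    return p;
}

// A 2D table simulated with a flat 1D array. The row stride is padded
// to a power of two, so row * stride becomes row << shift.
struct FlatTable {
    std::size_t stride, shift;
    std::vector<int> data;

    FlatTable(std::size_t rows, std::size_t cols) {
        stride = round_up_pow2(cols);
        shift = 0;
        while ((std::size_t{1} << shift) < stride) ++shift;
        data.assign(rows * stride, 0);
    }

    // Initial indexing uses a bit-shift; from the returned pointer,
    // +1 advances one column and +stride advances one row, so inner
    // loops over split points need no multiplications at all.
    int* at(std::size_t row, std::size_t col) {
        return &data[(row << shift) + col];
    }
};
```

For example, a 4x5 table gets a stride of 8: from a pointer to entry (2, 0), adding 3 reaches entry (2, 3), and adding the stride to any entry's pointer reaches the same column one row down.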

