Base Noun Phrase Chunking with Support Vector Machines

Alex Cheng
CS674: Natural Language Processing – Final Project Report
Cornell University, Ithaca, NY
[email protected]

Abstract

We apply Support Vector Machines (SVMs) to identify base noun phrases in sentences. SVMs are known to achieve high generalization performance even in high-dimensional feature spaces. We explore two different chunk representations (IOB and open/close brackets) and use a two-layer system approach for the classification task. Experiments show that despite using a dynamic programming approach to find the most probable pairing in the open/close bracket representation, the IOB representation performs significantly better. In addition, using higher-degree polynomial kernel functions leads to slightly better results. We conclude that SVMs are an extremely powerful machine learning approach for many natural language processing tasks and outperform other learning systems because of their ability to generalize in high dimensions.

1. Introduction

Base noun phrase (baseNP) chunking involves dividing sentences into non-overlapping segments of noun phrases. Computing these noun phrase chunks is usually relatively inexpensive, and chunking is often used as a precursor to full parsing and to further semantic analysis such as marking candidates for noun-phrase co-reference. In addition, noun phrase chunks serve as multi-word indexing terms and are important for information retrieval and information extraction tasks.

The Support Vector Machine (SVM) is a relatively new statistical machine learning approach for solving binary classification problems. Essentially, SVMs maximize the margin between critical training examples by solving a dual optimization problem. Kernel functions allow SVMs to combine the input features at relatively low computational cost and transform SVMs into non-linear classifiers. Kudo and Matsumoto (2001) apply SVMs to NP chunking and achieve higher precision and recall on the standard data set than previously reported. They use a boosting method of weighted voting between SVMs trained with different chunk representations to identify noun phrase chunks.

In this paper, we apply SVMs to the NP chunking task using system architectures different from Kudo and Matsumoto's. We introduce a two-layer system of SVMs that iterates in its classification stage until convergence. We examine the effect of higher-degree polynomial kernel functions on the chunking task. Moreover, we apply SVMs to an open/close bracket chunk representation, and we analyze the effect of the different parameters used for combining the brackets to find the most probable noun phrase chunks.

1.1 Task Description

The definition of noun phrases can be ambiguous. In this work, we follow the definition proposed by Ramshaw and Marcus (1995), who develop a method for deriving and extracting NP chunks from the Penn Treebank. The goal of NP chunking is to identify the initial portions of non-recursive, non-overlapping noun phrases, including determiners but excluding postmodifying prepositional phrases or clauses. The bracketed portions of Figure 1, for example, show the base NPs.

Figure 1: Base NP example (sentence extracted from the WSJ):
[ Pierre Vinken ] , [ 61 years old ] , will join [ the board ] as [ a nonexecutive director ] [ Nov. 29 ] .
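To make the IOB chunk representation concrete, the following minimal sketch (our illustration, not code from the paper) converts the bracketed Figure 1 sentence into per-token IOB tags. Since the preview does not specify which IOB variant the paper uses, we assume the common IOB2 convention, in which every chunk-initial token is tagged B; the function name and tokenization here are our own.

```python
# A minimal sketch (assumption: IOB2-style tagging) that converts a
# sentence with [ ... ] baseNP brackets into (token, tag) pairs, where
# B = begins a baseNP, I = inside a baseNP, O = outside any baseNP.

def bracketed_to_iob(bracketed):
    """Convert a space-tokenized bracketed sentence to (token, tag) pairs."""
    tags = []
    inside = False   # currently inside a bracketed baseNP?
    begin = False    # is the next token the first token of a chunk?
    for tok in bracketed.split():
        if tok == "[":
            inside, begin = True, True
        elif tok == "]":
            inside = False
        elif inside and begin:
            tags.append((tok, "B"))
            begin = False
        elif inside:
            tags.append((tok, "I"))
        else:
            tags.append((tok, "O"))
    return tags

sentence = ("[ Pierre Vinken ] , [ 61 years old ] , will join [ the board ] "
            "as [ a nonexecutive director ] [ Nov. 29 ] .")
for token, tag in bracketed_to_iob(sentence):
    print(f"{token}\t{tag}")
# -> Pierre/B Vinken/I ,/O 61/B years/I old/I ,/O will/O join/O
#    the/B board/I as/O a/B nonexecutive/I director/I Nov./B 29/I ./O
```

Under this encoding, chunking reduces to assigning one of three classes to each token, which is what makes it a natural fit for a classifier such as an SVM.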
2. Support Vector Machines

The SVM is a relatively new machine learning approach based on statistical learning theory. SVMs are well known for their good generalization performance and have been applied to many recognition problems. Recently, SVMs have been applied to natural language processing tasks such as chunking (Kudo & Matsumoto, 2001) and text classification (Joachims, 2002). In particular, these two NLP systems are reported to achieve higher accuracy than most state-of-the-art systems (both learning and knowledge-based approaches). There are theoretical and empirical results indicating that SVMs generalize well in a high-dimensional feature space without over-fitting the training data (Joachims, 2002).

2.1 Optimal Hyperplane

Essentially, an SVM is a linear classifier for solving binary classification problems. We first introduce the linear hard-margin SVM, which operates under the assumption that the training examples are linearly separable. We then generalize it to the soft-margin SVM, which we use for our work in NP chunking.

We can define the problem as follows. We are given a training set $S = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, where each example $x_i \in \mathbb{R}^d$ is a feature vector of dimension $d$ and each label $y_i \in \{+1, -1\}$ assigns the example to the positive or negative class. A hard-margin SVM seeks an optimal separating hyperplane $h$ that classifies all training examples correctly, where $h(x) = w \cdot x + b$, $w$ is the vector orthogonal to the hyperplane, and $b$ is its offset from the origin. We measure the optimality of the hyperplane by its margin: the minimum distance from $h$ to the training examples of both classes. Intuitively, a hyperplane with a larger margin has higher generalization power. Scaling $w$ and $b$ so that the closest examples on either side satisfy $(w \cdot x_1) + b = 1$ and $(w \cdot x_2) + b = -1$, it is not hard to show that the minimum distance between $h$ and the training examples is $1 / \|w\|$. Therefore, by minimizing $\|w\|$, we maximize the margin. So the hard-margin SVM seeks $w$ and $b$ such that

Minimize: $\frac{1}{2}\, w \cdot w$
Subject to: $y_i[(w \cdot x_i) + b] \ge 1 \quad \forall i = 1, \ldots, n$

A hard-margin SVM finds a hyperplane that separates the training examples with perfect accuracy. In practice, however, a hard-margin SVM often fails, either because of noise in the training data (see Figure 2) or because a separating hyperplane may not exist at all. A soft-margin SVM therefore allows for such training noise by introducing an error parameter $C$. The error $\xi_i$ for each training example is measured by the distance between the example and the margin of the example's class (see Figure 3). The optimization problem becomes:

Minimize: $\frac{1}{2}\, w \cdot w + C \sum_{i=1}^{n} \xi_i$
Subject to: $y_i[(w \cdot x_i) + b] \ge 1 - \xi_i$ and $\xi_i \ge 0 \quad \forall i = 1, \ldots, n$

Clearly, as $C \to \infty$ we recover essentially a hard-margin SVM, since any error makes the term being minimized approach infinity. The parameter $C$ thus controls how much "slack" we give the SVM, and it can be adjusted based on knowledge of the learning task. For computational reasons, it is easier to solve the Wolfe dual of the optimization problem, which can be derived using Lagrange multipliers. This optimization problem can be solved by
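The role of $C$ in the soft-margin objective above can be illustrated with off-the-shelf tools. The following sketch (our illustration; it is not the paper's two-layer system) fits scikit-learn's SVC, whose C argument is the error penalty in the objective and whose polynomial kernel corresponds to the kernel functions discussed earlier; the toy data and all parameter values are assumptions for demonstration only.

```python
# A minimal sketch of the soft-margin trade-off: C is the penalty from
# the objective (1/2) w.w + C * sum(xi_i). A large C approximates a
# hard-margin SVM; a small C tolerates more training noise.
# Toy data and parameter values are assumptions, not the paper's setup.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy linearly separable data with one deliberately mislabeled point,
# so a hard margin fits the noise while a softer margin ignores it.
X = rng.normal(size=(40, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
y[0] = -y[0]  # inject label noise

for C in (0.1, 1.0, 1000.0):  # 1000 behaves nearly like a hard margin
    # kernel="poly" with degree 2 uses the polynomial kernel
    # (x . x' + coef0)^2, combining pairs of input features implicitly.
    clf = SVC(C=C, kernel="poly", degree=2, coef0=1.0)
    clf.fit(X, y)
    print(f"C={C:>7}: training accuracy {clf.score(X, y):.2f}, "
          f"{clf.support_vectors_.shape[0]} support vectors")
```

With a small $C$, more examples fall inside the margin (becoming support vectors with nonzero $\xi_i$) and the noisy point is tolerated; as $C$ grows, the classifier bends to fit every training label, which is exactly the over-fitting risk the soft margin is designed to control.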