CS224N, Spring 2010

High-Performance Question Classification Using Semantic Features

Olalere Williams
Stanford University
[email protected]

Abstract

A question classification system uses machine learning techniques to classify a question by the type of answer it requires. Recent systems have successfully used syntactic parsing to identify small, highly informative phrases or words within a question to use as the principal features of a probabilistic model. In this paper, I explore the effectiveness of an alternative feature set that makes use of only semantic features. My system achieves a classification accuracy of 86.6% on the standard UIUC benchmark test set (Li and Roth, 2002), making it superior to all but the most recent classification system (Huang et al., 2008).

1 Introduction

In a seminal paper in this area of natural language processing, Li and Roth (2002) used a diverse feature set (consisting of both syntactic features such as POS tags and semantic features such as named entities) to achieve a baseline performance of 84.2% on a dataset they assembled that has since become standard. In addition to the UIUC data set that they published, they also defined a common classification taxonomy of 50 fine classes and 6 coarse classes that systems could target. Reporting classification accuracy against the set of fine classes has become the standard evaluation metric in this area.

Krishnan et al. (2005) more recently contributed the concept of an informer, a short (usually one- to three-word) contiguous phrase within the question that can be used to classify it accurately. They adopted a meta-learning framework in which they first trained a sequence model to classify informers, and then combined features of the predicted informer with more general features into a single large feature vector that was fed into a linear support vector machine (SVM) to classify the overall question. Informers were identified using features derived from a parse of the input question. This approach achieved an overall question classification accuracy of 86.2% (while identifying informers with 85% accuracy).

Finally, Huang et al. (2008) derived features from the head words of the principal noun phrases in a question (such as WordNet (Fellbaum, 1998) hypernyms, which were first introduced as promising features by Krishnan et al., 2005). This approach can be seen as a refinement of the concept of an informer, which occasionally included too many words and thus polluted the combined feature vector with misleading features. At the time of writing, this system achieves the highest published performance of 89.2% question classification accuracy.

In this paper, I explore an alternative feature set that is based purely on semantic features. The inspiration for this strategy comes from Li and Roth's (2002) observation that semantic features provided the greatest benefit to their overall classification system. Initially I set out to investigate to what extent semantic features could be used in isolation to create a question classifier with performance comparable to the baseline established by Li and Roth (2002). As it turns out, combining a rich set of semantic features with Krishnan et al.'s (2005) notion of an informer results in a system that is second only to Huang et al.'s (2008) most recent system.

2 Methodology

My overall system consists of a maximum entropy Markov model (MEMM) used as an informer tagger, and a maximum entropy classifier used to classify overall questions based on a large feature vector that combines informer features with features of all words in the question (informer words or otherwise).

2.1 Informer tagger

Following Krishnan et al. (2005), I build a sequence model to identify informers as an intermediate step to classifying the overall question. I chose to pursue this method because I thought a reasonable classification of informers could be achieved using only semantic features, whereas the more successful approach of Huang et al. (2008) seems to necessitate a syntactic parse of the input question. Training data for the informer tagger was obtained by hand-labelling 2000 questions from the standard training set. Similarly, for test data I hand-labelled the 500 questions that form the standard test set for the question classification problem.

My feature set for the informer tagger is compact and effective. For each word in the question (i.e. for each member of the sequence of words that will eventually be tagged with one of three possible labels: O1 for tokens that come before the informer, I for informer tokens, and O2 for tokens that come after the informer), I use as basic features: (1) the word itself; (2) the words immediately preceding and following it (if they exist); (3) whether or not the previous word was a “question word” (one of “what”, “which” or “how”); (4) the previous label; and (5) whether or not the question as a whole consists of six tokens or fewer. I then also model various interactions between these features by hand (as is required by a maximum entropy Markov model); a sketch of this feature extraction follows.
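To make the tagging scheme and feature set concrete, here is a minimal sketch of the per-token feature extraction as I read the description above. This is an illustration rather than the author's code: the function name, the feature-name strings, and the single interaction term shown are all assumptions.

# Illustrative sketch (not the author's implementation) of the basic
# MEMM features described above; names and the interaction are assumed.
QUESTION_WORDS = {"what", "which", "how"}

def informer_features(tokens, i, prev_label):
    """Features for token i, given the previous label ("O1", "I" or "O2")."""
    feats = {
        "word=" + tokens[i].lower(): 1.0,                # (1) the word itself
        "prev_label=" + prev_label: 1.0,                 # (4) previous label
        "short=" + str(len(tokens) <= 6): 1.0,           # (5) six tokens or fewer
    }
    if i > 0:                                            # (2) preceding word
        prev = tokens[i - 1].lower()
        feats["prev_word=" + prev] = 1.0
        is_qword = prev in QUESTION_WORDS
        feats["prev_is_qword=" + str(is_qword)] = 1.0    # (3) question word?
        # One example of a hand-built interaction, the kind of explicit
        # conjunction a maximum entropy Markov model requires:
        feats["qword&label=" + str(is_qword) + "/" + prev_label] = 1.0
    if i + 1 < len(tokens):                              # (2) following word
        feats["next_word=" + tokens[i + 1].lower()] = 1.0
    return feats

# Example: in "What is the capital of France ?", the informer is "capital";
# "What is the" would be tagged O1, "capital" I, and "of France ?" O2.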
All of these features are highly intuitive, except perhaps the last. I included it after realizing that definitional questions occur frequently in the training and test sets, and that this class of question can easily be identified by its relative brevity. Moreover, I think it is highly plausible that real question answering systems would also face a large number of short, definitional questions. This means that, far from being an ad hoc feature that exploits idiosyncrasies of the training and test sets, I believe this feature accurately models an important class of question.

The incremental gains secured by each feature are shown in Table 1. I follow Krishnan et al. (2005) in reporting Jaccard overlap as the principal measure of informer tagger performance (see that paper for justification; the computation is sketched below), but I also include more common evaluation metrics. Unfortunately, they did not make available the dataset they used for training and testing the informer tagger. Since our datasets were annotated differently, the numbers here are not directly comparable with Krishnan et al.'s (2005) figures.

As you can see from Table
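For reference, here is a minimal sketch of the Jaccard overlap measure reported in Table 1, assuming predicted and gold informer spans are represented as sets of token indices (the representation, function name, and empty-span convention are my assumptions; see Krishnan et al., 2005 for the original definition):

def jaccard_overlap(predicted, gold):
    """Jaccard overlap |P & G| / |P | G| between a predicted and a gold
    informer span, each given as a set of token indices."""
    if not predicted and not gold:
        return 1.0  # two empty spans count as perfect agreement (assumption)
    return len(predicted & gold) / len(predicted | gold)

# Example: gold informer covers tokens {3, 4}; the tagger predicts {3, 4, 5}.
# jaccard_overlap({3, 4, 5}, {3, 4}) -> 2/3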

