Stanford CS 224N - Building a Question Classifier for a TREC-Style Question Answering System

Building a Question Classifier for a TREC-Style Question Answering System
Richard May & Ari Steinberg

Topic: Question Classification

We define Question Classification (QC) here to be the task that, given a question, maps it to one of k classes, which provide a semantic constraint on the sought-after answer [Li02]. The topic of Question Classification arises in the area of automated question-answering systems, such as those created for the TREC question answering competition. Automated question-answering systems differ from other information retrieval systems (i.e., search engines) in that they do not return a list of documents with possible relevance to the topic, but rather return a short phrase containing the answer. Moreover, question answering systems take as input queries expressed in natural language rather than the keywords traditional search engines use.

In order to respond correctly to a free-form factual question given a large collection of texts, any system needs to understand the question to a level that allows determining some of the constraints the question imposes on a possible answer. These constraints may include a semantic classification of the sought-after answer and may even suggest using different strategies when looking for and verifying a candidate answer. More specifically, knowing the class (or possible classes) of the sought-after answer narrows down the number of possible phrases/paragraphs a question-answering system has to consider, and thus greatly improves performance of the overall system. [Harabagiu] divides their QA system into three distinct pieces, and at the core of the first module lies a question classification task. Thus, question classification is an important subtask of automated question answering. An error in question classification will almost undoubtedly throw off the entire QA pipeline, so it is crucial to be able to classify questions correctly.

In our paper, we will build on the hierarchical classification discussed in [Li02] and experiment with some features of our own design. We expect that by tweaking both the classification algorithms and the choice of features, we can get improvements in this crucial QA subsystem.

Classification

Classification Methodology

Results from [Li02] demonstrated that a flat classifier performs just as well as a two-layer hierarchical classifier that used a coarse classifier to dispatch the classification task to a second classifier. We, too, plan on using a hierarchical classifier; however, ours will differ in that we will also attempt to learn which classes are often confused, whereas [Li02] applied domain knowledge to the task and created six hand-crafted super classes that then contained distinct subsets of the true classes.

Classification System

For our code base, we leverage an existing machine learning library, MALLET, discussed in [McCallum]. We designed a hierarchical classifier trainer, which takes in a training set and partitions it into a base train set and an advanced train set. The trainer uses the base train set to train the coarse classifier over all the possible question types. The trainer then tests the coarse classifier on the advanced training data to build a confusion matrix. Using a set of threshold parameters, the trainer decides whether certain predicted classes have too high a confusion rate and, if so, trains a secondary classifier on the advanced training instances that were predicted to be part of the high-confusion-rate classes.

Figure 1: Diagram of how to train a hierarchical classifier
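To make the two-stage training procedure concrete, the following is a minimal Python sketch, not the authors' MALLET implementation: it assumes scikit-learn, questions already encoded as feature dictionaries, a logistic-regression learner, and an illustrative confusion-rate threshold of 0.2.

# Minimal sketch of the two-stage training procedure described above.
# Assumptions: feature-dict inputs, logistic regression as the learner,
# and a 0.2 confusion-rate threshold; none of this is the authors' setup.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

def train_hierarchical(base_X, base_y, adv_X, adv_y, labels, threshold=0.2):
    vec = DictVectorizer()
    coarse = LogisticRegression(max_iter=1000)
    coarse.fit(vec.fit_transform(base_X), base_y)        # train on the base set

    # Test the coarse classifier on the advanced train set to build a
    # confusion matrix (rows = true class, columns = predicted class).
    adv_pred = coarse.predict(vec.transform(adv_X))
    cm = confusion_matrix(adv_y, adv_pred, labels=labels)

    secondary = {}
    for j, label in enumerate(labels):
        predicted_as_j = cm[:, j].sum()
        if predicted_as_j == 0:
            continue
        confusion_rate = 1.0 - cm[j, j] / predicted_as_j
        if confusion_rate > threshold:
            # Retrain on only the advanced instances *predicted* to be
            # in this high-confusion class.
            idx = [i for i, p in enumerate(adv_pred) if p == label]
            ys = [adv_y[i] for i in idx]
            if len(set(ys)) < 2:
                continue                                 # nothing to separate
            sub = LogisticRegression(max_iter=1000)
            sub.fit(vec.transform([adv_X[i] for i in idx]), ys)
            secondary[label] = sub
    return vec, coarse, secondary

At prediction time, a question would first be routed through the coarse classifier; if its predicted class has an associated secondary classifier, that classifier makes the final decision.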
Features

The majority of our creative energies were focused on feature engineering. It is a process that requires a lot of trial and error. We intend to leverage resources in WordNet to improve on semantic understanding of the questions.

As input to our machine learning algorithm, the computer examines each question and derives a representation consisting of numerous features. In the end, a typical question can have as many as 60 features, and our set of 6000 total questions can result in anywhere from 30,000 to 120,000 unique features. While it would be easy to generate far more than this, due to memory constraints we must be careful to select only the most useful types of features to include in this set. In addition, adding extraneous features can add noise to the data and result in weaker performance. On the other hand, large performance improvements can be gained by adding useful features to the set.

Basic features

The most basic representation of a question is simply the individual words of that question (ignoring contextual information such as the ordering of the words). While simple, this is also by far the most important part of our program - the best indicators of certain question types are single words, and in particular question words clearly reveal a lot about the type of question being asked. On the other hand, a large amount of information would be lost by stopping here, since words can often mean many different things depending on their context.

We can regain some of this contextual information by examining part-of-speech tags. We run a parser over the question and take the preterminal nodes of the parse as the parts of speech for each word. These parts of speech alone would not help much, so we add word and part-of-speech pairs as features, thus helping to disambiguate words which have different senses depending on their part of speech.

Another problem that part-of-speech tags fail to address is feature sparsity (in fact, part-of-speech and word pairs suffer even further from this problem). Some words can reveal a lot of information, but do not show up enough times in the training set to allow our classifier to pick up on this. To address this problem, we use a stemmer to create a more generalized form of the word. For example, whether a verb is in present or past tense might not impact how it affects the question being asked.

Another basic feature that we add is bigrams - pairs of words that occur sequentially. However, bigrams inherently suffer even worse from sparsity problems than individual words, so we use the stemmed form of the words to create the bigrams.

One final basic feature that we experimented with is to take conjunctions of any other two basic features. Bigrams do not capture long-distance dependencies - two words that may affect each other's meanings but are not adjacent in the sentence - so we attempted to capture these with conjunctions. The conjunctions did help to improve accuracy, but they resulted
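For illustration, here is a small Python sketch of the basic feature set described in this section (unigrams, word/part-of-speech pairs, stems, stemmed bigrams, and optional pairwise conjunctions). It is an assumption-laden stand-in rather than the authors' code: it uses NLTK's off-the-shelf tagger and Porter stemmer instead of parser preterminals, and the feature-name scheme is invented.

# Sketch of the basic features described above. Uses NLTK's tagger and
# Porter stemmer (requires the 'punkt' tokenizer and POS-tagger data),
# not the parser preterminals the paper describes, so the exact features
# will differ; feature names here are illustrative.
from itertools import combinations

from nltk import pos_tag, word_tokenize
from nltk.stem import PorterStemmer

_stemmer = PorterStemmer()

def basic_features(question, conjunctions=False):
    """Return a dict of binary features for one question string."""
    tokens = word_tokenize(question)
    tagged = pos_tag(tokens)
    stems = [_stemmer.stem(w.lower()) for w in tokens]

    feats = {}
    for (word, tag), stem in zip(tagged, stems):
        feats[f"word={word.lower()}"] = 1              # bag of words
        feats[f"word+pos={word.lower()}/{tag}"] = 1    # disambiguate by POS
        feats[f"stem={stem}"] = 1                      # fight sparsity with stems
    for a, b in zip(stems, stems[1:]):
        feats[f"bigram={a}_{b}"] = 1                   # stemmed bigrams

    if conjunctions:
        # Conjunctions of any two basic features, meant to capture
        # long-distance dependencies; this inflates the feature count fast.
        for a, b in combinations(sorted(feats), 2):
            feats[f"conj={a}&{b}"] = 1
    return feats

A call such as basic_features("Where was Babe Ruth born?") would produce entries like word=where along with the corresponding word/part-of-speech, stem, and stemmed-bigram features, and these dictionaries can feed directly into the trainer sketched earlier.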

