Modular Approach to Error Analysis and Evaluation for Multilingual Question Answering

Hideki Shima, Mengqiu Wang, Frank Lin, Teruko Mitamura
Language Technologies Institute, School of Computer Science, Carnegie Mellon University
5000 Forbes Ave, Pittsburgh, PA 15213 USA
{hideki, mengqiu, frank+, teruko}@cs.cmu.edu

Abstract
Multilingual Question Answering systems are generally very complex, integrating several submodules to achieve their result. Global metrics (such as average precision and recall) are insufficient when evaluating the performance of individual submodules and their influence on each other. In this paper, we present a modular approach to error analysis and evaluation; we use manually-constructed, gold-standard input for each module to obtain an upper bound for the (local) performance of that module. This approach enables us to identify existing problem areas quickly, and to target improvements accordingly.

1. Introduction
In this paper, we present a new approach for the evaluation of Multilingual Question Answering (MLQA) systems. Our focus is the JAVELIN MLQA system for factoid questions, which integrates multiple modules in a sequential pipeline with no backtracking or dynamic planning (Lin et al., 2005). The system requires complex integration of several modules. In order to evaluate our system, we analyzed the performance of each module on the evaluation data from the NTCIR CLQA1 task [1]. We created gold-standard data (perfect input) for each module, in order to establish performance upper bounds for each module. Our analysis allows us not only to identify several research issues, but also to compare the performance of our system across different languages (English-Chinese and English-Japanese) on a per-module basis. When evaluating the performance of the same system handling different languages, modular analysis is also useful for identifying language-specific issues in individual modules.
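The upper-bound idea described above can be illustrated with a small sketch: run the sequential pipeline, but replace selected modules' outputs with gold-standard ("perfect") data so that downstream modules are evaluated in isolation. The module functions and data shapes below are illustrative stand-ins, not the actual JAVELIN interfaces.

```python
def run_pipeline(question, modules, gold=None):
    """Run a sequential pipeline of (name, fn) stages. If a module's
    name appears in `gold`, substitute the gold-standard output for
    that stage instead of computing it."""
    gold = gold or {}
    data = question
    for name, fn in modules:
        data = gold[name] if name in gold else fn(data)
    return data

# Toy stand-ins for the QA, TM, RS, IX, and AG modules:
modules = [
    ("QA", lambda q: {"keywords": q.lower().rstrip("?").split()}),
    ("TM", lambda a: {"keywords": ["<tr>" + k for k in a["keywords"]]}),
    ("RS", lambda t: {"docs": ["doc1", "doc2"]}),
    ("IX", lambda r: {"candidates": ["cand1"]}),
    ("AG", lambda x: x["candidates"][0]),
]

# Baseline run vs. a run with gold-standard translations injected,
# which bounds how well RS/IX/AG could do with perfect TM output:
baseline = run_pipeline("Who founded CMU?", modules)
with_gold_tm = run_pipeline("Who founded CMU?", modules,
                            gold={"TM": {"keywords": ["gold_kw"]}})
```

Comparing the two runs attributes the score difference to translation errors rather than to the downstream modules.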
Since our evaluation focuses on the performance of the system, it is a form of information-based evaluation rather than utility-based or architectural evaluation (Nyberg & Mitamura, 2002). We adopted a fully-automatic, information-based approach to support regular batch evaluation during development and maintenance of the system.

2. Javelin Architecture
Our JAVELIN MLQA system consists of five modules: Question Analyzer (QA), Translation Module (TM), Retrieval Strategist (RS), Information eXtractor (IX) and Answer Generator (AG). Input question sentences in English are processed by these modules in the order listed above. The answer candidates are returned in one of two languages (Japanese or Chinese) as final outputs.

The QA module is responsible for parsing the input question, choosing the expected answer type, and producing a set of keywords. The QA module calls the TM module, which translates the keywords into the language(s) required by the task. We use a combination of machine translation (MT) approaches for translating keywords: web-based MT, dictionary-based MT and text-mining-based MT. The system selects the combination of translated keywords which are most likely to co-occur. Subsequently, the translated keywords are passed to the RS module in order to retrieve a ranked list of relevant documents. Given these documents, the IX module extracts answer candidates and assigns confidence scores to each candidate. Finally, the AG module normalizes and clusters the answers, and attempts to boost the ranks of the most probable answer candidates. The overall architecture is shown in Figure 1.

Figure 1: System Architecture

3. Results and Analysis
In addition to evaluating the overall performance of our system (e.g., by measuring average answer precision), we performed evaluations on a per-module basis in order to identify and analyze specific failure points.

[1] http://www.slt.atr.jp/CLQA/
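The TM module's selection of the translated-keyword combination "most likely to co-occur" can be sketched as follows. This is an illustrative simplification, not the JAVELIN implementation: `cooccurrence_count` stands in for a real web or corpus frequency lookup, and the candidate translations are hypothetical.

```python
from itertools import product

def pick_translation(candidates, cooccurrence_count):
    """candidates: one list of translation alternatives per keyword.
    Scores every combination of alternatives and returns the one
    with the highest co-occurrence count."""
    best, best_score = None, float("-inf")
    for combo in product(*candidates):
        score = cooccurrence_count(combo)
        if score > best_score:
            best, best_score = combo, score
    return best

# Toy frequency table standing in for corpus/web statistics:
# "Tokyo station" translated word-by-word has competing alternatives.
freq = {("東京", "駅"): 120, ("東京", "ステーション"): 3,
        ("トーキョー", "駅"): 1, ("トーキョー", "ステーション"): 0}
combo = pick_translation([["東京", "トーキョー"], ["駅", "ステーション"]],
                         lambda c: freq.get(tuple(c), 0))
```

Exhaustive enumeration is fine for short keyword lists; a real system would prune or cache lookups since the number of combinations grows multiplicatively with the number of keywords.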
We used the formal run dataset from the NTCIR CLQA1 task, which includes English-Chinese (E-C) and English-Japanese (E-J) subtasks. 200 input questions were provided for each of the subtasks.

Subtask  Gold Standard Input  QA ATYPE Acc.  TM Acc.  RS Top15  IX Top100  MRR    Top1 (R)  Top1 (R+U)
E-C      None                 86.5%          69.3%    30.5%     30.0%      0.130  7.5%      9.5%
E-C      TM                   86.5%          -        57.5%     50.0%      0.254  9.5%      20.0%
E-C      TM+QA ATYPE          -              -        57.5%     50.5%      0.260  9.5%      20.5%
E-C      TM+QA ATYPE+RS       -              -        -         63.0%      0.489  41.0%     43.0%
E-J      None                 93.5%          72.6%    44.5%     31.5%      0.116  10.0%     12.5%
E-J      TM                   93.5%          -        67.0%     41.5%      0.154  9.5%      15.0%
E-J      TM+QA ATYPE          -              -        68.0%     45.0%      0.164  10.0%     15.5%
E-J      TM+QA ATYPE+RS       -              -        -         51.5%      0.381  32.0%     32.5%

Table 1: Per-module performance given gold-standard input

To evaluate the system's output, we used the gold standard data from NTCIR, which includes correct answers and the documents they came from. We distinguish "correct and well-supported" answers from "correct but unsupported" answers in the following way. We define documents in the gold standard dataset as supporting documents. Correct answers that came from supporting documents are deemed correct and supported (denoted by R), whereas correct answers that did not come from supporting documents are called unsupported answers (denoted by U). Let "top n frequency" be the frequency of the event where at least one correct answer was included in the top n answer candidates returned, and let "average top n accuracy" be the average of the top n frequency over the questions. Note that this metric does not evaluate the number of correct answers returned for each question, whereas "average precision at n", commonly used in the field of information retrieval, does count the number of correct documents (Buckley & Voorhees, 2000). Following the evaluation method in the CLQA1 formal run, we use top 1 average accuracy as the metric for evaluating overall performance. In Table 1, the overall performance (top 1 average accuracy) is shown in the last two columns of the rows for E-C and E-J.
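The metrics defined above can be made concrete with a short sketch. "Average top n accuracy" checks whether any correct answer appears among the first n candidates, averaged over questions; mean reciprocal rank (MRR), also reported in Table 1, averages the reciprocal rank of the first correct answer. The toy ranked lists below are illustrative, not CLQA1 data.

```python
def top_n_accuracy(ranked_lists, gold_sets, n):
    """Fraction of questions with at least one correct answer
    among the top n ranked candidates."""
    hits = sum(1 for ranked, gold in zip(ranked_lists, gold_sets)
               if any(a in gold for a in ranked[:n]))
    return hits / len(ranked_lists)

def mean_reciprocal_rank(ranked_lists, gold_sets):
    """Average of 1/rank of the first correct answer per question
    (0 contribution if no correct answer is returned)."""
    total = 0.0
    for ranked, gold in zip(ranked_lists, gold_sets):
        for rank, answer in enumerate(ranked, start=1):
            if answer in gold:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

# Two toy questions: answered at rank 1 and rank 3 respectively.
ranked = [["a", "x"], ["y", "z", "b"]]
gold = [{"a"}, {"b"}]
top1 = top_n_accuracy(ranked, gold, 1)   # only the first question hits at rank 1
mrr = mean_reciprocal_rank(ranked, gold) # (1/1 + 1/3) / 2
```

Counting the R and R+U variants from Table 1 would simply mean building `gold_sets` from supported answers only, or from all correct answers.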
If we examine only such global measures, we will not be able to determine which individual modules are responsible for the system's failures.