1 Billion Pages = 1 Million Dollars?
Mining the Web to Play “Who Wants to be a Millionaire?”

Presented by Bhavana Tapde
By Shyong Lam, David Pennock, Dan Cosley, Steve Lawrence
CS791 – Technologies of Google
March 21, 2007

OUTLINE

Introduction
Playing Millionaire
Question-Answering (QA) Module
Discussion: QA Module
Decision-Making (DM) Module
Discussion: DM Module
Conclusion

INTRODUCTION

“Who Wants To Be A Millionaire?” is a very popular ABC TV game show.
http://www.millionairetv.com/game/active/frame_game.html

Purpose: exploit the redundancy and volume of information available on the web to build a computerized player for the game show.

The authors’ player is based on a home version of the game: the Millionaire CD-ROM, 3rd edition.

Characteristics of Questions

The Millionaire CD-ROM game contains 635 questions. Questions are categorized into seven difficulty levels.
The lower difficulty levels contain more common-sense and common-knowledge questions.

To examine algorithms and tuning parameters, three random 90-question samples and one random 180-question sample are used.

PLAYING MILLIONAIRE

The Player

The Millionaire player consists of two main components:
– a Question-Answering (QA) module for multiple-choice questions, and
– a Decision-Making (DM) module.

THE QA MODULE

Purpose: use text corpora on the web, together with several strategies, to find the correct answer to a question out of four choices.

Uses:
– the World Wide Web as data source
– several search engines (most prominently Google)

Naive Approach: Counting

Query Google with the question along with each of the four answers. The correct answer? The one that produces the highest number of search results.

“Inverted” questions: the answer is the one that is unlike the other three. These can be identified by the word “not” in the question.

This strategy answers about 50% of the questions correctly.

Simple Query Modifications

Query transformations and modifications increase the percentage of correct responses to 60%.

Enclose multiple-word answers in quotes so that they must appear as a phrase in any search result.

For “complete a saying” questions, construct each possible saying from the choices and require it to appear in the search results.

Simple Query Modifications (cont’d.)

If a query returns no results for any of the answers, use a series of “fallback” queries that progressively relax the query.

Longer web pages may contain less useful information (e.g., lists of links, essays). Search engine results are not restricted by page size; as a first-order approximation, exclude .pdf files from the results.

Word Proximity Measures

Key idea: answers not only appear in the same documents as the questions, they usually appear near the question words.

Download the first 10 pages returned by Google for each query.

Score each document using a heuristic, DistanceScore.
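Such a proximity heuristic can be rendered as a short, runnable Python function. This is a sketch only: the function name, whitespace tokenization, and the clipping of the scoring window at document boundaries are our choices, not necessarily the paper’s.

```python
def distance_score(word_list, q_words, a_words, rad):
    """Score a document by how close question words sit to answer words.

    Each question word within `rad` positions of an answer-word
    occurrence contributes (rad - distance) / rad; the total is
    normalized by the number of answer-word occurrences.
    """
    score = 0.0
    answer_hits = 0
    for i, word in enumerate(word_list):
        if word in a_words:
            answer_hits += 1
            # Examine a window of `rad` words on either side of the
            # answer word, clipped to the document boundaries.
            lo = max(0, i - rad)
            hi = min(len(word_list) - 1, i + rad)
            for j in range(lo, hi + 1):
                if word_list[j] in q_words:
                    score += (rad - abs(i - j)) / rad
    return score / answer_hits if answer_hits else 0.0
```

A caller would split each downloaded page at spaces and pass the question words and one candidate answer’s words as sets; the candidate whose pages score highest wins.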
Each question word contributes a score between 0 and 1, depending on how close it is to an answer word. The average score per answer word in a document is then used to filter the documents further.

Word Proximity Measures (cont’d.)

Table 1*: Pseudocode for DistanceScore, our proximity scoring method for favoring question words that appear near (within rad words of) answer words.

    // wordList is the document split at spaces
    DistanceScore(wordList, qWords, aWords, rad)
        score = 0
        answerWords = 0
        for i = 1 to |wordList| do
            if wordList[i] is in aWords then
                answerWords = answerWords + 1
                for j = (i - rad) to (i + rad) do
                    if wordList[j] is in qWords then
                        score += (rad - abs(i - j)) / rad
        if answerWords == 0 then
            return 0
        else
            return score / answerWords

* All tables and figures are taken from the paper.

Word Proximity Measures (cont’d.)

Figure 1: Question-answering accuracy versus proximity radius when using DistanceScore, as compared to the naïve method, on three 90-question samples. Each line represents performance on one sample.

Noun-Phrase Proximity

Identify noun phrases using heuristics based on Brill’s part-of-speech tagger.

Submit each {noun-phrase, answer} pair to Google and score the results the same way as before.

The naive method produces poor results here; DistanceScore produces results comparable to the previous two strategies.

Combining Strategies

Among the naive, DistanceScore-on-naive, and DistanceScore-on-noun-phrase strategies, at least one finds the correct answer for about 85% of the questions in a 180-question sample.

Combine the three strategies to produce a single score for each possible answer:

    c_i = sum over all strategies s of  w_s * (s_i / max{s_1 .. s_n})

    c_i – combined score for answer i
    w_s – weight for strategy s
    s_i – score of strategy s for answer i
    n   – number of candidate answers

Performance increases to 70%.

Combining Search Engines

Using multiple search engines improves results.

Modify each of the three strategies to submit queries to AllTheWeb, MSN Search, and AltaVista, using syntax appropriate for each engine. Combine scores using the same formula as before.

Combining Search Engines (cont’d.)

Table 2: Performance of the three strategies using different search engines.

    engine      naive   proxim  phr prox  combined
    Google      55.6%   55.0%   68.9%     70%
    AllTheWeb   56.1%   51.7%   58.3%     66%
    MSN         44.4%   48.9%   47.2%     58%
    AltaVista   46.7%   55.6%   56.1%     68%

Google performs better than the other engines.

Combining Search Engines (cont’d.)

Combine the