DOC PREVIEW
ODU CS 791 - Lecture Notes

This preview shows page 1-2-3-19-20-39-40-41 out of 41 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 41 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 41 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 41 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 41 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 41 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 41 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 41 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 41 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 41 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

1 Billion Pages = 1 Million Dollars? Mining the Web to Play “Who Wants to be a Millionaire?”OUTLINEINTRODUCTIONCharacteristics of QuestionsThe PlayerTHE QA MODULENaive Approach: CountingSimple Query ModificationsSimple Query Modifications (cont’d.)Word Proximity MeasuresWord Proximity Measures (cont’d.)Slide 12Noun-Phrase ProximityCombining StrategiesCombining Search EnginesCombining Search Engines (cont’d.)Slide 17Confidence-Based Weight AssignmentsConfidence-Based Weight Assignments (cont’d.)Slide 20Slide 21Slide 22Overall PerformanceChoosing Good WeightsSample “Problem” QuestionsSample “Problem” Questions (cont’d.)THE DM MODULETHE DM MODULE (cont’d.)Slide 29Modeling the GameModeling the Game (cont’d.)Playing the Game: ResultsPlaying the Game: Results (cont’d.)Slide 34Slide 35Slide 36Hard Questions Easy, Easy Questions HardSix Questions to HumanSix Questions to Human (cont’d.)CONCLUSIONSDISCUSSION QUESTIONS1 Billion Pages = 1 Million Dollars?Mining the Web to Play “Who Wants to be a Millionaire?”Presented byBhavana TapdeByShyong Lam, David Pennock, Dan Cosley, Steve LawrenceCS791 – Technologies of GoogleMarch 21, 2007OUTLINEIntroductionPlaying MillionaireQuestion-Answering (QA) ModuleDiscussion: QA ModuleDecision-Making (DM) ModuleDiscussion: DM ModuleConclusionINTRODUCTION“Who Wants To Be A Millionaire?” – very popular ABC TV game show.http://www.millionairetv.com/game/active/frame_game.htmlPurpose: Utilize the redundancy and volume of information available on the web, and build a computerized player for the game show.The player written by the authors, is based on a home version of the game: the Millionaire CD-ROM, 3rd edition.Characteristics of QuestionsMillionaire CD-ROM game contains 635 questions.Questions are categorized into seven difficulty levels. The lower difficulty levels contain more common sense and common knowledge questions.To examine algorithms and tuning parameters, three random 90-question samples and one random 180-question sample are used.PLAYING MILLIONAIREThe Player The Millionaire player consists of two main componentsQuestion-Answering (QA) module for multiple-choice questions and Decision-Making (DM) modulePLAYING MILLIONAIRETHE QA MODULEPurpose: Make use of text corpora on the web, and different strategies to find correct answer for question, out of four choices.Uses–World Wide Web as data source–Several search engines (most prominently Google)Naive Approach: CountingQuery Google with the question along with each of the four answers. Correct answer? – one that produces the highest number of search results. “inverted” questions – the answer is the one that is unlike the other three. Can be identified by the word “not” in the question.The strategy answers about 50% of the questions correctly.QA MODULESimple Query ModificationsQuery transformations and modifications increase the percentage of correct responses to 60%.Enclose multiple-word answers in quotes to appear them as a phrase in any search results.For “Complete a saying” questions, construct each possible saying from choices and it should appear in the search results.QA MODULESimple Query Modifications (cont’d.)If query returns no results for any of the answers, use series of “fallback” queries that progressively relax the query.Longer web pages may contain less useful information, e.g. lists of links, essays, etc. Search engine results not restricted on page size. First-order approximation - exclude .pdf files from results.QA MODULEWord Proximity MeasuresKey idea - Not only answers appear in the same documents as questions, but also they usually appear near the question words.Download first 10 pages returned by Google for each query.Score each document based on a heuristic DistanceScore. Each question word has a score between 0 and 1 depending on how close the word is from the answer word.Use the average score per answer word in the document to further filter the documents.QA MODULEWord Proximity Measures (cont’d.)Table 1*: Pseudocode for DistanceScore, our proximity scoring method for favoring question words that appear near (within rad words of) answer words.// wordList is the document split at spacesDistanceScore(wordList, qWords, aWords, rad) score, answerWords = 0 for i = 1 to |wordList| do if wordList[i] is in aWords then answerWords = answerWords + 1 for j = (i-rad) to (i+rad) do if wordList[j] is in qWords thenscore += (rad - abs(i-j)) / rad if answerWords == 0 then return 0 else return score / answerWords* All tables and figures are taken from the paper.QA MODULEWord Proximity Measures (cont’d.)Figure 1: Question-answering accuracy versus proximity radius when using DistanceScore, as compared to the naïve method on three 90-question samples. Each line represents performance on one sample.QA MODULENoun-Phrase ProximityIdentify noun phrases using heuristics based on Brill’s Part-of-Speech tagger.Submit each {noun-phrase, answer} pair to Google and score the results the same way as before.Naive method produces poor results.DistanceScore produces results comparable to the previous two strategies.QA MODULECombining StrategiesAmong the naive, DistanceScore based on naive, and DistanceScore based on noun phrase strategies, at least one has the correct answer for about 85% of the questions in a 180-question sample.Combine three strategies, and produce a single score for each possible answer.ci =  ws * (Si / max {S1..n}) over all strategies ci – combined score for answer i ws – weight for strategy S Si – score for strategy S for answer i n – number of candidate answersPerformance increases to 70%.QA MODULECombining Search EnginesUse of multiple search engines improves results.Modify each of the three strategies to submit queries to AllTheWeb, MSN Search, and AltaVista, using syntax appropriate for each engine.Combine scores using the same formula as before.QA MODULECombining Search Engines (cont’d.)engine naive proxim phr prox combinedGoogle 55.6% 55.0% 68.9% 70%AllTheWeb 56.1% 51.7% 58.3% 66%MSN 44.4% 48.9% 47.2% 58%AltaVista 46.7% 55.6% 56.1% 68%Google performs better than the other engines.Table 2: Performance of the three strategies using different search engines.QA MODULECombining Search Engines (cont’d.)Combine the


View Full Document

ODU CS 791 - Lecture Notes

Documents in this Course
Load more
Download Lecture Notes
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Lecture Notes and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Lecture Notes 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?