Evaluation of IR Systems
Adapted from lectures by Prabhakar Raghavan (Yahoo and Stanford) and Christopher Manning (Stanford)

Contents (slide titles)
- This lecture
- Result summaries
- Summaries
- Static summaries
- Dynamic summaries
- Generating dynamic summaries
- Alternative results presentations?
- Evaluating search engines
- Measures for a search engine
- Data retrieval vs information retrieval
- Measuring user happiness
- Happiness: elusive to measure
- Evaluating an IR system
- Difficulties with gauging relevancy
- Standard relevance benchmarks
- Unranked retrieval evaluation: precision and recall
- Precision and recall
- Precision and recall in practice
- Should we instead use the accuracy measure for evaluation?
- Why not just use accuracy?
- Precision/recall trade-offs
- Difficulties in using precision/recall
- A combined measure: F (aka the E measure, a parameterized F measure)
- F1 and other averages
- Breakeven point
- Evaluating ranked results
- Computing recall/precision points: an example
- A precision-recall curve
- Interpolating a recall/precision curve
- Average recall/precision curve
- Evaluation metrics (cont'd)
- Typical (good) 11-point precisions
- 11-point precisions
- Receiver operating characteristic (ROC) curve
- Mean average precision (MAP)
- Average precision
- Summarize a ranking: MAP
- Discounted cumulative gain
- Summarize a ranking: DCG
- DCG example
- Summarize a ranking: NDCG
- NDCG example
- R-precision
- Variance
- Test collections
- Creating test collections for IR evaluation
- From document collections to test collections
- Can we avoid human judgment?
- Approximate vector retrieval
- Alternative proposal
- Kappa measure for inter-judge (dis)agreement
- Kappa measure: example
- Other evaluation measures
- Fallout rate
- Subjective relevance measure
- Other factors to consider
- Early test collections
- Critique of pure relevance
- Evaluation at large search engines
- A/B testing
- TREC benchmarks
- The TREC benchmark
- The TREC objectives
- TREC advantages
- TREC tasks
- Standard relevance benchmarks: others
- Characteristics of the TREC collection
- More details on document collections
- TREC disks 4 and 5
- Sample document (with SGML)
- Sample query (with SGML)
- TREC properties
- Two more TREC document examples
- Another example of a TREC topic/query
- Cystic Fibrosis (CF) collection
- CF document fields

This lecture
- Results summaries: making our good results usable to a user.
- How do we know if our results are any good? Evaluating a search engine:
  - Benchmarks
  - Precision and recall

Result Summaries
- Having ranked the documents matching a query, we wish to present a results list.
- Most commonly this is a list of document titles plus a short summary, aka the "10 blue links".

Summaries
- The title is typically extracted automatically from document metadata. What about the summaries?
- This description is crucial: the user identifies good/relevant hits based on it.
- Two basic kinds:
  - A static summary of a document is always the same, regardless of the query that hit the doc.
  - A dynamic summary is a query-dependent attempt to explain why the document was retrieved for the query at hand.

Static summaries
- In typical systems, the static summary is a subset of the document.
- Simplest heuristic: the first 50 or so words of the document (the exact count can be varied); the summary is cached at indexing time.
- More sophisticated: extract a set of "key" sentences from each document.
  - Simple NLP heuristics score each sentence; the summary is made up of the top-scoring sentences.
- Most sophisticated: NLP is used to synthesize a summary.
  - Seldom used in IR (cf. work on text summarization).
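As an illustration of the two simpler heuristics above, here is a minimal Python sketch (not part of the original slides); the function names, the 50-word default, and the sentence-scoring scheme are illustrative assumptions rather than a prescribed method.

```python
import re
from collections import Counter

def first_words_summary(text, k=50):
    """Static summary: the first k words of the document,
    computed once and cached at indexing time."""
    return " ".join(text.split()[:k])

def key_sentence_summary(text, num_sentences=2):
    """Static summary from 'key' sentences: score each sentence by the
    average corpus-free frequency of its terms within this document and
    keep the top scorers, presented in their original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    term_counts = Counter(w.lower() for w in re.findall(r"\w+", text))

    def score(sentence):
        terms = re.findall(r"\w+", sentence.lower())
        return sum(term_counts[t] for t in terms) / (len(terms) or 1)

    top = sorted(range(len(sentences)),
                 key=lambda i: score(sentences[i]),
                 reverse=True)[:num_sentences]
    return " ".join(sentences[i] for i in sorted(top))
```

Either summary would be computed and stored alongside the document at indexing time, since neither depends on the query.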
Dynamic summaries
- Present one or more "windows" within the document that contain several of the query terms.
- "KWIC" snippets: keyword-in-context presentation, generated in conjunction with scoring.
- If the query is found as a phrase, show all or some occurrences of the phrase in the doc; if not, show document windows that contain multiple query terms.
- The summary itself gives the entire content of the window: all terms, not only the query terms.

Generating dynamic summaries
- If we have only a positional index, we cannot (easily) reconstruct the context window surrounding hits.
- If we cache the documents at index time, we can find windows in the cached text, cueing from the hit positions found in the positional index (a minimal sketch of this window extraction appears at the end of this section).
  - E.g., the positional index says the query occurs as a phrase at position 4378, so we go to that position in the cached document and stream out the content.
- Most often, only a fixed-size prefix of the doc is cached.
- Note: the cached copy can be outdated.

Dynamic summaries (continued)
- Producing good dynamic summaries is a tricky optimization problem.
- The real estate for the summary is normally small and fixed.
- We want snippets that are long enough to be useful, linguistically well formed, and maximally informative about the document.
- But users really like snippets, even though they complicate IR system design.

Alternative results presentations?
- An active area of HCI research.
- One alternative: http://www.searchme.com copies the idea of Apple's Cover Flow for search results.

Evaluating search engines

Measures for a search engine
- How fast does it index?
  - Number of documents per hour (and average document size)
- How fast does it search?
  - Latency as a function of index size
- Expressiveness of the query language
  - Ability to express complex information needs
  - Speed on complex queries
- Uncluttered UI
- Is it free?

Measures for a search engine (continued)
- All of the preceding criteria are measurable: we can quantify speed and size, and we can make expressiveness precise.
- The key measure is user happiness. What is this?
  - Speed of response and size of index are factors, but blindingly fast, useless answers won't make a user happy.
- We need a way of quantifying user happiness.

Data Retrieval vs Information Retrieval
- DR performance evaluation (after establishing correctness): response time, index space.
- IR performance evaluation: how relevant is the answer set? (Required to establish functional correctness, e.g., through benchmarks.)

Measuring user happiness
- Issue: who is the user we are trying to make happy? It depends on the setting/context.
- Web engine: the user finds what they want and returns to the engine; we can measure the rate of returning users.
- eCommerce site: the user finds what they want and makes a purchase. Is it ...
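To make the snippet-generation step described under "Generating dynamic summaries" concrete, here is a minimal, illustrative Python sketch (not from the slides): given hit positions reported by a positional index, it extracts fixed-width term windows from the document text cached at indexing time. The function and parameter names are invented for illustration.

```python
def kwic_snippet(cached_text, hit_positions, window=10, max_windows=2):
    """Dynamic (query-dependent) summary: for each hit position reported by
    the positional index, return a window of `window` terms on either side
    of the hit, taken from the cached copy of the document."""
    # Assumes the cached text is tokenized the same way as the index.
    tokens = cached_text.split()
    snippets = []
    for pos in hit_positions[:max_windows]:
        if pos >= len(tokens):
            # The cached prefix may be shorter than the full document.
            continue
        start = max(0, pos - window)
        end = min(len(tokens), pos + window + 1)
        snippets.append(" ".join(tokens[start:end]))
    return " ... ".join(snippets)

# Example: the index reports a phrase match at position 4378 in document d,
# so we stream the surrounding words out of the cached copy of d:
# snippet = kwic_snippet(cached_docs[d], [4378], window=10)
```

In a real system the cache and the index must tokenize consistently, and window boundaries would typically be adjusted toward sentence or phrase boundaries to keep snippets linguistically well formed.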