Evaluation of IR Systems
Adapted from lectures by Prabhakar Raghavan (Yahoo and Stanford) and Christopher Manning (Stanford)

Contents (slide titles)
- This lecture
- Result summaries
- Summaries
- Static summaries
- Dynamic summaries
- Generating dynamic summaries
- Alternative results presentations?
- Evaluating search engines
- Measures for a search engine
- Data retrieval vs information retrieval
- Measuring user happiness
- Happiness: elusive to measure
- Evaluating an IR system
- Difficulties with gauging relevancy
- Standard relevance benchmarks
- Unranked retrieval evaluation: precision and recall
- Precision and recall
- Precision and recall in practice
- Should we instead use the accuracy measure for evaluation?
- Why not just use accuracy?
- Precision/recall trade-offs
- Difficulties in using precision/recall
- A combined measure: F (aka the E measure, a parameterized F measure)
- F1 and other averages
- Breakeven point
- Evaluating ranked results
- Computing recall/precision points: an example
- A precision-recall curve
- Interpolating a recall/precision curve
- Average recall/precision curve
- Evaluation metrics (cont'd)
- Typical (good) 11-point precisions
- 11-point precisions
- Receiver operating characteristic (ROC) curve
- Mean average precision (MAP)
- Average precision
- Summarize a ranking: MAP
- Discounted cumulative gain
- Summarize a ranking: DCG
- DCG example
- Summarize a ranking: NDCG
- NDCG example
- R-precision
- Variance
- Test collections
- Creating test collections for IR evaluation
- From document collections to test collections
- Can we avoid human judgment?
- Approximate vector retrieval
- Alternative proposal
- Kappa measure for inter-judge (dis)agreement
- Kappa measure: example
- Other evaluation measures
- Fallout rate
- Subjective relevance measure
- Other factors to consider
- Early test collections
- Critique of pure relevance
- Evaluation at large search engines
- A/B testing
- TREC benchmarks
- The TREC benchmark
- The TREC objectives
- TREC advantages
- TREC tasks
- Standard relevance benchmarks: others
- Characteristics of the TREC collection
- More details on document collections
- TREC disks 4 and 5
- Sample document (with SGML)
- Sample query (with SGML)
- TREC properties
- Two more TREC document examples
- Another example of a TREC topic/query
- Cystic Fibrosis (CF) collection
- CF document fields

This lecture
- Results summaries: making our good results usable to a user.
- How do we know if our results are any good? Evaluating a search engine:
  - Benchmarks
  - Precision and recall

Result Summaries
- Having ranked the documents matching a query, we wish to present a results list.
- Most commonly this is a list of document titles plus a short summary, aka the "10 blue links".

Summaries
- The title is typically extracted automatically from document metadata. What about the summaries?
- This description is crucial: the user identifies good/relevant hits based on it.
- Two basic kinds:
  - A static summary of a document is always the same, regardless of the query that hit the doc.
  - A dynamic summary is a query-dependent attempt to explain why the document was retrieved for the query at hand.

Static summaries
- In typical systems, the static summary is a subset of the document.
- Simplest heuristic: the first 50 or so words of the document (the exact count can be varied); the summary is cached at indexing time.
- More sophisticated: extract a set of "key" sentences from each document.
  - Simple NLP heuristics score each sentence; the summary is made up of the top-scoring sentences.
- Most sophisticated: NLP is used to synthesize a summary.
  - Seldom used in IR (cf. work on text summarization).
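As an illustration of the two simpler heuristics above, here is a minimal Python sketch (not part of the original slides); the function names, the 50-word default, and the sentence-scoring scheme are illustrative assumptions rather than a prescribed method.

```python
import re
from collections import Counter

def first_words_summary(text, k=50):
    """Static summary: the first k words of the document,
    computed once and cached at indexing time."""
    return " ".join(text.split()[:k])

def key_sentence_summary(text, num_sentences=2):
    """Static summary from 'key' sentences: score each sentence by the
    average corpus-free frequency of its terms within this document and
    keep the top scorers, presented in their original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    term_counts = Counter(w.lower() for w in re.findall(r"\w+", text))

    def score(sentence):
        terms = re.findall(r"\w+", sentence.lower())
        return sum(term_counts[t] for t in terms) / (len(terms) or 1)

    top = sorted(range(len(sentences)),
                 key=lambda i: score(sentences[i]),
                 reverse=True)[:num_sentences]
    return " ".join(sentences[i] for i in sorted(top))
```

Either summary would be computed and stored alongside the document at indexing time, since neither depends on the query.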
Dynamic summaries
- Present one or more "windows" within the document that contain several of the query terms.
- "KWIC" snippets: keyword-in-context presentation, generated in conjunction with scoring.
- If the query is found as a phrase, show all or some occurrences of the phrase in the doc; if not, show document windows that contain multiple query terms.
- The summary itself gives the entire content of the window: all terms, not only the query terms.

Generating dynamic summaries
- If we have only a positional index, we cannot (easily) reconstruct the context window surrounding hits.
- If we cache the documents at index time, we can find windows in the cached text, cueing from the hit positions found in the positional index (a minimal sketch of this window extraction appears at the end of this section).
  - E.g., the positional index says the query occurs as a phrase at position 4378, so we go to that position in the cached document and stream out the content.
- Most often, only a fixed-size prefix of the doc is cached.
- Note: the cached copy can be outdated.

Dynamic summaries (continued)
- Producing good dynamic summaries is a tricky optimization problem.
- The real estate for the summary is normally small and fixed.
- We want snippets that are long enough to be useful, linguistically well formed, and maximally informative about the document.
- But users really like snippets, even though they complicate IR system design.

Alternative results presentations?
- An active area of HCI research.
- One alternative: http://www.searchme.com copies the idea of Apple's Cover Flow for search results.

Evaluating search engines

Measures for a search engine
- How fast does it index?
  - Number of documents per hour (and average document size)
- How fast does it search?
  - Latency as a function of index size
- Expressiveness of the query language
  - Ability to express complex information needs
  - Speed on complex queries
- Uncluttered UI
- Is it free?

Measures for a search engine (continued)
- All of the preceding criteria are measurable: we can quantify speed and size, and we can make expressiveness precise.
- The key measure is user happiness. What is this?
  - Speed of response and size of index are factors, but blindingly fast, useless answers won't make a user happy.
- We need a way of quantifying user happiness.

Data Retrieval vs Information Retrieval
- DR performance evaluation (after establishing correctness): response time, index space.
- IR performance evaluation: how relevant is the answer set? (Required to establish functional correctness, e.g., through benchmarks.)

Measuring user happiness
- Issue: who is the user we are trying to make happy? It depends on the setting/context.
- Web engine: the user finds what they want and returns to the engine; we can measure the rate of returning users.
- eCommerce site: the user finds what they want and makes a purchase. Is it ...
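To make the snippet-generation step described under "Generating dynamic summaries" concrete, here is a minimal, illustrative Python sketch (not from the slides): given hit positions reported by a positional index, it extracts fixed-width term windows from the document text cached at indexing time. The function and parameter names are invented for illustration.

```python
def kwic_snippet(cached_text, hit_positions, window=10, max_windows=2):
    """Dynamic (query-dependent) summary: for each hit position reported by
    the positional index, return a window of `window` terms on either side
    of the hit, taken from the cached copy of the document."""
    # Assumes the cached text is tokenized the same way as the index.
    tokens = cached_text.split()
    snippets = []
    for pos in hit_positions[:max_windows]:
        if pos >= len(tokens):
            # The cached prefix may be shorter than the full document.
            continue
        start = max(0, pos - window)
        end = min(len(tokens), pos + window + 1)
        snippets.append(" ".join(tokens[start:end]))
    return " ... ".join(snippets)

# Example: the index reports a phrase match at position 4378 in document d,
# so we stream the surrounding words out of the cached copy of d:
# snippet = kwic_snippet(cached_docs[d], [4378], window=10)
```

In a real system the cache and the index must tokenize consistently, and window boundaries would typically be adjusted toward sentence or phrase boundaries to keep snippets linguistically well formed.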