Wright CS 707 - Evaluation of IR Systems

Evaluation of IR Systems
Adapted from lectures by Prabhakar Raghavan (Yahoo! and Stanford) and Christopher Manning (Stanford).

Contents: Evaluation of IR Systems; This lecture; Result Summaries; Summaries; Static summaries; Dynamic summaries; Generating dynamic summaries; Alternative results presentations?; Evaluating search engines; Measures for a search engine; Data Retrieval vs Information Retrieval; Measuring user happiness; Happiness: elusive to measure; Evaluating an IR system; Difficulties with gauging Relevancy; Standard relevance benchmarks; Unranked retrieval evaluation: Precision and Recall; Precision and Recall; Precision and Recall in Practice; Should we instead use the accuracy measure for evaluation?; Why not just use accuracy?; Precision/Recall Trade-offs; Difficulties in using precision/recall; A combined measure: F (aka E Measure, parameterized F Measure); F1 and other averages; Breakeven Point; Evaluating ranked results; Computing Recall/Precision Points: An Example; A precision-recall curve; Interpolating a Recall/Precision Curve; Average Recall/Precision Curve; Evaluation Metrics (cont'd); Typical (good) 11-point precisions; 11-point precisions; Receiver Operating Characteristic (ROC) Curve; Mean Average Precision (MAP); Average Precision; Summarize a Ranking: MAP (cont'd); Discounted Cumulative Gain; Summarize a Ranking: DCG; DCG Example; Summarize a Ranking: NDCG; NDCG Example; R-Precision; Variance; Test Collections; Creating Test Collections for IR Evaluation; From document collections to test collections; Can we avoid human judgment?; Approximate vector retrieval; Alternative proposal; Kappa measure for inter-judge (dis)agreement; Kappa Measure: Example; Kappa Example; Other Evaluation Measures; Fallout Rate; Subjective Relevance Measure; Other Factors to Consider; Early Test Collections; Critique of pure relevance; Evaluation at large search engines; A/B testing; TREC Benchmarks; The TREC Benchmark; The TREC Objectives; TREC Advantages; TREC Tasks; TREC; Standard relevance benchmarks: Others; Characteristics of the TREC Collection; More Details on Document Collections; TREC Disks 4 and 5; Sample Document (with SGML); Sample Query (with SGML); TREC Properties; Two more TREC Document Examples; Another Example of TREC Topic/Query; Evaluation; Cystic Fibrosis (CF) Collection; CF Document Fields.

This lecture
- Results summaries: making our good results usable to a user.
- How do we know if our results are any good? Evaluating a search engine:
  - Benchmarks
  - Precision and recall

Result Summaries
- Having ranked the documents matching a query, we wish to present a results list.
- Most commonly, this is a list of the document titles plus a short summary, aka "10 blue links".

Summaries
- The title is typically automatically extracted from document metadata.
- What about the summaries? This description is crucial: users identify good/relevant hits based on it.
- Two basic kinds:
  - A static summary of a document is always the same, regardless of the query that hit the doc.
  - A dynamic summary is a query-dependent attempt to explain why the document was retrieved for the query at hand.

Static summaries
- In typical systems, the static summary is a subset of the document.
- Simplest heuristic: the first 50 (or so; this can be varied) words of the document, with the summary cached at indexing time.
- More sophisticated: extract from each document a set of "key" sentences.
  - Simple NLP heuristics score each sentence; the summary is made up of the top-scoring sentences.
  - (Both of these heuristics are sketched in code after the dynamic-summary slides below.)
- Most sophisticated: NLP is used to synthesize a summary. Seldom used in IR (cf. work on text summarization).

Dynamic summaries
- Present one or more "windows" within the document that contain several of the query terms ("KWIC" snippets: KeyWord In Context presentation).
- Generated in conjunction with scoring:
  - If the query is found as a phrase, show all or some occurrences of the phrase in the doc.
  - If not, show document windows that contain multiple query terms.
- The summary gives the entire content of the window - all terms, not only the query terms.

Generating dynamic summaries
- If we have only a positional index, we cannot (easily) reconstruct the context window surrounding hits.
- If we cache the documents at index time, we can find windows in them, cueing from the hits found in the positional index.
  - E.g., the positional index says "the query occurs as a phrase at position 4378", so we go to that position in the cached document and stream out the content.
- Most often, only a fixed-size prefix of each document is cached.
- Note: the cached copy can be outdated.

Dynamic summaries (cont'd)
- Producing good dynamic summaries is a tricky optimization problem:
  - The real estate for the summary is normally small and fixed.
  - We want snippets that are long enough to be useful, linguistically well formed, and maximally informative about the doc.
- But users really like snippets, even if they complicate IR system design.
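The static-summary heuristics above are simple enough to sketch. The following is a minimal, illustrative Python sketch rather than the lecture's implementation: static_summary_prefix is the "first ~50 words" heuristic, and static_summary_key_sentences is a crude key-sentence scorer whose particular scoring features and limits are assumptions chosen for the example.

```python
import re

def static_summary_prefix(text, max_words=50):
    """Simplest heuristic: the first ~50 words of the document (cached at indexing time)."""
    return " ".join(text.split()[:max_words])

def static_summary_key_sentences(text, num_sentences=2):
    """Pick a few 'key' sentences using crude scoring heuristics (position, length, form)."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())

    def score(index, sentence):
        words = sentence.split()
        s = 2.0 if index == 0 else 0.0                          # leading sentences often summarize
        s += min(len(words), 25) / 25.0                         # prefer reasonably long sentences
        s += 0.5 if words and words[0][:1].isupper() else 0.0   # looks well formed
        return s

    ranked = sorted(enumerate(sentences), key=lambda p: score(*p), reverse=True)
    chosen = sorted(i for i, _ in ranked[:num_sentences])       # keep document order
    return " ".join(sentences[i] for i in chosen)

if __name__ == "__main__":
    doc = ("Evaluation of IR systems asks how well a search engine meets an information need. "
           "Precision and recall are the classic unranked measures. "
           "MAP and NDCG summarize the quality of a ranked result list.")
    print(static_summary_prefix(doc, max_words=12))
    print(static_summary_key_sentences(doc))
```

In a real system the chosen prefix or sentences would be computed once at indexing time and stored alongside the document, exactly so that no extra work is needed at query time.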
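Dynamic (KWIC) snippet generation can be sketched in the same spirit. This assumes the document text (or a cached prefix of it) is available and finds hit positions by rescanning whitespace tokens rather than reading a real positional index; the window size, punctuation stripping, and *term* highlighting are illustrative choices, not part of the lecture.

```python
def kwic_snippet(cached_text, query_terms, window=10):
    """Build a KeyWord-In-Context snippet: the window of the cached document
    around the hit position that covers the most distinct query terms."""
    def norm(tok):
        return tok.lower().strip(".,;:!?")

    tokens = cached_text.split()
    terms = {t.lower() for t in query_terms}

    # Positions of query-term hits; a real system would take these from the
    # positional index instead of rescanning the cached text.
    hits = [i for i, tok in enumerate(tokens) if norm(tok) in terms]
    if not hits:
        return " ".join(tokens[:window]) + " ..."    # fall back to a static prefix

    def coverage(center):
        lo, hi = max(0, center - window // 2), center + window // 2 + 1
        return len({norm(tok) for tok in tokens[lo:hi]} & terms)

    best = max(hits, key=coverage)
    lo, hi = max(0, best - window // 2), best + window // 2 + 1

    # Emit the whole window (all terms, not only the query terms), marking hits.
    out = ["*" + tok + "*" if norm(tok) in terms else tok for tok in tokens[lo:hi]]
    return ("... " if lo > 0 else "") + " ".join(out) + (" ..." if hi < len(tokens) else "")

if __name__ == "__main__":
    doc = ("We evaluate information retrieval systems against benchmarks. "
           "Precision and recall summarize unranked retrieval quality, "
           "while MAP and NDCG summarize ranked retrieval quality.")
    print(kwic_snippet(doc, ["precision", "recall"], window=8))
```

The fallback to a static prefix when no query term occurs in the cached portion mirrors the point above: if only a fixed-size prefix is cached, some hits simply cannot be shown in context.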
Alternative results presentations?
- An active area of HCI research.
- One alternative: http://www.searchme.com/ copies the idea of Apple's Cover Flow for search results.

Evaluating search engines

Measures for a search engine
- How fast does it index?
  - Number of documents per hour (for a given average document size).
- How fast does it search?
  - Latency as a function of index size.
- Expressiveness of the query language:
  - Ability to express complex information needs.
  - Speed on complex queries.
- Uncluttered UI.
- Is it free?
(A toy harness for the two speed measures is sketched at the end of this section.)

Measures for a search engine (cont'd)
- All of the preceding criteria are measurable: we can quantify speed and size, and we can make expressiveness precise.
- The key measure, however, is user happiness. What is this?
  - Speed of response and size of the index are factors, but blindingly fast, useless answers won't make a user happy.
- We need a way of quantifying user happiness.

Data Retrieval vs Information Retrieval
- DR performance evaluation (after establishing correctness):
  - Response time
  - Index space
- IR performance evaluation:
  - How relevant is the answer set? (Required to establish functional correctness, e.g., through benchmarks.)

Measuring user happiness
- Issue: who is the user we are trying to make happy? It depends on the setting/context.
- Web search engine: the user finds what they want and returns to the engine.
  - We can measure the rate of return users.
- eCommerce site: the user finds what they want and makes a purchase.
- Is it
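Returning to the "Measures for a search engine" slide: the two speed criteria (documents indexed per hour, and query latency) are straightforward to instrument. Below is a toy sketch under stated assumptions; build_index and run_query are hypothetical stand-ins for a real engine's API, and the trivial inverted index exists only to make the harness runnable.

```python
import time

def docs_per_hour(build_index, docs):
    """Indexing throughput: documents indexed per hour on this collection."""
    start = time.perf_counter()
    index = build_index(docs)
    elapsed = time.perf_counter() - start
    return index, len(docs) / elapsed * 3600.0

def mean_latency_ms(run_query, index, queries):
    """Mean search latency in milliseconds over a batch of queries."""
    start = time.perf_counter()
    for q in queries:
        run_query(index, q)
    return (time.perf_counter() - start) / len(queries) * 1000.0

if __name__ == "__main__":
    # Trivial stand-in engine: a term -> set-of-doc-ids inverted index with AND queries.
    def build_index(docs):
        index = {}
        for doc_id, text in enumerate(docs):
            for term in set(text.lower().split()):
                index.setdefault(term, set()).add(doc_id)
        return index

    def run_query(index, query):
        postings = [index.get(t, set()) for t in query.lower().split()]
        return set.intersection(*postings) if postings else set()

    docs = ["precision and recall", "mean average precision", "discounted cumulative gain"] * 1000
    index, throughput = docs_per_hour(build_index, docs)
    print(f"indexing throughput: {throughput:,.0f} docs/hour")
    print(f"mean query latency: {mean_latency_ms(run_query, index, ['precision recall', 'gain'] * 50):.4f} ms")
```

As the slides note, such speed and size numbers are necessary but not sufficient: they say nothing about whether the returned answers actually make the user happy.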

