CSCI 5417
Information Retrieval Systems
Jim Martin
Lecture 7
9/13/2011

Today
- Review
- Efficient scoring schemes
- Approximate scoring
- Evaluating IR systems

Normal Cosine Scoring

Speedups...
- Compute the cosines faster
- Don't compute as many cosines

Generic Approach to Reducing Cosines
- Find a set A of contenders, with K < |A| << N
  - A does not necessarily contain the top K, but has many docs from among the top K
- Return the top K docs in A
- Think of A as pruning likely non-contenders

Impact-Ordered Postings
- We really only want to compute scores for docs whose wf_{t,d} is high enough
  - Low scores are unlikely to change the ordering or reach the top K
- So sort each postings list by wf_{t,d}
- How do we compute scores in order to pick off the top K? Two ideas follow.

1. Early Termination
- When traversing t's postings, stop early after either
  - a fixed number of docs, or
  - wf_{t,d} drops below some threshold
- Take the union of the resulting sets of docs from the postings of each query term
- Compute scores only for docs in this union
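A minimal sketch of this idea, assuming each postings list is already sorted by descending wf_{t,d}. The names here (`postings` as a dict from term to a list of `(doc_id, wf)` pairs, `wf_threshold`, `max_docs`) are illustrative, not from the slides, and the scoring step just sums wf contributions; a real implementation would also fold in query weights, IDF, and length normalization.

```python
from collections import defaultdict

def early_termination_candidates(query_terms, postings, wf_threshold=1.0, max_docs=100):
    """Collect the union of high-impact docs from each query term's postings.

    Assumes each postings list is pre-sorted by descending wf_{t,d}.
    """
    candidates = set()
    for term in query_terms:
        for i, (doc_id, wf) in enumerate(postings.get(term, [])):
            # Stop early: either after a fixed number of docs,
            # or once wf drops below the threshold.
            if i >= max_docs or wf < wf_threshold:
                break
            candidates.add(doc_id)
    return candidates

def score_candidates(query_terms, postings, candidates):
    """Compute (simplified) scores only for docs in the candidate union."""
    scores = defaultdict(float)
    for term in query_terms:
        for doc_id, wf in postings.get(term, []):
            if doc_id in candidates:
                scores[doc_id] += wf  # simplification: real cosine scoring adds weighting/normalization
    return scores
```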
2. IDF-ordered terms
- When considering the postings of query terms, look at them in order of decreasing IDF
  - High-IDF terms are likely to contribute most to the score
- As we update the score contribution from each query term, stop if doc scores are relatively unchanged

Evaluation

Evaluation Metrics for Search Engines
- How fast does it index?
  - Number of documents/hour
  - Realtime search
- How fast does it search?
  - Latency as a function of index size
- Expressiveness of query language
  - Ability to express complex information needs
  - Speed on complex queries

Evaluation Metrics for Search Engines
- All of the preceding criteria are measurable: we can quantify speed/size; we can make expressiveness precise
- But the key really is user happiness
  - Speed of response and size of index are factors
  - But blindingly fast, useless answers won't make a user happy
- What makes people come back?
- We need a way of quantifying user happiness

Measuring user happiness
- Issue: who is the user we are trying to make happy?
- Web engine: the user finds what they want and returns often to the engine
  - Can measure the rate of returning users
- eCommerce site: the user finds what they want and makes a purchase
  - Measure time to purchase, or the fraction of searchers who become buyers?

Measuring user happiness
- Enterprise (company/govt/academic): care about "user productivity"
  - How much time do my users save when looking for information?
- Many other criteria having to do with breadth of access, secure access, etc.

Happiness: Difficult to Measure
- The most common proxy for user happiness is the relevance of search results
- But how do you measure relevance?
- We will detail one methodology here, then examine its issues
- Relevance measurement requires 3 elements:
  1. A benchmark document collection
  2. A benchmark suite of queries
  3. A binary assessment of either Relevant or Not Relevant for each query-doc pair
     (some work uses more-than-binary judgments, but that is not typical)

Evaluating an IR system
- The information need is translated into a query
- Relevance is assessed relative to the information need, not the query
- E.g., information need: "I'm looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine."
- Query: wine red white heart attack effective
- You evaluate whether the doc addresses the information need, not whether it has those words

Standard Relevance Benchmarks
- TREC: the National Institute of Standards and Technology (NIST) has run a large IR test-bed for many years
- Reuters and other benchmark doc collections are used
- "Retrieval tasks" are specified, sometimes as queries
- Human experts mark each query-doc pair Relevant or Irrelevant
  - At least for a subset of the docs that some system returned for that query

Unranked Retrieval Evaluation
- As with any such classification task, there are 4 possible system outcomes: a, b, c, and d

                 Relevant    Not Relevant
  Retrieved      a           b
  Not Retrieved  c           d

- a and d represent correct responses
- b and c are mistakes: b is a false positive (Type 1 error), c is a false negative (Type 2 error)
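To make the 2x2 table concrete, here is a small sketch (illustrative, not from the slides; the function name and arguments are assumptions) that tallies the four outcomes a, b, c, d given a retrieved set, the relevant set, and the full collection:

```python
def contingency_counts(retrieved, relevant, collection):
    """Tally the four outcomes from the 2x2 table on the slide."""
    retrieved, relevant = set(retrieved), set(relevant)
    a = len(retrieved & relevant)    # retrieved and relevant
    b = len(retrieved - relevant)    # retrieved, not relevant (false positive)
    c = len(relevant - retrieved)    # relevant, not retrieved (false negative)
    d = len(collection) - a - b - c  # correctly left unretrieved
    return a, b, c, d
```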
Accuracy/Error Rate
- Given a query, an engine classifies each doc as "Relevant" or "Irrelevant"
- Accuracy of an engine: the fraction of these classifications that is correct
  - Accuracy = (a + d) / (a + b + c + d)
  - The number of correct judgments out of all the judgments made
- Why is accuracy useless for evaluating large search engines?

Unranked Retrieval Evaluation: Precision and Recall
- Precision: the fraction of retrieved docs that are relevant = P(relevant | retrieved)
- Recall: the fraction of relevant docs that are retrieved = P(retrieved | relevant)

                 Relevant    Not Relevant
  Retrieved      a           b
  Not Retrieved  c           d

- Precision P = a / (a + b)
- Recall R = a / (a + c)

Precision/Recall
- You can get high recall (but low precision) by retrieving all docs for all queries!
- Recall is a non-decreasing function of the number of docs retrieved
  - That is, recall either stays the same or increases as you return more docs
- In most systems, precision decreases with the number of docs retrieved
  - Or, equivalently, as recall increases
  - A fact with strong empirical confirmation
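A quick sketch of these two claims (the function name and toy data are assumptions, not from the slides): computing precision and recall over the top k of a ranked list shows recall never decreasing as k grows, while precision typically falls.

```python
def precision_recall_at_k(ranking, relevant, k):
    """Precision and recall over the top-k prefix of a ranked result list."""
    retrieved = set(ranking[:k])
    a = len(retrieved & set(relevant))   # relevant docs among those retrieved
    precision = a / k if k else 0.0      # P = a / (a + b)
    recall = a / len(relevant) if relevant else 0.0  # R = a / (a + c)
    return precision, recall

# Toy example: recall is non-decreasing in k; precision tends to drop.
ranking = ["d3", "d7", "d1", "d9", "d2"]
relevant = {"d3", "d1"}
for k in range(1, len(ranking) + 1):
    print(k, precision_recall_at_k(ranking, relevant, k))
```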
Difficulties in Using Precision/Recall