Introduction to Information Retrieval
CS276: Information Retrieval and Web Search
Pandu Nayak and Prabhakar Raghavan
Lecture 8: Evaluation

This lecture (Sec. 6.2)
- How do we know if our results are any good?
  - Evaluating a search engine
  - Benchmarks
  - Precision and recall
- Results summaries: making our good results usable to a user

EVALUATING SEARCH ENGINES

Measures for a search engine (Sec. 8.6)
- How fast does it index?
  - Number of documents/hour
  - (Average document size)
- How fast does it search?
  - Latency as a function of index size
- Expressiveness of query language
  - Ability to express complex information needs
  - Speed on complex queries
- Uncluttered UI
- Is it free?

Measures for a search engine (Sec. 8.6)
- All of the preceding criteria are measurable: we can quantify speed and size, and we can make expressiveness precise
- The key measure: user happiness
  - What is this?
  - Speed of response and size of index are factors
  - But blindingly fast, useless answers won't make a user happy
- Need a way of quantifying user happiness

Measuring user happiness (Sec. 8.6.2)
- Issue: who is the user we are trying to make happy? Depends on the setting
- Web engine:
  - User finds what s/he wants and returns to the engine
    - Can measure rate of return users
  - User completes task: search as a means, not end
    - See Russell: http://dmrussell.googlepages.com/JCDL-talk-June-2007-short.pdf
- eCommerce site: user finds what s/he wants and buys
  - Is it the end user or the eCommerce site whose happiness we measure?
  - Measure time to purchase, or fraction of searchers who become buyers?

Measuring user happiness (Sec. 8.6.2)
- Enterprise (company/govt/academic): care about "user productivity"
  - How much time do my users save when looking for information?
  - Many other criteria having to do with breadth of access, secure access, etc.

Happiness: elusive to measure (Sec. 8.1)
- Most common proxy: relevance of search results
- But how do you measure relevance?
- We will detail a methodology here, then examine its issues
- Relevance measurement requires 3 elements:
  1. A benchmark document collection
  2. A benchmark suite of queries
  3. A usually binary assessment of either Relevant or Nonrelevant for each query and each document
     (some work uses more-than-binary assessments, but this is not the standard)

Evaluating an IR system (Sec. 8.1)
- Note: the information need is translated into a query
- Relevance is assessed relative to the information need, not the query
- E.g., information need: "I'm looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine."
- Query: wine red white heart attack effective
- Evaluate whether the doc addresses the information need, not whether it has these words

Standard relevance benchmarks (Sec. 8.2)
- TREC: National Institute of Standards and Technology (NIST) has run a large IR test bed for many years
- Reuters and other benchmark doc collections used
- "Retrieval tasks" specified, sometimes as queries
- Human experts mark, for each query and for each doc, Relevant or Nonrelevant (or at least for a subset of docs that some system returned for that query)

Unranked retrieval evaluation: precision and recall (Sec. 8.3)
- Precision: fraction of retrieved docs that are relevant = P(relevant | retrieved)
- Recall: fraction of relevant docs that are retrieved = P(retrieved | relevant)

                  Retrieved    Not retrieved
    Relevant      tp           fn
    Nonrelevant   fp           tn

- Precision P = tp / (tp + fp)
- Recall    R = tp / (tp + fn)
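To make the two definitions concrete, here is a minimal Python sketch (an illustration added here, not part of the original slides) that computes precision and recall directly from the contingency counts tp, fp, and fn; the counts in the example calls are made-up numbers.

    def precision(tp: int, fp: int) -> float:
        """Fraction of retrieved documents that are relevant: tp / (tp + fp)."""
        return tp / (tp + fp) if tp + fp > 0 else 0.0

    def recall(tp: int, fn: int) -> float:
        """Fraction of relevant documents that are retrieved: tp / (tp + fn)."""
        return tp / (tp + fn) if tp + fn > 0 else 0.0

    # Hypothetical query: 30 relevant docs retrieved, 20 nonrelevant retrieved,
    # 70 relevant docs missed by the engine.
    print(precision(tp=30, fp=20))  # 0.6
    print(recall(tp=30, fn=70))     # 0.3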
Should we instead use the accuracy measure for evaluation? (Sec. 8.3)
- Given a query, an engine classifies each doc as "Relevant" or "Nonrelevant"
- The accuracy of an engine: the fraction of these classifications that are correct
  - Accuracy = (tp + tn) / (tp + fp + fn + tn)
- Accuracy is a commonly used evaluation measure in machine learning classification work
- Why is this not a very useful evaluation measure in IR?

Why not just use accuracy? (Sec. 8.3)
- How to build a 99.9999% accurate search engine on a low budget: answer every search with "0 matching results found."
- People doing information retrieval want to find something, and have a certain tolerance for junk
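A small worked example (an added illustration, not from the slides) of why accuracy is uninformative here. Assume a hypothetical collection of 10,000,000 documents of which only 10 are relevant to the query: the engine that returns nothing classifies every document as Nonrelevant and still scores near-perfect accuracy, while its recall is zero.

    def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
        """Fraction of Relevant/Nonrelevant classifications that are correct."""
        return (tp + tn) / (tp + fp + fn + tn)

    # Hypothetical skewed collection: 10,000,000 docs, only 10 relevant to the query.
    # The engine that always answers "0 matching results found" retrieves nothing.
    print(accuracy(tp=0, fp=0, fn=10, tn=9_999_990))  # 0.999999, yet recall = 0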


Precision/Recall (Sec. 8.3)
- You can get high recall (but low precision) by retrieving all docs for all queries!
- Recall is a non-decreasing function of the number of docs retrieved
- In a good system, precision decreases as either the number of docs retrieved or recall increases
  - This is not a theorem, but a result with strong empirical confirmation

Difficulties in using precision/recall (Sec. 8.3)
- Should average over large document collection/query ensembles
- Need human relevance assessments
  - People aren't reliable assessors
- Assessments have to be binary
  - Nuanced assessments?
- Heavily skewed by collection/authorship
  - Results may not translate from one domain to another

A combined measure: F (Sec. 8.3)
- Combined measure that assesses the precision/recall tradeoff is the F measure (weighted harmonic mean):

    F = 1 / (α(1/P) + (1 - α)(1/R)) = (β² + 1)PR / (β²P + R),  where β² = (1 - α)/α

- People usually use the balanced F1 measure, i.e., with β = 1 (or α = 1/2)
- Harmonic mean is a conservative average
- See C. J. van Rijsbergen, Information Retrieval

F1 and other averages (Sec. 8.3)
[Figure: combined measures (minimum, maximum, arithmetic mean, geometric mean, harmonic mean) plotted as precision varies from 0 to 100, with recall fixed at 70]
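A minimal sketch (an added illustration, not from the slides) of the F measure as a weighted harmonic mean, using made-up precision/recall values to show how much more conservative it is than the arithmetic mean when one of the two quantities is low.

    def f_measure(p: float, r: float, beta: float = 1.0) -> float:
        """Weighted harmonic mean of precision and recall:
        (beta^2 + 1) * P * R / (beta^2 * P + R); beta = 1 gives balanced F1."""
        if p == 0.0 and r == 0.0:
            return 0.0
        b2 = beta * beta
        return (b2 + 1) * p * r / (b2 * p + r)

    # Hypothetical values: high precision, low recall.
    p, r = 0.9, 0.1
    print((p + r) / 2)      # arithmetic mean = 0.5, hides the poor recall
    print(f_measure(p, r))  # balanced F1 = 0.18, stays close to the weaker value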

Evaluating ranked results (Sec. 8.4)
- Evaluation of ranked results:
  - The system can return any number of results
  - By taking various numbers of the top returned ...
