CU-Boulder CSCI 5417 - Lecture 8

This preview shows page 1-2-3-26-27-28 out of 28 pages.

CSCI 5417 Information Retrieval Systems
Jim Martin
Lecture 8, 9/15/2011
(Slide footer: 9/19/11, CSCI 5417 - IR)

Today 9/15
- Finish evaluation discussion
- Query improvement
  - Relevance feedback
  - Pseudo-relevance feedback
  - Query expansion

Evaluation
- Summary measures
  - Precision at fixed retrieval level
    - Perhaps most appropriate for web search: all people want are good matches on the first one or two results pages
    - But has an arbitrary parameter of k
  - 11-point interpolated average precision
    - The standard measure in the TREC competitions: you take the precision at 11 levels of recall varying from 0 to 1 by tenths of the documents, using interpolation (the value for 0 is always interpolated), and average them
    - Evaluates performance at all recall levels

Typical (good) 11-point precisions
- SabIR/Cornell 8A1 11pt precision from TREC 8 (1999)
[Figure: precision-recall curve; x-axis Recall (0 to 1), y-axis Precision (0 to 1)]

Yet more evaluation measures
- Mean average precision (MAP)
  - Average of the precision value obtained for the top k documents, each time a relevant doc is retrieved
  - Avoids interpolation, use of fixed recall levels
  - MAP for query collection is arithmetic avg.
    - Macro-averaging: each query counts equally

Recall/Precision

Rank            1    2    3    4    5    6    7    8    9   10
Relevant?       R    N    N    R    R    N    R    N    N    N
Recall (%)     10   10   10   20   30   30   40   40   40   40
Precision (%) 100   50   33   50   60   50   57   50   44   40

MAP: (100 + 50 + 60 + 57) / 4 = 66.75

Variance
- For a test collection, it is usual that a system does poorly on some information needs (e.g., MAP = 0.1) and excellently on others (e.g., MAP = 0.7)
- Indeed, it is usually the case that the variance in performance of the same system across queries is much greater than the variance of different systems on the same query
- That is, there are easy information needs and hard ones!

Finally
- All of these measures are used for distinct comparison purposes
  - System A vs. System B
  - System A 1.1 vs. System A 1.2
  - Approach A vs. Approach B
    - Vector space approach vs. probabilistic approaches
  - Systems on different collections
    - System A on med vs. trec vs. web text
- They don't represent absolute measures

From corpora to test collections
- Still need
  - Test queries
  - Relevance assessments
- Test queries
  - Must be germane to docs available
  - Best designed by domain experts
  - Random query terms generally not a good idea
- Relevance assessments
  - Human judges, time-consuming
  - Human panels are not perfect

Pooling
- With large datasets it's impossible to really assess recall
  - You would have to look at every document
- So TREC uses a technique called pooling
  - Run a query on a representative set of state-of-the-art retrieval systems
  - Take the union of the top N results from these systems
  - Have the analysts judge the relevant docs in this set

TREC
- TREC Ad Hoc task from first 8 TRECs is standard IR task
  - 50 detailed information needs a year
  - Human evaluation of pooled results returned
  - More recently other related things: Web track, HARD, Bio, Q&A
- A TREC query (TREC 5)

  <top>
  <num> Number: 225
  <desc> Description:
  What is the main function of the Federal Emergency Management Agency (FEMA) and the funding level provided to meet emergencies? Also, what resources are available to FEMA such as people, equipment, facilities?
  </top>

Critique of Pure Relevance
- Relevance vs. marginal relevance
  - A document can be redundant even if it is highly relevant
  - Duplicates
  - The same information from different sources
  - Marginal relevance is a better measure of utility for the user
- Using facts/entities as evaluation units more directly measures true relevance
  - But harder to create evaluation set

Search Engines
- How does any of this apply to the big search engines?

Evaluation at large search engines
- Recall is difficult to measure for the web
- Search engines often use precision at top k, e.g., k = 10
- Or measures that reward you more for getting rank 1 right than for getting rank 10 right
  - NDCG (Normalized Discounted Cumulative Gain)
- Search engines also use non-relevance-based measures
  - Clickthrough on first result
    - Not very reliable if you look at a single clickthrough, but pretty reliable in the aggregate
  - Studies of user behavior in the lab
  - A/B testing
  - Focus groups
  - Diary studies

A/B testing
- Purpose: test a single innovation
- Prerequisite: you have a system up and running
- Have most users use old system
- Divert a small proportion of traffic (e.g., 1%) to the new system that includes the innovation
- Evaluate with an automatic measure like clickthrough on first result
- Now we can directly see if the innovation does improve user happiness
- Probably the evaluation methodology that large search engines trust most

Query to think about
- E.g., information need: I'm looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine.
- Query: wine red white heart attack effective

Sources of Errors (unranked)

                 Relevant   Not Relevant
Retrieved           a            b
Not Retrieved       c            d

- What's happening in boxes c and b?

Retrieved/Not Relevant (b)
- Documents are retrieved but are found to be not relevant
- Term overlap between query and doc, but not relevant overlap
  - About other topics entirely
    - Terms in isolation are on target
    - Terms are homonymous (off target)
  - About the topic but peripheral to information need

Not Retrieved/Relevant (c)
- No overlap in terms between the query and docs (zero hits)
  - Documents and users using different vocabulary
    - Synonymy: automobile vs. car
    - HIV vs. AIDS
- Overlap but not enough
  - Problem with weighting schemes (tf-idf)
  - Problem with similarity metric (cosine)

Ranked Results
- Contingency tables are somewhat limited as tools because they're cast in terms of retrieved/not retrieved
  - That's rarely the case in ranked retrieval
- Problems b and c are duals of the same problem
  - Why was this irrelevant document ranked higher than this relevant document?
    - Why was this irrelevant doc ranked so high?
    - Why was this relevant doc ranked so low?

Discussion

Examples: Query

  <top>
  <num> Number: OHSU42
  <title> 43 y o pt with delirium, hypertension, tachycardia
  <desc> Description:
  thyrotoxicosis, diagnosis and management
  </top>

Examples: Doc 1

  W: A 57-year-old woman presented with palpitations, muscle weakness, bilateral proptosis, goiter, and tremor. The thyroxine (T4) level and the free T4 index were increased while the total triiodothyronine (T3) level was normal. Iodine
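
The precision-at-k and average-precision calculations from the lecture's ten-document worked ranking (R N N R R N R N N N) can be sketched in Python as follows. This is a minimal illustration, not code from the course: the function names are hypothetical, and average precision here divides by the number of relevant documents actually retrieved, which is how the worked example arrives at roughly 66.75. Some definitions divide by the total number of relevant documents in the collection instead.

```python
def precision_at_k(relevance, k):
    """Fraction of the top-k results that are relevant (relevance is a 0/1 list)."""
    return sum(relevance[:k]) / k


def average_precision(relevance):
    """Mean of the precision values taken at each rank where a relevant doc appears.

    Assumption: divides by the number of relevant docs retrieved,
    matching the worked example in the lecture.
    """
    hits = 0
    precisions = []
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(rankings):
    """Arithmetic mean of per-query average precision (each query counts equally)."""
    return sum(average_precision(r) for r in rankings) / len(rankings)


# The lecture's ranking: R N N R R N R N N N
ranking = [1, 0, 0, 1, 1, 0, 1, 0, 0, 0]

p_at_5 = precision_at_k(ranking, 5)   # 3 relevant in the top 5 -> 0.6
ap = average_precision(ranking)       # (1 + 1/2 + 3/5 + 4/7) / 4, about 0.668
```

The small gap between 0.668 here and 66.75 in the lecture comes from the lecture rounding 4/7 to 57% before averaging.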

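
The 11-point interpolated average precision described in the lecture can be sketched the same way. The interpolation rule used here (interpolated precision at recall level r is the highest precision observed at any recall at or above r) is the usual convention and is an assumption, since the slides do not spell it out. The total of 10 relevant documents is inferred from the lecture's recall row, where the first relevant hit gives 10% recall. The function name is hypothetical.

```python
def eleven_point_interpolated(relevance, total_relevant):
    """Interpolated precision at recall levels 0.0, 0.1, ..., 1.0.

    relevance: 0/1 list in rank order.
    total_relevant: number of relevant docs in the collection for this query.
    """
    # Record (relevant docs seen so far, precision) at every rank.
    points = []
    hits = 0
    for rank, rel in enumerate(relevance, start=1):
        hits += rel
        points.append((hits, hits / rank))

    curve = []
    for level in range(11):  # recall level = level / 10
        # Integer comparison avoids floating-point trouble:
        # hits / total_relevant >= level / 10
        reachable = [p for h, p in points if h * 10 >= level * total_relevant]
        curve.append(max(reachable, default=0.0))
    return curve


ranking = [1, 0, 0, 1, 1, 0, 1, 0, 0, 0]
curve = eleven_point_interpolated(ranking, total_relevant=10)
eleven_point_avg = sum(curve) / len(curve)
```

For this ranking the curve drops to zero from recall 0.5 onward, because only 4 of the 10 relevant documents are ever retrieved; that is why the 11-point average is much lower than the average precision over retrieved relevant documents.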
