Princeton COS 435 - Evaluation of Retrieval Systems


Evaluation of Retrieval Systems

Performance Criteria
1. Expressiveness of query language
   • Can the query language capture information needs?
2. Quality of search results
   • Relevance to users' information needs
3. Usability
   • Search interface
   • Results page format
   • Other?
4. Efficiency
   – Speed affects usability
   – Overall efficiency affects cost of operation
5. Other?

Quantitative evaluation
• Concentrate on quality of search results
• Goals for a measure
   – Capture relevance to the user's information need
   – Allow comparison between results of different systems
• Measures defined for sets of documents returned
• More generally, a "document" could be any information object

Core measures: Precision and Recall
• Need a binary evaluation by a human judge of each retrieved document as relevant/irrelevant
• Need to know the complete set of relevant documents within the collection being searched
• Recall = (# relevant documents retrieved) / (# relevant documents)
• Precision = (# relevant documents retrieved) / (# retrieved documents)

Combine recall and precision
• F-score (aka F-measure) defined to be the harmonic mean‡ of precision and recall:
   F = 2 * recall * precision / (precision + recall)
‡ The harmonic mean h of two numbers m and n satisfies (n-h)/n = (h-m)/m; equivalently, (1/m) - (1/h) = (1/h) - (1/n).

Use in "modern times"
• Defined in the 1950s
• For small collections, these measures make sense
• For large collections,
   – Rarely know the complete set of relevant documents
   – Rarely could return the complete set of relevant documents
• For large collections
   – Rank returned documents
   – Use the ranking!

Ranked result list
• At any point along the ranked list
   – Can look at precision so far
   – Can look at recall so far
      • if the total # of relevant docs is known
      • Google's "about N results" is an inadequate estimate
• Can focus on the points at which relevant docs appear
   – If the mth doc in the ranking is the kth relevant doc so far, precision is k/m
• No a priori ranking on relevant docs

Plot: precision versus recall
• Choose standard recall levels r1, r2, …
   – E.g. 10%, 20%, …
   – Define "precision at recall level rj":
      p(rj) = max, over all r with rj ≤ r < rj+1, of the precision when recall r is achieved
• Similar to Intro IR "interpolated precision"

[Figure reproduced from the presentation "Overview of TREC 2004" by Ellen Voorhees, available from the TREC presentations Web site: trec.nist.gov/presentations/TREC2004/04overview.pdf]

Single number characterizations
• Can look at precision at one fixed critical position: "Precision at k"
   – If we know there are R relevant documents, can choose k = R
• May not want to look that far even if R is known
   – Can choose a set of S relevant docs, and calculate precision at k = S only with respect to these docs
      • "R-precision" of Intro IR
   – For Web search
      • Choose k to be the number of pages people look at
      • k = ? What are you expecting?

Single number characterizations, cont.
1) Record the precision at each point a relevant document is encountered through the ranked list
   • Don't need to know all relevant docs
   • Can cut off the ranked list at a predetermined rank
2) Average the recorded precisions in (1)
   = average precision for a query result
Mean Average Precision (MAP):
For a set of test queries, take the mean (i.e. average) of the average precision for each query
• Compare retrieval systems with MAP (a small worked sketch of these measures appears after the "Using Measures" slide below)

Using Measures
• Statistical significance versus meaningfulness
• Use more than one measure
• Need some set of relevant docs even if we don't have the complete set. How?
   – Look at TREC studies
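To make the measures above concrete, here is a minimal Python sketch (not part of the course materials or the official trec_eval tool) that computes precision, recall, F-score, precision at k, average precision, and MAP for hypothetical ranked result lists. The document ids and judged-relevant sets are invented for illustration, and average precision follows the slide's wording (average the recorded precisions); the usual TREC definition divides by the total number of relevant documents instead.

```python
# Minimal sketch (not the official trec_eval tool) of the measures defined above,
# applied to hypothetical ranked result lists and judged-relevant sets.

def precision_recall_f(retrieved, relevant):
    """Set-based precision, recall, and F-score (harmonic mean of P and R)."""
    hits = len(set(retrieved) & set(relevant))
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f_score = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f_score

def precision_at_k(ranking, relevant, k):
    """Precision over the top k documents of a ranked list."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k

def average_precision(ranking, relevant):
    """Average of the precisions recorded at each rank where a relevant doc appears:
    if the m-th doc is the k-th relevant one seen so far, record k/m.
    Follows the slide's wording (average the recorded precisions); the common TREC
    variant divides by the total number of relevant documents instead."""
    seen, recorded = 0, []
    for m, doc in enumerate(ranking, start=1):
        if doc in relevant:
            seen += 1
            recorded.append(seen / m)
    return sum(recorded) / len(recorded) if recorded else 0.0

def mean_average_precision(runs):
    """MAP: mean of the per-query average precisions.
    `runs` is a list of (ranking, relevant_set) pairs, one per test query."""
    return sum(average_precision(rank, rel) for rank, rel in runs) / len(runs)

# Hypothetical example: two queries with made-up doc ids and relevance judgments.
q1 = (["d3", "d7", "d1", "d9", "d4"], {"d3", "d1", "d4", "d8"})
q2 = (["d2", "d5", "d6"], {"d5"})
print(precision_recall_f(q1[0], q1[1]))   # P, R, F over the five returned docs
print(precision_at_k(q1[0], q1[1], 3))    # precision at k = 3
print(average_precision(*q1))             # average precision for query 1
print(mean_average_precision([q1, q2]))   # MAP over the two test queries
```

Running the sketch prints the set-based measures for the first query, its P@3 and average precision, and the MAP over both test queries, matching the definitions on the slides above.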
Relevance by TREC method
Text REtrieval Conference (TREC), 1992 to present
• Fixed collection per "track"
   – E.g. "*.gov", CACM articles
• Each competing search engine for a track is asked to retrieve documents on several "topics"
   – Search engine turns the topic into a query
   – Topic description has a clear statement of what is to be considered relevant by the human judge

Sample TREC 3 topic:
<num> Number: 168
<title> Topic: Financing AMTRAK
<desc> Description:
A document will address the role of the Federal Government in financing the operation of the National Railroad Transportation Corporation (AMTRAK).
<narr> Narrative:
A relevant document must provide information on the government's responsibility to make AMTRAK an economically viable entity. It could also discuss the privatization of AMTRAK as an alternative to continuing government subsidies. Documents comparing government subsidies given to air and bus transportation with those provided to AMTRAK would also be relevant.
</top>
As appeared in "Overview of the Sixth Text REtrieval Conference (TREC-6)," E. M. Voorhees and D. Harman, in NIST Special Publication 500-240: The Sixth Text REtrieval Conference, 1997.

Sample TREC 7 topic:
<num> Number: 396
<title> sick building syndrome
<desc> Description:
Identify documents that discuss sick building syndrome or building-related illnesses.
<narr> Narrative:
A relevant document would contain any data that refers to sick building or building-related illnesses, including illnesses caused by asbestos, air conditioning, or pollution controls. Work-related illnesses not caused by the building, such as carpal tunnel syndrome, are not relevant.
From "Overview of the Seventh Text REtrieval Conference (TREC-7)," E. M. Voorhees and D. Harman, in NIST Special Publication 500-242: The Seventh Text REtrieval Conference, 1998.

Pooling
• Human judges can't look at all docs in the collection: thousands to millions
• Pooling chooses a subset of the collection's docs for human judges to rate the relevance of
• Assume docs not in the pool are not relevant

How construct pool for a topic?
Let the competing search engines decide (a minimal code sketch appears at the end of this preview):
• Choose a parameter k (typically 100)
• Take the top k docs as ranked by each search engine
• Pool = union of these sets of docs
   – Between k and (# search engines) * k docs in the pool
• Give the pool to judges for relevance scoring

Pooling, cont.
• The (k+1)st doc returned by one search engine is either assumed irrelevant or was ranked higher (within the top k) by another search engine in the competition
• In the competition, each search engine is judged on its results for the top r > k docs returned

[Figure reproduced from the presentation "Overview of TREC 2004" by Ellen Voorhees, available from the TREC presentations Web site: trec.nist.gov/presentations/TREC2004/04overview.pdf]

Web search evaluation
• There are different kinds of queries, identified in the TREC Web Track; some are:
   – Ad hoc
   – Topic distillation: set of key resources is small; 100% recall?
   – Home page: # relevant pages = 1 (except mirrors)
   – Distinguish for competitors or just judges?
• Andrei Broder gave similar categories
   – Informational
      • Broad research
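As an illustration of the pooling procedure on the slides above, here is a minimal Python sketch under the assumption that each system's run is just a ranked list of document ids. The runs, document ids, judgments, and the tiny pool depth are invented for the example; this is not NIST's actual tooling, and TREC typically uses k around 100.

```python
# Minimal sketch of TREC-style pooling for one topic (illustrative only, not NIST's tooling).

def build_pool(runs, k=100):
    """Pool = union of the top-k documents from each competing system's ranked run.
    The pool therefore holds between k and (# of systems) * k documents."""
    pool = set()
    for ranking in runs:
        pool.update(ranking[:k])
    return pool

def precision_at_r_pooled(ranking, judged_relevant, r):
    """Score one system on its top r results; documents that were never pooled
    (and hence never judged) count as irrelevant, per the pooling assumption."""
    top_r = ranking[:r]
    return sum(1 for doc in top_r if doc in judged_relevant) / len(top_r) if top_r else 0.0

# Hypothetical runs from three systems over a tiny collection; k is kept small here.
runs = [["d1", "d4", "d2", "d9"], ["d4", "d3", "d1", "d7"], ["d5", "d4", "d8", "d2"]]
pool = build_pool(runs, k=2)              # only these docs would go to the human judges
judged_relevant = {"d1", "d4"} & pool     # made-up judgments, restricted to the pool
print(sorted(pool))
print([precision_at_r_pooled(run, judged_relevant, r=3) for run in runs])
```

Because the judged-relevant set is built only from pooled documents, any document outside every system's top k is automatically scored as irrelevant, which is exactly the assumption the pooling slides state.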

