UMD LBSC 796 - LBSC 796 Lecture 2

LBSC 796/INFM 718R: Week 2
Evaluation
Jimmy Lin
College of Information Studies, University of Maryland
Monday, February 6, 2006

IR is an experimental science!
- Formulate a research question: the hypothesis
- Design an experiment to answer the question
- Perform the experiment
  - Compare with a baseline "control"
- Does the experiment answer the question?
  - Are the results significant? Or is it just luck?
- Report the results!
- Rinse, repeat…

Questions About the Black Box
- Example "questions":
  - Does morphological analysis improve retrieval performance?
  - Does expanding the query with synonyms improve retrieval performance?
- Corresponding experiments:
  - Build a "stemmed" index and compare against an "unstemmed" baseline
  - Expand queries with synonyms and compare against baseline unexpanded queries

Questions That Involve Users
- Example "questions":
  - Does keyword highlighting help users evaluate document relevance?
  - Is letting users weight search terms a good idea?
- Corresponding experiments:
  - Build two different interfaces, one with keyword highlighting and one without; run a user study
  - Build two different interfaces, one with term weighting functionality and one without; run a user study

The Importance of Evaluation
- The ability to measure differences underlies experimental science
  - How well do our systems work?
  - Is A better than B?
  - Is it really?
  - Under what conditions?
- Evaluation drives what to research
  - Identify techniques that work and that don't work
  - Formative vs. summative evaluations

Desiderata for Evaluations
- Insightful
- Affordable
- Repeatable
- Explainable

Summary
- Qualitative user studies suggest what to build
- Decomposition breaks larger tasks into smaller components
- Automated evaluation helps to refine components
- Quantitative user studies show how well everything works together

Outline
- Evaluating the IR black box
  - How do we conduct experiments with reusable test collections?
  - What exactly do we measure?
  - Where do these test collections come from?
- Studying the user and the system
  - What sorts of (different) things do we measure when a human is in the loop?
- Coming up with the right questions
  - How do we know what to evaluate and study?

Types of Evaluation Strategies
- System-centered studies
  - Given documents, queries, and relevance judgments
  - Try several variations of the system
  - Measure which system returns the "best" hit list
- User-centered studies
  - Given several users and at least two retrieval systems
  - Have each user try the same task on both systems
  - Measure which system works "best"

Evaluation Criteria
- Effectiveness
  - How "good" are the documents that are returned?
  - System only, human + system
- Efficiency
  - Retrieval time, indexing time, index size
- Usability
  - Learnability, frustration
  - Novice vs. expert users
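To make the system-centered setup above concrete, here is a minimal, self-contained Python sketch of one such experiment: retrieval with a stemmed index compared against an unstemmed baseline on a toy test collection. The corpus, query, relevance judgments, and the crude suffix-stripping "stemmer" are illustrative stand-ins, not anything from the lecture.

```python
# Minimal sketch of a system-centered experiment: a stemmed index vs. an
# unstemmed baseline on a toy test collection. The corpus, query, judgments,
# and the crude suffix-stripping "stemmer" below are illustrative only.

def crude_stem(token):
    """Strip a few common English suffixes (a toy stand-in for real stemming)."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def build_index(docs, stem):
    """Map each term to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for tok in text.lower().split():
            term = crude_stem(tok) if stem else tok
            index.setdefault(term, set()).add(doc_id)
    return index

def retrieve(index, query, stem):
    """Boolean OR retrieval: return every document matching any query term."""
    hits = set()
    for tok in query.lower().split():
        term = crude_stem(tok) if stem else tok
        hits |= index.get(term, set())
    return hits

# Toy test collection: documents, one information need, relevance judgments.
docs = {1: "evaluating retrieval systems", 2: "systems for indexing", 3: "cooking recipes"}
query = "evaluate retrieval system"
relevant = {1, 2}

for use_stem in (False, True):
    hits = retrieve(build_index(docs, use_stem), query, use_stem)
    precision = len(hits & relevant) / len(hits) if hits else 0.0
    recall = len(hits & relevant) / len(relevant)
    print(f"stemmed={use_stem}: retrieved={sorted(hits)}, P={precision:.2f}, R={recall:.2f}")
```

In this toy run the stemmed configuration retrieves both relevant documents while the baseline misses one; a real experiment would use a full test collection, many topics, and a significance test, as the slides emphasize.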
Good Effectiveness Measures
- Should capture some aspect of what the user wants
  - That is, the measure should be meaningful
- Should have predictive value for other situations
  - What happens with different queries on a different document collection?
- Should be easily replicated by other researchers
- Should be easily comparable
  - Optimally, expressed as a single number

The Notion of Relevance
- IR systems essentially facilitate communication between a user and document collections
- Relevance is a measure of the effectiveness of communication
  - Logic and philosophy present other approaches
- Relevance is a relation… but between what?

What is relevance?
Relevance is the (measure | degree | dimension | estimate | appraisal | relation) of a (correspondence | utility | connection | satisfaction | fit | bearing | matching) existing between a (document | article | textual form | reference | information provided | fact) and a (query | request | information used | point of view | information need statement) as determined by a (person | judge | user | requester | information specialist).
Does this help?

Tefko Saracevic. (1975) Relevance: A Review of and a Framework for Thinking on the Notion in Information Science. Journal of the American Society for Information Science, 26(6), 321-343.

Mizzaro's Model of Relevance
- Four dimensions of relevance
- Dimension 1: Information resources
  - Information
  - Document
  - Surrogate
- Dimension 2: Representation of the user problem
  - Real information need (RIN) = visceral need
  - Perceived information need (PIN) = conscious need
  - Request = formalized need
  - Query = compromised need

Stefano Mizzaro. (1999) How Many Relevances in Information Retrieval? Interacting With Computers, 10(3), 305-322.

Time and Relevance
- Dimension 3: Time
[Figure: a timeline showing the real information need RIN0, perceived information needs PIN0 ... PINm, requests r0 ... rn, and queries q0 ... qr evolving over time]

Components and Relevance
- Dimension 4: Components
  - Topic
  - Task
  - Context

What are we after?
- Ultimately, relevance of the information
  - With respect to the real information need
  - At the conclusion of the information-seeking process
  - Taking into consideration topic, task, and context
  - Rel(Information, RIN, t(f), {Topic, Task, Context})
- In system-based evaluations, what do we settle for?
  - Rel(surrogate, request, t(0), Topic)
  - Rel(document, request, t(0), Topic)

Evaluating the Black Box
[Figure: Query -> Search -> Ranked List]

Evolution of the Evaluation
- Evaluation by inspection of examples
- Evaluation by demonstration
- Evaluation by improvised demonstration
- Evaluation on data using a figure of merit
- Evaluation on test data
- Evaluation on common test data
- Evaluation on common, unseen test data

Automatic Evaluation Model
[Figure: documents and a query feed the IR black box, which produces a ranked list; an evaluation module combines the ranked list with relevance judgments to produce a measure of effectiveness]
These are the four things we need!

Test Collections
- Reusable test collections consist of:
  - A collection of documents
    - Should be "representative"
    - Things to consider: size, sources, genre, topics, …
  - A sample of information needs
    - Should be "randomized" and "representative"
    - Usually formalized topic statements
  - Known relevance judgments
    - Assessed by humans, for each topic-document pair (topic, not query!)
    - Binary judgments make evaluation easier
  - A measure of effectiveness
    - Usually a numeric score for quantifying "performance"
    - Used to compare different systems

Which is the Best Rank Order?
[Figure: six candidate ranked lists, A through F, with the relevant documents marked at different rank positions]

Set-Based Measures
- Precision = A / (A + B)
- Recall = A / (A + C)
- Miss = C / (A + C)
- False alarm (fallout) = B / (B + D)

                  Relevant   Not relevant
  Retrieved          A            B
  Not retrieved      C            D

  Collection size = A + B + C + D
  Relevant  = A + C
  Retrieved = A + B

- When is precision important?
- When is recall important?

Another View
[Figure: Venn view of the space of all documents, showing the relevant set, the retrieved set, their overlap (relevant and retrieved), and the remainder (not relevant and not retrieved)]
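The set-based measures above follow directly from the contingency table. A minimal Python sketch, assuming we know the retrieved set, the relevant set, and the collection size (function and variable names are illustrative):

```python
# Minimal sketch of the set-based measures, computed from the retrieved set,
# the relevant set, and the total collection size.

def set_based_measures(retrieved, relevant, collection_size):
    a = len(retrieved & relevant)       # relevant and retrieved
    b = len(retrieved - relevant)       # retrieved but not relevant
    c = len(relevant - retrieved)       # relevant but not retrieved
    d = collection_size - a - b - c     # neither relevant nor retrieved
    return {
        "precision": a / (a + b) if (a + b) else 0.0,
        "recall": a / (a + c) if (a + c) else 0.0,
        "miss": c / (a + c) if (a + c) else 0.0,
        "fallout": b / (b + d) if (b + d) else 0.0,
    }

# Example: 1000 documents, 10 relevant, the system retrieves 8, of which 6 are relevant.
retrieved = set(range(1, 9))                      # doc ids 1..8
relevant = {1, 2, 3, 4, 5, 6, 20, 21, 22, 23}
print(set_based_measures(retrieved, relevant, collection_size=1000))
# precision = 6/8 = 0.75, recall = 6/10 = 0.60, miss = 4/10 = 0.40, fallout = 2/990 ≈ 0.002
```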
F-Measure
- Harmonic mean of recall and precision
- Beta controls the relative importance of recall and precision
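The preview cuts off at this slide, but the standard formulation behind these two bullets is worth writing out; the equations below are the conventional definition in terms of precision P and recall R, not text recovered from the remaining pages.

```latex
% F-measure in terms of precision P and recall R; beta weights recall relative to precision
F_\beta \;=\; \frac{(1 + \beta^2)\, P \, R}{\beta^2 P + R}

% beta = 1 gives the balanced harmonic mean of precision and recall
F_1 \;=\; \frac{2 P R}{P + R}
```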

