LBSC 796/INFM 718R, Week 2: Evaluation
Jimmy Lin
College of Information Studies, University of Maryland
Monday, February 6, 2006

IR is an experimental science!
- Formulate a research question: the hypothesis
- Design an experiment to answer the question
- Perform the experiment
  - Compare against a baseline “control”
- Does the experiment answer the question?
  - Are the results significant? Or is it just luck?
- Report the results!
- Rinse, repeat…

Questions About the Black Box
- Example “questions”:
  - Does morphological analysis improve retrieval performance?
  - Does expanding the query with synonyms improve retrieval performance?
- Corresponding experiments (see the sketch below):
  - Build a “stemmed” index and compare it against an “unstemmed” baseline
  - Expand queries with synonyms and compare against baseline unexpanded queries
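As an illustration of the first experiment, a minimal sketch follows. The toy documents, the query, and the crude suffix-stripping stemmer are all invented stand-ins; a real experiment would use a full stemmer such as Porter’s and a real test collection.

```python
# Minimal sketch of a stemmed vs. unstemmed comparison.
# The stemmer below is a toy suffix-stripper standing in for a real
# algorithm such as Porter's; documents and query are invented.

def toy_stem(term):
    """Crude stand-in for morphological analysis: strip common suffixes."""
    for suffix in ("ing", "ers", "er", "ed", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            term = term[: -len(suffix)]
            break
    if len(term) > 3 and term[-1] == term[-2]:
        term = term[:-1]  # collapse a doubled consonant, e.g. "stemm" -> "stem"
    return term

def build_index(docs, stem=False):
    """Inverted index: term -> set of doc ids."""
    index = {}
    for doc_id, text in docs.items():
        for term in text.lower().split():
            key = toy_stem(term) if stem else term
            index.setdefault(key, set()).add(doc_id)
    return index

def retrieve(index, query, stem=False):
    """Return docs matching any query term (simple boolean OR)."""
    hits = set()
    for term in query.lower().split():
        key = toy_stem(term) if stem else term
        hits |= index.get(key, set())
    return hits

docs = {
    1: "stemming improves retrieval",
    2: "the stemmer stems terms",
    3: "synonyms expand queries",
}
query = "stem"
baseline = retrieve(build_index(docs), query)                      # unstemmed
stemmed = retrieve(build_index(docs, stem=True), query, stem=True)
print(baseline, stemmed)  # stemmed run matches docs 1 and 2; baseline matches neither
```

The same harness supports the second experiment: swap the stemming step for a query-expansion step and compare the two runs against the unexpanded baseline.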
Questions That Involve Users
- Example “questions”:
  - Does keyword highlighting help users evaluate document relevance?
  - Is letting users weight search terms a good idea?
- Corresponding experiments:
  - Build two different interfaces, one with keyword highlighting and one without; run a user study
  - Build two different interfaces, one with term-weighting functionality and one without; run a user study

The Importance of Evaluation
- The ability to measure differences underlies experimental science
  - How well do our systems work?
  - Is A better than B?
  - Is it really?
  - Under what conditions?
- Evaluation drives what to research
  - Identify techniques that work and techniques that don’t
  - Formative vs. summative evaluations

Desiderata for Evaluations
- Insightful
- Affordable
- Repeatable
- Explainable

Summary
- Qualitative user studies suggest what to build
- Decomposition breaks larger tasks into smaller components
- Automated evaluation helps to refine components
- Quantitative user studies show how well everything works together

Outline
- Evaluating the IR black box
  - How do we conduct experiments with reusable test collections?
  - What exactly do we measure?
  - Where do these test collections come from?
- Studying the user and the system
  - What sorts of (different) things do we measure when a human is in the loop?
- Coming up with the right questions
  - How do we know what to evaluate and study?

Types of Evaluation Strategies
- System-centered studies
  - Given documents, queries, and relevance judgments
  - Try several variations of the system
  - Measure which system returns the “best” hit list
- User-centered studies
  - Given several users and at least two retrieval systems
  - Have each user try the same task on both systems
  - Measure which system works the “best”

Evaluation Criteria
- Effectiveness
  - How “good” are the documents that are returned?
  - System only, human + system
- Efficiency
  - Retrieval time, indexing time, index size
- Usability
  - Learnability, frustration
  - Novice vs. expert users

Good Effectiveness Measures
- Should capture some aspect of what the user wants
  - That is, the measure should be meaningful
- Should have predictive value for other situations
  - What happens with different queries on a different document collection?
- Should be easily replicated by other researchers
- Should be easily comparable
  - Optimally, expressed as a single number

The Notion of Relevance
- IR systems essentially facilitate communication between a user and document collections
- Relevance is a measure of the effectiveness of communication
  - Logic and philosophy present other approaches
- Relevance is a relation… but between what?

What is relevance?
Saracevic’s fill-in-the-blank definition:

  Relevance is the [measure / degree / dimension / estimate / appraisal]
  of a [relation / correspondence / utility / connection / satisfaction / fit / bearing / matching]
  existing between a [document / article / textual form / reference / information provided / fact]
  and a [query / request / information used / point of view / information need statement]
  as determined by a [person / judge / user / requester / information specialist].

Does this help?

Tefko Saracevic. (1975) Relevance: A Review of and a Framework for Thinking on the Notion in Information Science. Journal of the American Society for Information Science, 26(6), 321-343.

Mizzaro’s Model of Relevance
- Four dimensions of relevance
- Dimension 1: Information resources
  - Information
  - Document
  - Surrogate
- Dimension 2: Representation of the user problem
  - Real information need (RIN) = visceral need
  - Perceived information need (PIN) = conscious need
  - Request = formalized need
  - Query = compromised need

Stefano Mizzaro. (1999) How Many Relevances in Information Retrieval? Interacting With Computers, 10(3), 305-322.

Time and Relevance
- Dimension 3: Time
[Figure: the real information need stays fixed (RIN0) while perceived needs (PIN0 … PINm), requests (r0, r1, … rn), and queries (q0, q1, q2, q3, … qr) evolve over time]

Components and Relevance
- Dimension 4: Components
  - Topic
  - Task
  - Context

What are we after?
- Ultimately, relevance of the information
  - With respect to the real information need
  - At the conclusion of the information-seeking process
  - Taking into consideration topic, task, and context
- In system-based evaluations, what do we settle for?

  Rel( Information, RIN, t(f), {Topic, Task, Context} )
  Rel( surrogate, request, t(0), Topic )
  Rel( document, request, t(0), Topic )

Evaluating the Black Box
[Figure: the search engine as a black box, taking a query and returning a ranked list]

Evolution of the Evaluation
- Evaluation by inspection of examples
- Evaluation by demonstration
- Evaluation by improvised demonstration
- Evaluation on data using a figure of merit
- Evaluation on test data
- Evaluation on common test data
- Evaluation on common, unseen test data

Automatic Evaluation Model
[Figure: a query and documents go into the IR black box, which produces a ranked list; an evaluation module compares the ranked list against relevance judgments and outputs a measure of effectiveness]
These are the four things we need: documents, queries, relevance judgments, and a measure of effectiveness (a sketch of this loop follows the next slide).

Test Collections
- Reusable test collections consist of:
  - A collection of documents
    - Should be “representative”
    - Things to consider: size, sources, genre, topics, …
  - A sample of information needs
    - Should be “randomized” and “representative”
    - Usually formalized topic statements
  - Known relevance judgments
    - Assessed by humans, for each topic-document pair (topic, not query!)
    - Binary judgments make evaluation easier
- Measure of effectiveness
  - Usually a numeric score for quantifying “performance”
  - Used to compare different systems
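A minimal sketch of that evaluation loop, assuming binary judgments: the qrels, the runs, and the placeholder score_topic measure are all invented for illustration; the set-based measures defined on the next slides (or rank-based ones) would slot in as the measure of effectiveness.

```python
# Minimal sketch of the automatic evaluation model: compare each system's
# ranked list against human relevance judgments, topic by topic, and
# average a numeric effectiveness score. All data here is invented.

# Known relevance judgments: topic id -> set of relevant doc ids
qrels = {
    "T1": {"d1", "d4"},
    "T2": {"d2"},
}

# Ranked lists produced by two IR "black boxes" for the same topics
runs = {
    "system_A": {"T1": ["d1", "d2", "d4"], "T2": ["d3", "d2"]},
    "system_B": {"T1": ["d3", "d1", "d2"], "T2": ["d2", "d1"]},
}

def score_topic(ranked, relevant):
    """Placeholder measure: fraction of retrieved docs judged relevant."""
    if not ranked:
        return 0.0
    hits = sum(1 for doc in ranked if doc in relevant)
    return hits / len(ranked)

for system, results in runs.items():
    scores = [score_topic(results[topic], qrels[topic]) for topic in qrels]
    print(system, sum(scores) / len(scores))  # mean score over all topics
```

Because the judgments are reusable, any number of system variants can be scored this way without further human effort, which is exactly what makes test collections affordable and repeatable.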
Which is the Best Rank Order?
[Figure: six ranked lists, A through F, with the relevant documents marked at different rank positions]

Set-Based Measures
- Precision = A / (A + B)
- Recall = A / (A + C)
- Miss = C / (A + C)
- False alarm (fallout) = B / (B + D)

                  Relevant   Not relevant
  Retrieved          A            B
  Not retrieved      C            D

  Collection size = A + B + C + D
  Relevant = A + C
  Retrieved = A + B

When is precision important?
When is recall important?
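All four measures follow directly from the contingency table; a small sketch with invented counts:

```python
# Set-based measures from the relevant/retrieved contingency table.
# A = relevant & retrieved, B = not relevant & retrieved,
# C = relevant & not retrieved, D = not relevant & not retrieved.
# The counts below are invented for illustration.

A, B, C, D = 20, 40, 10, 930  # a collection of 1000 documents

precision = A / (A + B)   # fraction of retrieved docs that are relevant
recall = A / (A + C)      # fraction of relevant docs that are retrieved
miss = C / (A + C)        # fraction of relevant docs the system failed to retrieve
fallout = B / (B + D)     # false alarm rate over the non-relevant docs

print(f"precision={precision:.3f} recall={recall:.3f} "
      f"miss={miss:.3f} fallout={fallout:.3f}")
# precision=0.333 recall=0.667 miss=0.333 fallout=0.041
```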
Another View
[Figure: Venn diagram over the space of all documents, showing the relevant set, the retrieved set, their intersection (relevant + retrieved), and the remainder (not relevant + not retrieved)]

F-Measure
- Harmonic mean of recall and precision
- Beta controls the relative importance of recall and precision
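As standardly defined, with P for precision and R for recall (β = 1 gives the balanced harmonic mean; larger β weights recall more heavily):

```latex
F_{\beta} = \frac{(\beta^{2} + 1)\, P \, R}{\beta^{2} P + R},
\qquad
F_{1} = \frac{2 P R}{P + R}
```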