CS630 Lecture 1: Fundamentals of IR
Date: January 26th, 2006
Lecturer: Lillian Lee
Scribes: Ari Rabkin and Victoria Krafft

Administrative note: This lecture largely overlaps with CS430, since that course is not a prerequisite. However, this should not be typical of the course.

Contents
1 Classic Information Retrieval
2 Methods of Evaluation
  2.1 Classic methods
    2.1.1 Precision
    2.1.2 Recall
  2.2 Modern Methods
  2.3 Data for Experiments
3 Vector Space Model
  3.1 Representation
  3.2 Ranking
  3.3 Term Weights
4 Question

1 Classic Information Retrieval

Information Retrieval is a fairly familiar task to most of us. We all probably use Google at least once a day on average, and are therefore familiar with the classic IR retrieval task: we have some query, q, which is an expression of the user's needs. The system compares it with some corpus of documents, C = {d(1), ..., d(n)}. Corpus is Latin for body; the corpus in IR is the body of text within which we're trying to find relevant documents.

The goal of the system is to rank (some of) the documents in C by relevance to (the user's needs as expressed by) q.

Typically, IR practitioners will omit the phrase "the user's needs, as expressed by" and just refer to "relevance to a query". Computers cannot read minds, of course, and so the query is generally all the system can use (at least in the classic, non-relevance-feedback setting). However, it's important to remember that ultimately, the goal is to return things relevant to the user, not just related to the terms in the query.

2 Methods of Evaluation

Before discussing techniques, we will start by mentioning the techniques we use to evaluate them. A general word of advice: don't obsess over standard metrics that may not be suitable to our task, but do think about what we're trying to accomplish, and think about how we'd measure it.

2.1 Classic methods

First, we'll look at the "old style" quality measures of classic IR. These measures were common in older work, but newer ones have been developed recently, for reasons we'll discuss in a bit.

To back up for a bit: in information retrieval, we have some corpus C of documents, and some query q. There is some set R of documents in C that are relevant to the query; the information retrieval system returns some other set S.

Figure 1: A graphic representing one retrieval outcome. Ideally, we want the two sets inside C to be the same.

2.1.1 Precision

One thing we could measure is what fraction of the returned documents are relevant: this is called precision, and we can write it as |S ∩ R| / |S|. We can also think of this as the likelihood that a document is relevant, given that the system returns it.

2.1.2 Recall

Alternatively, we could ask about |S ∩ R| / |R|, the fraction of relevant documents that are returned. This is called recall.
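As a concrete illustration of the two definitions, here is a minimal Python sketch (not from the lecture) that computes both measures directly from the sets S and R; the document identifiers and the particular sets are hypothetical.

def precision(S, R):
    # Fraction of returned documents that are relevant: |S ∩ R| / |S|.
    return len(S & R) / len(S) if S else 0.0

def recall(S, R):
    # Fraction of relevant documents that are returned: |S ∩ R| / |R|.
    return len(S & R) / len(R) if R else 0.0

# Hypothetical example: the system returns S; the truly relevant set is R.
S = {"d1", "d3", "d4", "d7"}
R = {"d1", "d2", "d3"}
print(precision(S, R))  # 2/4 = 0.5
print(recall(S, R))     # 2/3 ≈ 0.667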
2.2 Modern Methods

Those are the classic measures of IR, but they're not very useful for a search engine on the web. For most queries on the web, many hundreds or thousands of documents may be relevant, and we probably never look at more than the first few pages of results; we're not interested in whether there are 800 or 1000 possibly-relevant documents returned, if we found what we wanted in the top 20.

Some of this may be an artifact of how web search engines present results, but even so, it's hard to get excited about recall these days. Even precision is a confusing measure in some sense; if the system returns 1000 results, we probably care a great deal about whether the 30 documents that are most relevant are on the first results page, or the 20th.

As a result, it's becoming more common these days to use metrics like "precision at 5" or "precision at 10": the fraction of the first 5 or 10 documents that are relevant. These are called high-accuracy measures. Because they more accurately reflect the desired behavior from the system, these measures may be more appropriate these days.

2.3 Data for Experiments

How would we measure precision? We just ask a human "is this document relevant?"! The standard data sets in IR, such as the TREC corpora, include sets of queries, and the documents are coded with relevance judgments with respect to these queries. TREC is a series of annual text retrieval conferences; among other things, these featured "bake-offs" between search engines: the search systems would be tested on these standard corpora. These data sets are widely available, and free to use, and are still standard at SIGIR conferences (SIGIR is the main conference for information retrieval).

3 Vector Space Model

We're trying to find documents with content similar to that expressed by the query. This poses two questions: how do we represent content, and how do we measure similarity?

The classic techniques use what's called the Vector Space Model (VSM); this is also the basis for modern systems. It was developed largely by Gerry Salton, here at Cornell. Cornell has a reputation as a theory school, but the VSM is actually highly empirical, and not very theoretical. It is an "ad hoc" approach, where "ad hoc" should be taken in the literal sense, "for this": in contrast to model-driven approaches, the details of VSM approaches are based largely on intuition and experimentation. It's a simple paradigm, and the computing operations are efficient.

3.1 Representation

The VSM represents documents as vectors in an abstract space. Suppose we have some vocabulary V = {v(1), v(2), ..., v(m)}, where each v(j) is a content-bearing term. Content-bearing terms are not necessarily the same as lexical space-separated words. We might want "data base" to be a single vocabulary term, and we might not want numbers like 4538 to be considered vocabulary terms.

For every document d ∈ C, and for every v(j) ∈ V, we compute a measure of how representative v(j) is for d. We can then represent d as a document vector ~d …
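To make the representation concrete, here is a minimal sketch of building such a document vector, under the simplifying assumption that raw term counts serve as the measure of how representative a term is for a document (term weights are treated later in the notes); the vocabulary, document, and names below are made up for illustration.

# Made-up vocabulary of content-bearing terms.
V = ["information", "retrieval", "database", "query", "ranking"]

def doc_vector(doc_tokens, vocab):
    # One entry per vocabulary term: here, simply how often the term occurs in the document.
    return [doc_tokens.count(term) for term in vocab]

# Hypothetical document, tokenized naively by whitespace.
d = "users issue a query and the retrieval system returns a ranking of documents".split()
print(doc_vector(d, V))  # [0, 1, 0, 1, 1]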

