CS 630 Lecture 1: 01/26/2006
Lecturer: Lillian Lee
Scribes: Asif-ul Haque, Benyah Shaparenko

This lecture focuses on the following topics:

• Information retrieval fundamentals
• Vector Space Model (VSM)
• Deriving term weights in the VSM

1 Information retrieval fundamentals

1.1 "Classic" retrieval setting

In the "classic" information retrieval setting, there is a query q, which is an expression of the user's needs. There is also a "corpus" (body) of documents

    C = {d(1), d(2), ..., d(n)}

The goal is to rank (some of) the documents in C by relevance to [the user's need as indicated by] q.

1.2 Quality measures

We also need quality measures to perform evaluation. Given a query q, suppose a system outputs the set O of documents, and let R be the set of documents that are relevant to q. There are two well-known measures.

• Precision answers what proportion of the output documents is relevant to the query. Mathematically, precision is computed by

    |O ∩ R| / |O|

• Recall answers what proportion of the relevant documents was obtained by the system. Mathematically, recall is

    |O ∩ R| / |R|

In modern systems and document collections, there are thousands of relevant documents for a query. So |R| can, conservatively, range from 1k to 10k, while the number of output documents that are actually reviewed is only 10-20. Thus the value of recall is very small, and recall is not an appropriate measure in this setting. The notion of precision can be changed somewhat as well to get a "high-accuracy measure": precision @ 10, for example, asks how many of the top 10 output documents are relevant to the query.

1.3 Measuring precision

How do we know which documents are relevant to a given query?
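Before turning to that question: the precision, recall, and precision @ k definitions above can be sketched directly in code. This is only an illustrative sketch; the function names and the toy documents are my own, not from the lecture.

```python
def precision(output, relevant):
    """|O ∩ R| / |O|: fraction of output documents that are relevant."""
    return len(set(output) & set(relevant)) / len(output)

def recall(output, relevant):
    """|O ∩ R| / |R|: fraction of relevant documents that were retrieved."""
    return len(set(output) & set(relevant)) / len(relevant)

def precision_at_k(ranked_output, relevant, k=10):
    """Precision @ k: fraction of the top-k ranked documents that are relevant."""
    top_k = ranked_output[:k]
    return len(set(top_k) & set(relevant)) / k

# Toy example: the system returns 4 documents, 2 of which are relevant,
# out of 3 relevant documents in total.
O = ["d1", "d2", "d3", "d4"]
R = ["d2", "d4", "d9"]
print(precision(O, R))           # 0.5
print(recall(O, R))              # 2/3 ≈ 0.667
print(precision_at_k(O, R, 2))   # top 2 are d1, d2; only d2 relevant -> 0.5
```

Note how recall depends on |R|: if R contained thousands of documents, recall over a 10-20 document output would be tiny regardless of output quality, which is the point made above.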
We can do this in two ways.

• We can get a lot of data from the "Text REtrieval Conference" (TREC) database, or similar relevance-annotated collections, which we can use as "evaluation corpora" to determine relevance.
• We can perform relevance judgments in our own system.

2 Vector Space Model

The goal of the system is to retrieve documents whose content is similar to that expressed by q. So we have two issues here:

1. How to represent the content of a document
2. How to compute the similarity

The basis for modern systems is the Vector Space Model (VSM). Gerry Salton of the CS department here was the pioneer of this empirical ("ad hoc") approach, as compared to model-driven approaches.

2.1 Representation

Let V be the vocabulary of content-bearing terms:

    V = {v(1), v(2), ..., v(m)}

A term can have multiple words; for example, in "data base," the two words in the phrase hold some meaning collectively. A term can also be a sequence of digits like "34539753671" that expresses some meaning, like a phone number, ISBN, etc. For every document d ∈ C and for every v(j) ∈ V, we compute a weight for how representative v(j) is of d's content. So for every document d we get a document vector

    d = (d1, d2, ..., dm)^T

For example, if v(1) = "dogs" and v(2) = "cats," a document vector d(14) = (8, 1.2)^T means that the document's relevance measure for "dogs" is 8 and for "cats" is 1.2. Similarly, the vectors d(20) = (7, 2)^T and d(21) = (1, 6)^T are representations of two other documents. So each document is a vector in R^m, and we can measure "similarity" by the closeness of the vectors in space. Note that d(20) and d(14) are clearly closer to each other than they are to d(21).

2.2 Retrieval

There are many different possibilities for score functions for performing ranked retrieval, but they all tend to measure the "match" between d and q.
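The toy "dogs"/"cats" vectors above can be written down directly, and closeness in space can be checked numerically. Euclidean distance is used here only as one possible notion of closeness; this is an illustrative sketch, not the lecture's scoring function (which is developed next).

```python
import math

# Document vectors in R^2 with v(1) = "dogs", v(2) = "cats" (from the example).
d14 = (8.0, 1.2)
d20 = (7.0, 2.0)
d21 = (1.0, 6.0)

def euclidean(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((ui - vi) ** 2 for ui, vi in zip(u, v)))

print(euclidean(d14, d20))  # ≈ 1.28: the two "dog-heavy" documents are close
print(euclidean(d14, d21))  # ≈ 8.49
print(euclidean(d20, d21))  # ≈ 7.21
```

This confirms the claim above: d(14) and d(20) are much closer to each other than either is to d(21).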
The similarity here can be measured by the inner product

    d · q = Σ_{j=1}^{m} dj qj = ||d||2 ||q||2 cos(∠(d, q))

which holds when ||d||2, ||q||2 > 0, and where dj represents whether term j is good for document d and qj represents whether term j is good for the query. A "0" value for a norm could be because of an empty query or document, but could also result from a document containing only out-of-vocabulary terms.

Since ||q||2 is identical for all documents, the ||q||2 factor is irrelevant with respect to ranking. So we have that

    Match Score  rank=  ||d||2 cos(∠(d, q))

where "rank=" denotes equality for ranking purposes. Intuitively, the cos term asks whether q and d point in the same direction. The ||d||2 term captures:

1. whether the document has lots of words
2. whether the document has very representative terms, i.e., "dogmaticity" (which is probably not a word)

This setup is the retrieval paradigm that we will use. But where are all the dj terms in these vectors coming from?

3 Term-weighting question

Recall several definitions. The notation dj denotes how much v(j) represents d's content. Ideally, somebody would read the documents and create weights for the terms, but if you are a computer and cannot read, then what?

The goal is vague, so there isn't just one right way to come up with term weights. However, through many years of experience, there is a consensus about what should be included when doing term weighting. In fact, there are three important factors for the weighting function.

3.1 Representativeness is corpus-dependent

Some terms are more important than other terms, and their representativeness depends on the context of those terms.
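A minimal sketch of ranked retrieval with the inner-product score, reusing the toy vectors from the example above. It also checks numerically that dividing by the constant ||q||2 leaves the ranking unchanged, as argued. The helper names and the query vector are illustrative assumptions.

```python
def dot(u, v):
    """Inner product: sum over j of dj * qj."""
    return sum(ui * vi for ui, vi in zip(u, v))

def norm(u):
    """Euclidean (L2) norm."""
    return sum(ui * ui for ui in u) ** 0.5

def rank(docs, q, score):
    """Return document ids sorted by descending score against query q."""
    return sorted(docs, key=lambda name: score(docs[name], q), reverse=True)

docs = {"d14": (8.0, 1.2), "d20": (7.0, 2.0), "d21": (1.0, 6.0)}
q = (3.0, 1.0)  # a hypothetical query leaning toward v(1) = "dogs"

by_inner = rank(docs, q, lambda d, qv: dot(d, qv))
# Same score with the constant ||q||2 factor removed: ||d||2 cos(angle).
by_d_norm_cos = rank(docs, q, lambda d, qv: dot(d, qv) / norm(qv))

print(by_inner)                    # ['d14', 'd20', 'd21']
print(by_inner == by_d_norm_cos)   # True: dropping ||q||2 preserves ranking
```

Dividing every score by the same positive constant is a monotone transformation, which is exactly why ||q||2 is "irrelevant with respect to ranking."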
For example, given two terms, "the" and "computer," the term "computer" seems to be more important; but if all the documents in the corpus at hand contain the term "computer," then it is of no use whatsoever when comparing documents.

3.1.1 An example

Let q = "computer modeling of neural processes." Let d(1) = "computer chess" and d(2) = "simulating neural firing." Since q shares one term in common with each of d(1) and d(2), the two documents would be considered equally relevant if we only considered term overlap. However, this doesn't make sense: although there are the same number of term matches, the documents should be ranked differently. In fact, d(2) is quite relevant even though it happens to have used different terms, namely "simulating" for "modeling" and "firing" for "processes." So, we need to factor the rarity of the term v(j) across the corpus into the weights.

Note that if d(1) were the only document in the corpus that mentioned computers, then it might be possible that d(1) is highly relevant, in that the user might be specifically querying a non-computer-science collection looking for an application of computers.

3.1.2 IDF

One way of measuring the rarity of a term is by dividing by the number of documents containing v(j), to get something called "inverse document frequency." This will henceforth be called simply idf. It is based
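The notes are cut off here, so the exact idf variant the lecture goes on to derive is not shown. As a hedged sketch of the idea just stated (rarity via division by the number of documents containing the term), one common convention divides the corpus size n by the document frequency and takes a log; the log is my own conventional choice, not necessarily the lecture's.

```python
import math

def document_frequency(term, corpus):
    """Number of documents in the corpus that contain the term."""
    return sum(1 for doc in corpus if term in doc)

def idf(term, corpus):
    """One common inverse-document-frequency variant: log(n / df)."""
    df = document_frequency(term, corpus)
    return math.log(len(corpus) / df) if df else 0.0

# Toy corpus of term sets, echoing the example above.
corpus = [
    {"computer", "chess"},
    {"simulating", "neural", "firing"},
    {"computer", "modeling", "of", "neural", "processes"},
]

print(idf("computer", corpus))  # lower: appears in 2 of 3 documents
print(idf("firing", corpus))    # higher: appears in only 1 document
```

The rarer term "firing" gets the larger weight, which is the behavior the example in 3.1.1 calls for: a term shared with few documents should count for more than a term shared with many.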

