CORNELL CS 630 - Lecture 4: Probabilistic Retrieval

CS630 Lecture 4: Probabilistic Retrieval
Lecture by Lillian Lee
Scribed by Randy Au, Nick Gerner, Blazej Kot
February 7, 2006

1 Probabilistic Retrieval

The Vector Space Model is essentially an ad hoc approach to information retrieval. If we want firmer theoretical ground, we have to take a different approach, starting from a priori assumptions. One such attempt is derived here.

1.1 Notation

Capital letters represent random variables. Lower-case letters represent the values taken by the random variables.

2 The Derivation

Assumption #1: The query, q, is fixed, to keep the derivation simpler.

Our ultimate goal is to find a way to rank documents according to the probability of their relevance to the query, as opposed to the VSM, where relevance is simply how "close" q is to d in a vector space. Informally, we want

    P(relevant | doc)

We therefore want to compute

    P(R = r | D = d)

where R is a random variable for relevance, taking values in a set such as {y, n} or {0, 0.1, ..., 1}, and D is a random variable for documents, which might range over the documents in the corpus, or over all possible documents, real or imagined.

Note: Throughout the derivation, P(D = d) > 0 holds; otherwise we would be conditioning on an event of probability zero, and the universe explodes.

Because the user is most interested in relevant documents, we want to compute

    P(R = y | D = d)

Issue #1: For a given document and a given query, the probability of relevance of that document should be a single value in {0, 1}; there is no randomness in the situation. We need a way to justify injecting random variables into the equations.

Possible Solution #1: We could say that, for a given user on a given day, the document might be judged relevant, while on another day the same document for the same query is judged irrelevant, and this process is random. Or, over all users, some percentage will find the document relevant while some won't. The problem is that this puts human preferences into the picture.

Possible Solution #2: We can represent uncertainty as a function of the document representation. If we bin documents by their representation, some fraction of each bin will be relevant, and that fraction plays the role of a probability.

For this derivation, we proceed with the second solution, to avoid the need to model human preferences.

Let A be a vector of attribute variables:

    A = (A_1, A_2, ..., A_m)^T

Then, for each document d, there is an associated attribute vector a(d):

    a(d) = (a_1(d), a_2(d), ..., a_m(d))^T

For example, a_j(d) may be the number of times that the term v^(j) appears in d. Attributes can be as complex as desired, e.g. a_{m+1}(d) = length(d).

Assumption #2: In this derivation, we assume that A_j corresponds to the appearance of term v^(j) in a document.
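To make the attribute-vector notation concrete, the following is a minimal sketch (not from the lecture) of how a(d) could be computed under Assumption #2, using term counts as attribute values. The small vocabulary (borrowed from Question 2 below) and the example document are purely illustrative.

# Minimal sketch (not from the lecture): building the attribute vector a(d)
# under Assumption #2, where a_j(d) counts occurrences of term v^(j) in d.
# The vocabulary and example document below are illustrative only.
from collections import Counter

vocabulary = ["cat", "dog", "mouse", "chase"]  # hypothetical v^(1), ..., v^(m)

def attribute_vector(doc_text):
    """Return a(d) = [a_1(d), ..., a_m(d)], with a_j(d) = count of v^(j) in d."""
    counts = Counter(doc_text.lower().split())
    return [counts[term] for term in vocabulary]

# Attributes could be richer, e.g. appending length(d) as an extra a_{m+1}(d).
if __name__ == "__main__":
    d = "the dog chased the cat and the cat ran"
    print(attribute_vector(d))  # [2, 1, 0, 0]: "chased" is not the exact term "chase"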
So we want to compute

    P(R = y | A = a(d))

We still have no information about relevance. How can we proceed further, pushing the question of relevance as late in the process as we can?

Assumption #3: Seeing any given attribute vector a(d) has a very low probability. (We can increase the probability somewhat by keeping a(d) short, but we would lose information.) In any event, it seems reasonable to assume that

    P(R = y) > P(A = a(d))

This assumption suggests that we apply a "Bayes flip", using Bayes' Rule to flip the conditional:

    P(R = y \mid A = a(d)) = \frac{P(A = a(d) \mid R = y) \, P(R = y)}{P(A = a(d))}

With respect to computing a score for a document d, P(R = y) is a constant and falls out:

    \stackrel{\text{rank}}{=} \frac{P(A = a(d) \mid R = y)}{P(A = a(d))}

Remember that P(A = a(d)) is a tiny value.

Note: In the classic presentation of this derivation, an odds ratio is used, so the denominator term P(A = a(d)) is replaced by P(A = a(d) | R = n):

    \frac{P(A = a(d) \mid R = y)}{P(A = a(d) \mid R = n)}

In estimation, it is then assumed that most documents aren't relevant anyway, so that

    P(A = a(d) \mid R = n) = P(A = a(d))

which is what we had before.

Assumption #4: The attributes are conditionally independent given relevance, and the attributes are also independent unconditionally.

Issue #2: Cooper, in "Some inconsistencies and misnomers in probabilistic information retrieval" (SIGIR '91), finds counterexamples that yield logical errors when this assumption is used. The formula we use is a version of a formulation for which he was unable to find a counterexample.

    = \alpha \, \frac{\prod_j P(A_j = a_j(d) \mid R = y)}{\prod_j P(A_j = a_j(d))}

α falls out, since it is a constant and we are ranking.

Notice: Taking the product over all term-attributes within a document allows documents corresponding to unobserved attribute vectors to still have non-zero probabilities.

We have done just about all we can to modify the equation while putting the relevance question off. We are now out of tricks and so must deal with it. The simplest step is to factor by whether terms do or do not appear in the query, given that the query is one clue regarding relevance:

    = \underbrace{\prod_{j : q_j = 1} \frac{P(A_j = a_j(d) \mid R = y)}{P(A_j = a_j(d))}}_{\text{when term is in query}} \cdot \underbrace{\prod_{j : q_j = 0} \frac{P(A_j = a_j(d) \mid R = y)}{P(A_j = a_j(d))}}_{\text{when term is not in query}}

Notation:

    q_j = \begin{cases} 1 & \text{if } v^{(j)} \text{ is in } q, \\ 0 & \text{otherwise.} \end{cases}

Assumption #5: Non-query term-attributes have the same distribution over relevant documents as over all documents.

Issue #3: This is probably true for non-query terms that aren't related to the query terms, but is not quite so true for non-query terms that are related to the query terms.

This assumption lets us state, for terms not in the query,

    P(A_j = a_j(d) | R = y) = P(A_j = a_j(d))

which allows us to cancel the non-query terms out of our equation, leaving us with

    \stackrel{\text{assumption}}{=} \prod_{j : q_j = 1} \frac{P(A_j = a_j(d) \mid R = y)}{P(A_j = a_j(d))}

And that's as far as we got in this lecture! (To be continued...)

3 Questions

One assumption we made during the first part of our derivation of the probability of relevance was that non-query attributes have the same distribution over relevant documents as over non-relevant documents. Let's examine this assumption.

In the following, for simplicity, assume that A_j corresponds to term presence/absence rather than term counts.

3.1 Question 1

What does this assumption mean, especially with respect to our derivation? That is, why would we make this assumption, and how does it simplify our derivation? In addition to any mathematical formulae, give a short text answer explaining each probability term and how the terms interact.

3.2 Question 2

Given the following vocabulary of nouns and verbs (and their derivatives; e.g., since dog is in the language, so is dogs, and chase gives us chases, chased, etc.):

V = (cat, dog, mouse, bone, seeds, food, toy, chase, hunt, play, forage, eat, sleep, lay,

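Looking back at the derivation: to get a feel for how the final ranking formula from Section 2 behaves, here is a minimal sketch (not from the lecture) that estimates each probability from a tiny labeled collection, using the presence/absence simplification suggested above for the questions. The toy corpus, the relevance labels, and the add-one smoothing are illustrative assumptions only.

# Minimal sketch (not from the lecture): estimating the ranking score from the end
# of Section 2, with A_j treated as term presence/absence. The corpus, relevance
# labels, and add-one smoothing below are illustrative assumptions.

def present(term, doc):
    """a_j(d) under the presence/absence simplification: 1 if term v^(j) occurs in d, else 0."""
    return 1 if term in doc.lower().split() else 0

def score(doc, query_terms, corpus, relevant_ids):
    """Rank-equivalent score: product over query terms of P(A_j = a_j(d) | R = y) / P(A_j = a_j(d))."""
    relevant = [corpus[i] for i in relevant_ids]
    s = 1.0
    for term in query_terms:
        a_j = present(term, doc)
        # Estimate P(A_j = a_j(d) | R = y) over the relevant documents (add-one smoothing).
        p_rel = (sum(present(term, d) == a_j for d in relevant) + 1) / (len(relevant) + 2)
        # Estimate P(A_j = a_j(d)) over the whole collection (add-one smoothing).
        p_all = (sum(present(term, d) == a_j for d in corpus) + 1) / (len(corpus) + 2)
        s *= p_rel / p_all
    return s

if __name__ == "__main__":
    corpus = [
        "the dog chased the cat",
        "the cat chased the mouse",
        "the dog ate the bone",
        "the mouse ate the seeds",
    ]
    relevant_ids = [0, 2]            # pretend the dog documents were judged relevant
    query_terms = ["dog", "bone"]    # the terms v^(j) with q_j = 1
    ranked = sorted(corpus, key=lambda d: score(d, query_terms, corpus, relevant_ids), reverse=True)
    for d in ranked:
        print(round(score(d, query_terms, corpus, relevant_ids), 3), d)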
