CS 630 Lecture 4: 02/07/2006
Lecturer: Lillian Lee
Scribes: Peter Babinski, David Lin
________________________________________________________________________

Probabilistic Retrieval

I. Naïve Beginnings
   a. Motivations
   b. "False Start": A Probabilistic Model without Variation?
II. Formulation
   a. Terms and Definitions
   b. Making Stuff Computable
III. Related Questions

I. Naïve Beginnings

a. Motivations

After a few days of talking about the vector space model and convincing ourselves that such an ad hoc approach works, it seems strange to take a step back and decide to start over. However, it is clear that in defining the vector space model, we made a few potentially controversial judgments. For example, why does a bag-of-words feature vector make sense any more than some other feature vector? Why not use feature vectors that track the counts of the ten most common words, the lengths of those words, the number of paragraphs, and the number of idioms used in a document? Why do we use term frequency (TF) and inverse document frequency (IDF) in the similarity function we use to rank documents by their relation to a query? Questions of this type led researchers to consider a probabilistic model.

Credits: The researchers who developed the probabilistic model for information retrieval include Robertson, Spärck Jones, and Walker, among many others.

Goals:
• Gain a better theoretical understanding of information retrieval.
• Obtain state-of-the-art performance.

b. "False Start": A Probabilistic Model without Variation?

Note: For the following discussion, unless otherwise stated, we assume that q, the query, is fixed.

We start by mentioning that the fundamental principle of our model is to rank documents by the probability that they are relevant, noting that this idea of relevance is very fuzzy.
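This ranking principle can be sketched in a few lines. Assume, purely hypothetically, that we already had an estimate of each document's relevance probability (the document IDs and numbers below are made up for illustration); ranking is then just a sort:

```python
# Minimal sketch of ranking by estimated P(relevant | document).
# The (document, probability) pairs are hypothetical placeholders;
# where such estimates come from is exactly the question the rest
# of the lecture addresses.
def rank_by_relevance_probability(docs_with_probs):
    # Sort highest estimated relevance probability first.
    return sorted(docs_with_probs, key=lambda pair: pair[1], reverse=True)

docs = [("d1", 0.2), ("d2", 0.9), ("d3", 0.5)]
print(rank_by_relevance_probability(docs))  # d2 first, then d3, then d1
```

The hard part, of course, is not the sort but obtaining the probabilities themselves.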
In other words, or rather without using English words, we want to compute the following:

(1) P(relevant | document)

Only a few seconds into our discussion, we see a glaring problem: where are the random variables in this equation? For those who do not have much experience with random variables, we mention the following conventions:
• Use capital letters for random variables.
• Use lower-case letters for values of a random variable (r.v.).

Ex. We can declare the r.v. X to be the outcome of a fair coin flip and can compute the probability P(X = x), where x ∈ {Heads, Tails}.

In response to our lack of random variables, we can assign a random variable to the relevance, R, and to the document, D, and rewrite equation (1) as the following:

(2) P(R = r | D = d)

Now that we have random variables, we have to decide what possible values they can take. For example, we can let R take on the values r ∈ {yes, no} or r ∈ {0, .1, .2, ..., .9, 1}. Whether we use the former or the latter, there is still no variation: in the former case, we already know whether a document is relevant or not, and in the latter case there is still no difference, as we already know whether a document has a relevance rating of .5 or 0.

We also need to define what possible values D can take. We can either use d to represent a document in a given corpus, C, or a document from the space of all possible documents. We also need to take care that P(D = d) > 0 for all documents d, regardless of which option we choose.

Regardless of our choices here, we still have not managed to introduce variation into our probability formula, (2). Let's consider a few of the possible sources of variation that we could add:

• Variation in a specific user's judgment of whether a document is relevant or not. The user may change his mind about whether a document is relevant based on when you ask him. For example, the user may use the same term to mean different things at different times.
• Variation over different users [Maron and Kuhns (1960)].
Each user may have a different opinion on whether a document is relevant or not to a given query. The main concern is how we would characterize this variation without some model of human preferences and beliefs.
• Variation due to the document representation [Robertson and Spärck Jones (1976)]. Some documents with a certain representation are relevant, while others are not. This idea stems from the fact that if we use a document representation that records terms without the language constructs or context they appear in, then some documents with a certain feature vector may be more relevant to a query than other documents with the same feature vector. For example, a document containing the phrase "the White House is" has a different topic than a document that only contains the phrase "the house is white".

For lack of a better choice, and since using reduced document representations is fairly standard practice in IR, we will take the route of introducing variation in relevance due to the document representation. Still, is there another source of variation we could introduce that we aren't considering?

II. Formulation

Note: For the following discussion, unless otherwise stated, we assume that q, the query, is fixed.

a. Terms and Definitions

Previously, we decided that we would introduce variation into our probabilistic retrieval model by allowing a document with a certain representation to be relevant while another document with the same representation is not. This feature of our model motivates our ensuing discussion on "binning" the documents into classes.

We will represent each document with a set of attributes:

A = (A_1, ..., A_m)^T, where A_j is an r.v. for attribute j's value.

Therefore, for each document d, there is an associated vector:

a(d) = (a_1(d), ..., a_m(d))^T

Ex.
a_j(d) = the number of times term v(j) occurs in d,
or
a_j(d) = 1 if v(j) occurs in d, 0 otherwise
[or even a_{m+1}(d) = length(d)]

We can already see that there may be some problems with ambiguity in this representation.

Ex. Take the documents "white house" = d(1) and "house white" = d(2). If the query is asking for documents dealing with U.S. politics, then only document (1) is relevant, even though a(d(1)) = a(d(2)).

Using our attribute vectors, we can rewrite our probability formula from equation (2) as the following:

(3) P(R = yes | A = a(d))

Some immediate concerns arise, however. Wasn't our choice of features for this representation of the document
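The binning idea above can be sketched concretely. In this toy example (the corpus and relevance judgments are made up), two documents share the same binary attribute vector but differ in relevance, so the natural estimate of P(R = yes | A = a(d)) for that bin lies strictly between 0 and 1; that is exactly the variation we set out to introduce:

```python
from collections import defaultdict

def representation(doc):
    # Binary bag of words: a_j(d) = 1 if term v(j) occurs in d, else 0.
    # A frozenset of terms is an equivalent, hashable encoding.
    return frozenset(doc.lower().split())

# (document, relevant to a hypothetical U.S.-politics query?)
corpus = [
    ("the white house is", True),   # about U.S. politics
    ("the house is white", False),  # about a white-colored house
]

bins = defaultdict(lambda: [0, 0])  # representation -> [#relevant, #total]
for doc, rel in corpus:
    rep = representation(doc)
    bins[rep][0] += int(rel)
    bins[rep][1] += 1

for rep, (n_rel, n_total) in bins.items():
    # Fraction of relevant documents in the bin: an estimate
    # of P(R = yes | A = a(d)) for documents in this bin.
    print(sorted(rep), n_rel / n_total)  # one bin, estimate 0.5
```

Both documents map to the same set of terms {house, is, the, white}, so they fall into one bin with one relevant document out of two.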

