CORNELL CS 630 - Representing and Accessing Digital Information


Lecture 6: Feb 14, 2006
Scribes: Gilly Leshed, N. Sadat Shami

Outline
1. Review
2. Mixture of Poissons (2-Poisson) model
3. BM25/Okapi method
4. Relevance feedback

1. Review

In discussing probabilistic models for information retrieval we covered the Robertson and Spärck Jones (1976) scoring function (a.k.a. the RSJ model). Our version of the RSJ scoring function for a document d is

    score(d) = ∏_j P(A_j = a_j(d) | R = y) / P(A_j = a_j(d) | R = n),

which is equivalent for ranking purposes to the log thereof:

    score(d) = ∑_j log [ P(A_j = a_j(d) | R = y) / P(A_j = a_j(d) | R = n) ].

Here we assumed binary A_j and, following Croft & Harper, assumed that P(A_j = a_j(d) | R = y) is the same for all query terms [1]. This yielded an idf term-weighting scheme. We then tried something more complicated and came up with the 'Isolated Poisson model', where adopting some reasonable assumptions allowed us to obtain a tf × idf scheme. We will now tweak the probabilistic model even further with the 'Mixture of Poissons' model (commonly known as the 2-Poisson model) and improve performance.

2. Mixture of Poissons model

To understand the assumptions behind the 'Mixture of Poissons' model we need to comprehend the idea of 'eliteness', originally proposed by Harter [2]. Occurrences of a term in a document have a random or stochastic element, which nevertheless reflects a real but hidden distinction between those documents that are "about" the topic represented by the term and those that are not. Documents that are "about" this topic are described as "elite" for the term. The Mixture of Poissons model assumes that the distribution of within-document frequencies is Poisson for the elite documents, and also Poisson (but with a different mean) for the non-elite documents.

We assume term v(j) corresponds to some topic. The question then arises: why does v(j) occur in document d? There are two cases.

Case 1: d is about v(j)'s topic, i.e., d is 'elite' for v(j).
Case 2: the occurrence is incidental.

We then look for the degree of match between the query and d. Within the RSJ model, we can introduce a binary random variable

    T_j(d) ≡ T_j = y if d is about v(j)'s topic, and T_j = n otherwise.

The fact that P(B) = ∑_z P(Z = z, B) conveniently allows us to introduce the random variable T_j into RSJ:

    P(A_j = a | R = y) = ∑_{t ∈ {y, n}} P(A_j = a, T_j = t | R = y)
                       = ∑_{t ∈ {y, n}} P(A_j = a | T_j = t, R = y) P(T_j = t | R = y).

This is essentially a generative model wherein the following independence assumption is justified.

Independence assumption: A_j is conditionally independent of R given T_j, since if the author knows what the document is about, whether they use a term does not depend on whether the document is relevant to the user. Thus P(A_j = a | R = y) can be rewritten as

    ∑_{t ∈ {y, n}} P(A_j = a | T_j = t) P(T_j = t | R = y),

which allows us to drop one occurrence of a random variable.

Let P(A_j = a | T_j = y) ~ Poisson(τ), where τ is the expected number of occurrences of term v(j) in a document on v(j)'s topic, for a given document length. Similarly, let P(A_j = a | T_j = n) ~ Poisson(µ) for off-topic documents, with µ < τ: if a document is about the topic associated with a term, we expect to see the term more frequently than if it is not.

But we don't know µ, τ, or P(T_j = y | R = y), and we pick up an additional unknown via P(A_j = a). If we plug all this into the RSJ model, we get a very big expression that is a function of the a_j's.

As a_j → ∞ this expression is monotone increasing toward an asymptote: for large counts the elite Poisson dominates the mixture, so the weight saturates. If T_j = y and R = y were perfectly correlated, the asymptote would be the RSJ weight (idf). Thus the unwieldy formula can be used to suggest a much simpler one; we can approximate the weighting function as

    ( a_j / (K + a_j) ) × idf_j,

where K is an unknown constant.

Length effects

We assumed fixed document length in order to use the Poisson as a suitable model. According to Robertson & Walker, there are at least two reasons why documents may vary in length [3]: some documents simply cover more topics than others, while an opposite view is that some documents are longer because they simply use more words. It seems likely that real document collections contain a mixture of these effects. The simplest way to take document length into account is to normalize for it, replacing the constant with

    K = k1 ( (1 − b) + b · dl / avdl ),

where dl is the document's length, avdl is the average document length, and k1 and b are free parameters.

3. BM25/Okapi model

The symmetry of the retrieval situation as between documents and queries suggests that we could treat within-query term frequency in a similar fashion to within-document term frequency. A query-term weighting function can thus be written as

    qtf_j / (k3 + qtf_j),

where k3 is another constant. Combining all of these leads to the BM25/Okapi weighting scheme. Virtually all the major systems in TREC now use the BM25/Okapi formula. The following Okapi weighting-based document score is taken from [4]:

    score(D, Q) = ∑_{t ∈ Q ∩ D}  ln( (N − df + 0.5) / (df + 0.5) )
                  × ( (k1 + 1) tf ) / ( k1 ((1 − b) + b · dl/avdl) + tf )
                  × ( (k3 + 1) qtf ) / ( k3 + qtf )

where
    tf   = the term's frequency in the document
    qtf  = the term's frequency in the query
    N    = the total number of documents in the collection
    df   = the number of documents that contain the term
    dl   = the document length (in bytes)
    avdl = the average document length

k1 (between 1.0 and 2.0), b (usually 0.75), and k3 (between 0 and 1000) are constants.
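To make the formula concrete, here is a minimal sketch of the Okapi score in Python; the toy corpus, whitespace tokenization, word-based (rather than byte-based) lengths, and parameter values are illustrative assumptions, not anything from the lecture.

    import math
    from collections import Counter

    def bm25_score(query, doc, docs, k1=1.2, b=0.75, k3=8.0):
        # Okapi BM25 score of `doc` for `query` against the collection `docs`.
        N = len(docs)
        avdl = sum(len(d.split()) for d in docs) / N  # average length (words here)
        dl = len(doc.split())
        tf_counts = Counter(doc.split())
        score = 0.0
        for t, qtf in Counter(query.split()).items():
            tf = tf_counts[t]
            if tf == 0:
                continue  # the sum runs only over terms present in both Q and D
            df = sum(1 for d in docs if t in d.split())
            idf = math.log((N - df + 0.5) / (df + 0.5))
            tf_part = ((k1 + 1) * tf) / (k1 * ((1 - b) + b * dl / avdl) + tf)
            qtf_part = ((k3 + 1) * qtf) / (k3 + qtf)
            score += idf * tf_part * qtf_part
        return score

    docs = ["the quick brown fox", "poisson mixtures in retrieval",
            "a fox and a dog", "term weighting schemes"]
    print(bm25_score("poisson retrieval", docs[1], docs))

Note that the tf factor saturates toward k1 + 1 as tf grows, which is exactly the asymptotic behavior the 2-Poisson analysis suggested.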
4. Relevance Feedback

Let's rethink the entire 'probability model' approach. Looking back on it, the lack of relevance information has been a big stumbling block. Can we make the hidden variables not hidden? The relevance feedback approach proposed by Rocchio in 1965 provides an avenue for this [5]. There are two ways to go about it: (a) an explicit approach and (b) an implicit approach. The explicit approach obtains query-dependent relevance labels from the user; its drawback is that it is expensive for the user. The implicit approach uses clickthrough data, email, document and webpage views, etc.; its drawback is that there are many privacy concerns about providing access to personal documents and web-browsing behavior. Perhaps we will
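As a sketch of how explicit feedback can be used once relevance labels are in hand, Rocchio's classic vector-space update moves the query toward judged-relevant documents and away from judged-non-relevant ones. The α, β, γ values and the toy term-weight vectors below are illustrative assumptions, not anything specified in these notes.

    def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
        # Rocchio update: each vector is a dict mapping term -> weight.
        def centroid(docs, t):
            return sum(d.get(t, 0.0) for d in docs) / len(docs) if docs else 0.0
        terms = set(query)
        for d in relevant + nonrelevant:
            terms |= set(d)
        return {t: alpha * query.get(t, 0.0)
                   + beta * centroid(relevant, t)
                   - gamma * centroid(nonrelevant, t)
                for t in terms}

    q = {"poisson": 1.0, "retrieval": 1.0}
    rel = [{"poisson": 0.8, "mixture": 0.6}]   # documents the user judged relevant
    nonrel = [{"fox": 0.9, "dog": 0.7}]        # documents the user judged non-relevant
    print(rocchio(q, rel, nonrel))

Terms from relevant documents (here "mixture") enter the expanded query with positive weight, while terms from non-relevant documents receive small negative weights.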

