CS 674/INFO 630: Advanced Language Technologies, Fall 2007
Lecture 8 (Part 2) — September 20, 2007
Prof. Lillian Lee
Scribe: Vasumathi Raman + additional edits by L. Lee

Note: Additional edits were made by L. Lee to correct a lecturer-induced problem, and to make Part 2 consistent with Part 1.

4 Approximation

Given the exponentially decreasing numerator and denominator and the upper and lower limits proved, and assuming that $T_j = y$ and $R_q = y$ are positively correlated, the function under consideration increases monotonically with $d[j]$ to an asymptote at the idf. But our function still has too many unknowns, so the easiest way out is to approximate it. Do we know a function that has this curve? Robertson and Walker did [RW94]. They used the function

$$ \frac{d[j]}{k + d[j]} \times idf \qquad (1) $$

for some tunable constant $k$. This function behaves correctly in the "right-hand" limit: it increases monotonically to an asymptote at idf. While this may seem somewhat like a magic trick, it does work in practice.

Notice that this scoring function has the form tf × idf. Thus we have a theoretically (arguably) justified derivation of a score function that has both tf and idf terms. Cynics may complain about the hand-waving and the ad hoc assumptions that were made along the way. Nevertheless, this function has been used in practice and shown to work, and it is a good example of practice motivated by theory, in which a relatively sophisticated mixture-of-Poissons model was found to support the tf × idf intuition.

4.1 Normalization

Recall that we have thus far assumed that documents have the same length, due to our use of Poisson distributions. We still need some sort of normalization to account for varying document lengths. If we suppose that the constant $k$ in (1) is "appropriate" for documents of average length, then we can use

$$ \frac{d[j]}{k \times \frac{\text{length of document } d}{\text{average document length}} + d[j]} \times idf \qquad (2) $$

4.2 BM25

There are several more sophisticated variations of this score function; some of them can be found in [Sin01]. Robertson et al. [RWHB+92] introduced a family of such scoring functions called "Best Match ij" or "BMij", where $i, j \in \{0, \ldots, 9\}$, which differ slightly from each other in their parameters and constants. BM25, which stands for "Best Match, version 25", is the most famous of these functions, and very widely used. It is defined as follows:

$$ \sum_{t \in Q, D} \ln\frac{N - df + 0.5}{df + 0.5} \cdot \frac{(k_1 + 1)\,tf}{k_1\left((1 - b) + b\,\frac{dl}{avdl}\right) + tf} \cdot \frac{(k_3 + 1)\,qtf}{k_3 + qtf} \qquad (3) $$

where
  tf   is the term's frequency in the document,
  qtf  is the term's frequency in the query,
  N    is the total number of documents in the collection,
  df   is the number of documents that contain the term,
  dl   is the document length (in bytes), and
  avdl is the average document length.

The document-length normalization term in BM25 is $k_1\left((1 - b) + b\,\frac{dl}{avdl}\right)$. This looks a lot like pivoted document length normalization [SBM96]: selecting the average old normalization as the pivot, the final normalization factor there reduces to

$$ (1.0 - slope) + slope \times \frac{\text{old normalization}}{\text{average old normalization}}. $$

While in pivoted document length normalization we use the old and average normalizations to get the new one, here we use the current and average document lengths.

$k_1$ controls the influence of tf on the overall function: as $k_1$ grows, the tf term in the function grows, influencing the ranking more. $b$ controls length normalization: as $b$ grows larger, the normalization depends more on $\frac{dl}{avdl}$. Thus, the larger the value of $b$, the more we penalize longer documents. These parameters can be tuned based on properties of the corpus.
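To make equation (3) concrete, here is a minimal sketch of a BM25 scorer in Python. This is illustrative rather than code from the lecture: the function name bm25_score, its argument layout, and the default parameter values are assumptions chosen for the sketch, and it presumes the corpus statistics (N, df, dl, avdl) have already been computed.

```python
import math

def bm25_score(query_tf, doc_tf, N, df, dl, avdl, k1=1.2, b=0.75, k3=8.0):
    """Score one document against one query with BM25 as in equation (3).

    query_tf : dict, term -> frequency of the term in the query (qtf)
    doc_tf   : dict, term -> frequency of the term in the document (tf)
    N        : total number of documents in the collection
    df       : dict, term -> number of documents containing the term
    dl       : length of this document
    avdl     : average document length over the collection
    """
    score = 0.0
    for t in query_tf:
        if t not in doc_tf:
            continue  # only terms in both the query and the document contribute
        tf, qtf = doc_tf[t], query_tf[t]
        # idf component: ln((N - df + 0.5) / (df + 0.5))
        idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5))
        # Saturating tf component with document-length normalization:
        # the d[j]/(k + d[j]) shape of equation (1), with k scaled by
        # dl/avdl as in equation (2).
        tf_part = ((k1 + 1) * tf) / (k1 * ((1 - b) + b * dl / avdl) + tf)
        # Query-side analogue: the query treated as a (short) document.
        qtf_part = ((k3 + 1) * qtf) / (k3 + qtf)
        score += idf * tf_part * qtf_part
    return score
```

Note how the two limiting cases of $b$ behave in this sketch: with $b = 0$ the $\frac{dl}{avdl}$ factor drops out entirely (no length penalty), while with $b = 1$ the denominator scales fully with $\frac{dl}{avdl}$, the maximal penalty on long documents discussed above.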
In practice, $b$ is usually set at 0.75 and $k_1$ is between 1.0 and 2.0.

The odd-looking last term with the qtf seems to be a result of treating the query as a document. The parameter $k_3$ plays the same role for the query as $k_1$ does for the document: if $k_3$ is higher, qtf (the frequency of term $t$ in the query $q$) has greater influence on the ranking. We can thus tune $k_3$ based on how much importance we want to give to repetition of terms in the query.

4.3 Critique of the Probabilistic Model of IR

Let us now take a step back to evaluate the Probabilistic Model (PM) approach. The probabilistic framework is flexible: it allows for different levels of mathematical sophistication depending on how much information is available. For example, we were able to model term frequency in different ways (binary vs. following a Poisson distribution) depending on how much information we made available to the model (or made assumptions about). We were able to introduce new information if we had some knowledge about the relevance variable $R_q$ (through human evaluations, for example). We pay for this flexibility with the presence of more unknowns in the scoring function as the model gets more sophisticated. We also had to make a lot of rather precarious approximations to simplify the RSJ model.

Note also that the query $q$ was not handled explicitly within the probabilistic framework. Our next approach differs from it in this respect.

5 Language Modeling Approach – the Third Framework of IR

This approach is due to [PC98]. The motivation here can be found in [LZ03], and involves explicitly modeling both document selection and query generation.

We define:
  Q : a random variable over queries (based on the user)
  D : a random variable over documents (based on the author/corpus)
  R : a (binary) relevance variable

Rewrite the PM initial score function, i.e., $P(R_q = y \mid \vec{A} = \vec{d})$, as

$$ P(R = y \mid D = d, Q = q) \qquad (4) $$

Here we are treating R as a random variable, but it may be argued that, for a given document and query, the relevance of the document to the query is either "yes" or "no". We take the "variance", i.e., what makes this an interesting probability distribution (not just 0 or 1), to be due to the user(s). This differentiates the score function from attribute binning.

Now we can get Q and R on the same side of the conditional by performing a Bayes flip on D and R, that is, by applying Bayes' rule to D and R while conditioning everything on Q = q. This allows us to rewrite (4) as

$$ \frac{P(D = d \mid R = y, Q = q)\,P(R = y \mid Q = q)}{P(D = d \mid Q = q)} \qquad (5) $$

Notice that $P(R = y \mid Q = q)$ is document-independent and can be taken out of the function. If we assume D to be independent of Q, we can replace $P(D = d \mid Q = q)$ with $P(D = d)$. We claim that this is a reasonable assumption, as it amounts to stating that the document generation/selection process is separate from query generation. We can now rewrite (5) as

$$ \frac{P(D = d \mid R = y, Q = q)}{P(D = d)} \qquad (6) $$

This is exactly what we had for our scoring function in the probabilistic model, i.e.,

$$ \frac{P(\vec{A} = \vec{d} \mid R_q = y)\,P(R_q = y)}{P(\vec{A} = \vec{d})} $$

– just the

