
CS674/INFO630: Advanced Language Technologies, Fall 2007, Lecture by Lillian Lee
Lecture 5 Guide: An Introduction to Probabilistic Retrieval
Alex Chao, David Collins
September 11, 2007

1 Introduction

In the last lecture, we finished our discussion of pivoted document-length normalization. Recall that the motivation for that technique was that information retrieval (IR) systems using the vector space (VS) model with L2 length normalization were shown to be biased toward retrieving short documents. Simply altering the term frequency and inverse document frequency weights themselves (by coming up with the "correct" term weights based on individual terms alone, as opposed to collective term statistics) would not be sufficient to alleviate the bias, because normalizing the document vectors correctly requires information about the document as a whole.

Pivoted document-length normalization, and the VS model paradigm as a whole, are very empirically driven. In this lecture, we introduce the probabilistic retrieval paradigm of information retrieval, a much more theoretically inclined perspective. It has developed in parallel with the VS paradigm, but, as we will see in later lectures, the two seem to have reached many of the same conclusions. We will discuss the Robertson & Spärck Jones (RSJ) variant of probabilistic retrieval in this lecture. For a brief history of this topic, see C.J. van Rijsbergen's paper "The emergence of probabilistic accounts of information retrieval" (2005). As we discuss the probabilistic paradigm, it is important that we compare and contrast it with the VS model, both in spirit and in the techniques used. Although probabilistic IR comes primarily from principled and theoretical ideas, it uses both statistical estimation and empirical substitution at times.

2 Setup

Assume we have a set of m attribute random variables A_j, each corresponding to a document characteristic.
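To make the setup concrete, here is a minimal sketch (in Python) of mapping a document to an attribute vector; the two attributes used are illustrative stand-ins for the kinds of document characteristics discussed in this section, not part of the lecture's formal model:

```python
# Sketch: map a document to an attribute vector d, where d[j] is the
# value of attribute random variable A_j for this document.
# The particular attributes below are illustrative, not prescribed.

def attribute_vector(text):
    words = text.lower().split()
    return {
        # A_1: does the document contain both "car" and "ithaca"?
        1: "yes" if {"car", "ithaca"} <= set(words) else "no",
        # A_17: how many words long is the document?
        17: len(words),
    }

d = attribute_vector("I drove my car from Ithaca to Syracuse")
print(d[1], d[17])  # yes 8
```

Note that attribute values need not be binary (A_17 is numeric); the only structural requirement imposed below is a distinguished "does not exhibit" value.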
Unlike in the VS model, the attributes here generalize to all sorts of ways of describing a document, not just the presence of terms (at least for now). Here are two examples:

• A_1 = yes ≡ the document contains both "car" and "Ithaca".
• A_17 = 14 ≡ the document is 14 words long.

We are intentionally being as general as possible, so as not to limit the scope of our model unnecessarily. It is important for the purposes of our presentation, however, that the range of each attribute A_j contain a distinguished value 0 (or "no"), meaning that the document does not exhibit this attribute. It will become clear later in the lecture why we need this.

We will represent a document d as \vec{d}, where d[j] is the value of A_j for d. Note that (ignoring normalization issues) the VS model is a special case of this, where each attribute refers to a term. As for scoring a document for "relevance", we will again intentionally keep the semantics slightly undetermined so that we do not unnecessarily lose generality.

Let R_q be a random variable with respect to a query q; R_q refers to how relevant a document is to q. For now, we will assume R_q ∈ {y, n}. This range generalizes easily, however; it does not need to be binary.

We will thus rank the documents using the following probability:

    \Pr(R_q = y \mid \vec{A} = \vec{d})

This choice requires some examination. Namely, why is this quantity described as a probability if a particular document can be classified as either relevant or not (probabilities of 1 or 0, respectively)? The probability is in fact due to, perhaps among other things, the following factors:

• The set of attributes is not uniquely specifying.
That is, the attributes may effectively lead to classifications of the documents in which a single "bin" contains both relevant and non-relevant documents.

• There may be variation among users, who judge the relevance of a given document differently; there may even be variation for a single user, whose judgment changes with time or depends on a specific context.

It should be noted that Robertson & Spärck Jones do not use \Pr(R_q = y \mid \vec{A} = \vec{d}), but instead use the following ranking function:

    \log\left(\frac{\Pr(R_q = y \mid \vec{A} = \vec{d})}{\Pr(R_q = n \mid \vec{A} = \vec{d})}\right)

A claim was made in lecture that both ranking functions produce the same result, with the latter having more convenient mathematical properties. In any case, our overall claim is that this general probabilistic ranking method better matches the retrieval goal, though there are objections (see Gordon & Lenk '92).

3 Derivation

Assuming we rank using \Pr(R_q = y \mid \vec{A} = \vec{d}), there are still several challenges to consider, the first being that there are no relevance labels for the documents. Also, there is not much information regarding the particular attribute vectors.

We start by performing a Bayes' "flip" on the ranking function in order to condition on the variable with fewer possible values (or "bins"):

    \Pr(R_q = y \mid \vec{A} = \vec{d}) = \frac{\Pr(\vec{A} = \vec{d} \mid R_q = y)\,\Pr(R_q = y)}{\Pr(\vec{A} = \vec{d})}

This flip appears to leave us worse off than when we started, since we have more unknowns, until we take document independence into account. A term of the equation is said to be document independent if it contributes equally to every document's score, thereby making it independent of any given document, so that its omission still preserves the overall ranking of the documents. Such terms can be dropped from the equation without altering the relative ranks of the documents. In this case, the term \Pr(R_q = y) is document independent. We are left with:

    \frac{\Pr(\vec{A} = \vec{d} \mid R_q = y)}{\Pr(\vec{A} = \vec{d})}    (1)

At this point, we would like to break this function down further, as it still contains variables whose values are not observed (e.g., R_q = y).
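The rank-equivalence claims above — that the log-odds form, and likewise the form with the document-independent factor \Pr(R_q = y) dropped, order documents exactly as the posterior does — can be checked numerically. Here is a toy sketch; every probability in it is invented purely for illustration:

```python
import math

# Toy check that three rank-equivalent scoring functions order documents
# identically. All probabilities below are invented for illustration.
# For each document we posit Pr(A=d | Rq=y) and Pr(A=d); Pr(Rq=y) is a
# document-independent constant, so dropping it cannot change the ranking.

P_REL = 0.3  # Pr(Rq = y): the same for every document

docs = {             # doc id: (Pr(A=d | Rq=y), Pr(A=d))
    "d1": (0.20, 0.10),
    "d2": (0.10, 0.08),
    "d3": (0.05, 0.02),
}

def posterior(likelihood, evidence):
    # Bayes "flip": Pr(Rq=y | A=d) = Pr(A=d | Rq=y) Pr(Rq=y) / Pr(A=d)
    return likelihood * P_REL / evidence

def log_odds(likelihood, evidence):
    # RSJ form: log [ Pr(Rq=y | A=d) / Pr(Rq=n | A=d) ]
    p = posterior(likelihood, evidence)
    return math.log(p / (1.0 - p))

def dropped_constant(likelihood, evidence):
    # Equation (1): Pr(A=d | Rq=y) / Pr(A=d), with Pr(Rq=y) omitted
    return likelihood / evidence

rankings = [
    sorted(docs, key=lambda d: score(*docs[d]), reverse=True)
    for score in (posterior, log_odds, dropped_constant)
]
print(rankings)  # all three orderings agree: [['d3', 'd1', 'd2']] * 3
```

The agreement is no accident: log(p / (1 - p)) is a strictly increasing function of p, and dividing every score by the same positive constant \Pr(R_q = y) is order-preserving.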
One might try to assume conditional independence between the elements of \vec{A} and R_q simultaneously with an independence among the elements of \vec{A}. However, it can be argued that such an assumption implies logical inconsistencies (Cooper '95), so we turn instead to a type of linked dependence assumption, which enables us to decompose the ranking function as follows:

    k \prod_{j=1}^{m} \frac{\Pr(A_j = d[j] \mid R_q = y)}{\Pr(A_j = d[j])}, \quad k > 0    (2)

The constant k is a factor that accounts for the deviation from independence that the decomposition introduces. Given that we are concerned only with ranking our documents, this term falls out of the equation.

We continue trying to simplify the ranking function, as we are still hindered by the relevance component. We proceed by factoring the quantity based on the appearance or absence of the terms in the query, since the query is the only "clue" we have
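Equation (2) is convenient to evaluate in log space, which turns the product over attributes into a sum (and turns the positive constant k into an additive constant that drops out of ranking). A sketch, with invented per-attribute probability estimates — the lecture has not yet said how such estimates are obtained, so these numbers are purely illustrative:

```python
import math

# Sketch of scoring via the decomposition in equation (2):
#   score(d) = sum_j log [ Pr(A_j = d[j] | Rq = y) / Pr(A_j = d[j]) ]
# The positive constant k becomes an additive log-constant shared by all
# documents, so it is omitted. All estimates below are invented.

# (attribute index j, observed value v) -> (Pr(A_j=v | Rq=y), Pr(A_j=v))
estimates = {
    (1, "yes"): (0.40, 0.10),
    (1, "no"):  (0.60, 0.90),
    (2, "yes"): (0.20, 0.25),
    (2, "no"):  (0.80, 0.75),
}

def rsj_score(doc_vector):
    """doc_vector maps attribute index j to the observed value d[j]."""
    total = 0.0
    for j, v in doc_vector.items():
        p_given_rel, p_marginal = estimates[(j, v)]
        total += math.log(p_given_rel / p_marginal)
    return total

d1 = {1: "yes", 2: "no"}   # exhibits attribute 1 but not attribute 2
d2 = {1: "no",  2: "yes"}
print(rsj_score(d1) > rsj_score(d2))  # True: d1's attributes favor relevance
```

Each log term is positive exactly when the attribute value is more likely among relevant documents than in the collection at large, which is the intuition the rest of the derivation builds on.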