
CS630 Lecture 3, 2 Feb 2006
COM S/INFO 630: Representing and Accessing [Textual] Digital Information
Lecturer: Lillian Lee
Lecture 3: 2 February 2006
Scribes: Siavash Dejgosha (sd82) and Ricardo Hu (rh238)

Pivoted Length Normalization

I. Summary

Document length normalization schemes attempt to eliminate the advantage that long documents have over shorter documents under a given scoring scheme. However, Singhal et al.'s work suggests that cosine normalization over-penalizes long documents relative to the actual distribution of relevant documents: retrieval using cosine normalization tends to retrieve shorter documents more often, and longer documents less often, than the distribution of lengths of relevant documents suggests. Pivoted Document Length Normalization (PDLN) modifies the standard cosine document length normalization (CDLN) scheme to overcome this: it transforms CDLN so that the probability of retrieving a document of a given length better matches the true probability that a document of that length is relevant.

Notation

d: a document in the corpus
tf: term frequency
idf: inverse document frequency
w_i: weight of the i-th term in the vocabulary, w_i = tf * idf

    norm(d) = sqrt(w_1^2 + w_2^2 + ... + w_m^2)

II. Review

Consider the distribution of relevant documents according to document length. We can construct a "probability" vs. length graph comparing the relevant documents in a corpus with the presumed-relevant documents retrieved by a standard cosine normalization scheme.

Observations

I. Neither curve is linear, despite cosine normalization being a linear function of the document's L2 norm. This is because the underlying relevance distribution is apparently non-linear.

II. The shape of the cosine normalization curve does not exactly match that of the relevance curve. The shape of the retrieval curve is determined by the IR system's representation and ranking of documents, which is not necessarily an accurate function of document length.
a.
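As a concrete illustration of the notation above, here is a minimal Python sketch of computing the weights w_i = tf * idf and the resulting norm(d). The function name `cosine_norm` and the idf values are ours, chosen for illustration; they are not from the lecture.

```python
import math
from collections import Counter

def cosine_norm(doc_tokens, idf):
    """Compute norm(d) = sqrt(w_1^2 + ... + w_m^2), where w_i = tf_i * idf_i.

    `idf` maps each vocabulary term to its inverse document frequency.
    Terms absent from `idf` get weight 0 (a simplifying assumption here).
    """
    tf = Counter(doc_tokens)  # term frequencies for this document
    return math.sqrt(sum((tf[t] * idf.get(t, 0.0)) ** 2 for t in tf))

# Toy example with a made-up three-term vocabulary:
idf = {"pivot": 2.0, "length": 1.0, "norm": 1.5}
print(cosine_norm(["pivot", "pivot", "length"], idf))  # sqrt(4^2 + 1^2) ≈ 4.123
```

Under standard CDLN, a document's score is divided by this norm, which is what gives rise to the length bias discussed below.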
For lengths below the crossing point, the cosine curve lies above the relevance curve. That is, short documents are more likely to be retrieved than their "true" likelihood of relevance warrants.
b. For lengths above the crossing point, the cosine curve lies below the relevance curve. That is, long documents are less likely to be retrieved than their "true" likelihood of relevance warrants.

III. Reliability of statistics over different lengths. Suppose there are only 2 documents of length 13,000 but 5,000 documents of length 13,001. The statistics at length 13,001 are then much more reliable than those at length 13,000. We adjust for this by grouping documents into bins of 1,000 documents each.

[Figure: "Relevance vs. Retrieval with cosine normalization" -- "probability" of relevance/retrieval plotted against document length; the "true" relevance curve and the cosine-normalization curve intersect at the crossing point.]

Graph Construction
1. Sort the set of all documents in the corpus by length.
2. Create bins, B_i, of 1,000 documents each, in sorted order.
3. The length of a bin is the median length of the documents in it: len(B_i) = median(len(d) for all d in B_i).
4. Each bin is assigned a probability, the ratio of relevant documents in the bin to the total number of relevant documents in the corpus: prob(B_i) = P(d in B_i | d is relevant).

III. Derivation of Pivoted Length Normalization (refs. 1, 2)

Motivation: From Observation II above, norm(d) should be lower for longer documents, because longer documents are penalized too much by cosine normalization.

Solution: Map the old normalization to a new normalization, norm'(d), with an appropriate value for longer documents. We use a linear mapping:

[1] norm'(d) = m' * norm(d) + b',  where m' and b' are parameters.

This equation defines pivoted document length normalization in terms of cosine normalization. The pivot, p, is the norm of the document d_p whose length corresponds to the crossing point in the graph above. At this point, norm'(d_p) = norm(d_p). Note that there is some leeway in specifying d_p.
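The Graph Construction steps earlier (sort by length, fixed-size bins, median bin length, per-bin relevance probability) can be sketched as follows. The name `build_bins` and its input format are our own illustrative choices, not from the lecture.

```python
from statistics import median

def build_bins(docs, relevant_ids, bin_size=1000):
    """Sketch of the Graph Construction steps.

    `docs` is a list of (doc_id, length) pairs; `relevant_ids` is the set
    of ids of relevant documents. Returns a list of (bin_length, bin_prob)
    pairs: the median length of each bin, and the fraction of all relevant
    documents that fall into that bin, i.e. P(d in B_i | d is relevant).
    """
    docs = sorted(docs, key=lambda d: d[1])        # 1. sort by length
    total_relevant = len(relevant_ids)
    bins = []
    for i in range(0, len(docs), bin_size):        # 2. fixed-size bins
        bin_docs = docs[i:i + bin_size]
        bin_len = median(length for _, length in bin_docs)  # 3. median length
        rel = sum(1 for doc_id, _ in bin_docs if doc_id in relevant_ids)
        bins.append((bin_len, rel / total_relevant))        # 4. bin probability
    return bins

# Tiny example (bin_size=2 so the structure is visible):
print(build_bins([(1, 10), (2, 30), (3, 20), (4, 40)], {1, 3}, bin_size=2))
# -> [(15.0, 1.0), (35.0, 0.0)]
```

The last, shortest bin may hold fewer than `bin_size` documents; with 1,000-document bins this edge effect is negligible.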
It is not clear whether they would have used the document with the median length in the bin representing the crossing point, or all the relevant documents in that bin, to determine d_p. We do not need an exact specification of d_p, because the graph above is not plotted from actual data; rather, it is intended to show how the "tilting" transformation works. We can simply treat d_p as a representative document whose length is approximately the value at the crossing point.

1 Singhal, Salton, Mitra, and Buckley. "Document Length Normalization." Information Processing & Management, vol. 32, no. 5, Sep. 1996, pp. 619-633.
2 Singhal, Buckley, and Mitra. "Pivoted Document Length Normalization." ACM SIGIR, 1996.

[Figure: pivoted normalization factor plotted against cosine normalization factor. The pivoted-normalization line crosses the identity (cosine normalization) line at the pivot; the angle between the lines is α, so the pivoted line has slope m' = tan(α).]

At the pivot, norm'(d_p) = norm(d_p) = p. Substituting into [1], with m' = tan(α):

    p = m' * p + b'

[2] b' = p * (1 - m')

Using [1] and [2], we can write

[3] norm'(d) = m' * norm(d) + p * (1 - m')

Removing one parameter

Note that for any constant c > 0 (constant with respect to the document d),

    c * score(d) =_rank score(d),

where =_rank means "produces the same ranking as": a positive constant factor does not affect the ranking. We can use this equivalence to define a normalization function with a simpler constant term:

    norm''(d) = (1 / (p * (1 - m'))) * norm'(d)

[4] norm''(d) = (m' / (p * (1 - m'))) * norm(d) + 1

Let's try to fix one of the parameters and optimize the other one. How about a good p? Let p = norm_avg, the average of norm(d) over all documents. We claim there exists an m'' such that

    m'' / (norm_avg * (1 - m'')) = m' / (p * (1 - m')).
Solving for that m'', we arrive at:

[5] m'' = (norm_avg * m' / (p * (1 - m'))) / (1 + norm_avg * m' / (p * (1 - m')))

Now we define another new normalization, norm'''(d), using [4] and [5]:

    norm'''(d) = (m'' / (norm_avg * (1 - m''))) * norm(d) + 1

Multiplying by norm_avg * (1 - m'') defines norm_iv(d), which is again equivalent for ranking purposes:

[6] norm_iv(d) = m'' * norm(d) + norm_avg * (1 - m'')

Notice that for a document of average length, norm(d) = norm_avg, so norm_iv(d) = m'' * norm_avg + (1 - m'') * norm_avg = norm_avg. That is, a document of average length is already of "appropriate" length and its normalization is left unchanged under this scheme. Assuming the linear correction, we have thus reduced the two-parameter search problem to a search for the single parameter m''. All the newly defined norms are equivalent for ranking purposes, since each differs from norm'(d) only by a positive constant factor.
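The two forms of the pivoted scheme derived above can be sketched numerically. This is a minimal Python sketch; the function names and the parameter values (slope 0.3, pivot 5.0, etc.) are ours, chosen for illustration only.

```python
def pivoted_norm(norm_d, slope, pivot):
    """Two-parameter pivoted normalization, equation [3]:
    norm'(d) = m' * norm(d) + p * (1 - m')."""
    return slope * norm_d + pivot * (1.0 - slope)

def pivoted_norm_one_param(norm_d, slope, avg_norm):
    """One-parameter form, equation [6], with the pivot fixed at the
    average norm: norm_iv(d) = m'' * norm(d) + (1 - m'') * norm_avg."""
    return slope * norm_d + (1.0 - slope) * avg_norm

# At the pivot, the new and old normalization factors agree ...
print(pivoted_norm(5.0, slope=0.3, pivot=5.0))               # -> 5.0
# ... while a document with norm above the pivot is penalized less
# than under plain cosine normalization (6.5 instead of 10.0):
print(pivoted_norm(10.0, slope=0.3, pivot=5.0))              # -> 6.5
# In the one-parameter form, a document of average norm is unchanged:
print(pivoted_norm_one_param(4.0, slope=0.3, avg_norm=4.0))  # -> 4.0
```

With slope < 1, the line "tilts" the normalization exactly as in the derivation: factors below the pivot are raised and factors above it are lowered, which shifts retrieval probability from short documents to long ones.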

