CORNELL CS 674 - Pivoted Document Length Normalization

INFO 630 / CS 674 Lecture Notes
Pivoted Document Length Normalization
Lecturer: Lillian Lee
Lecture 3: September 4, 2007
Scribes: Vladimir Barash, Stephen Purpura, Shaomei Wu

Introduction and Motivation

Today's lecture covers pivoted document length normalization, by Singhal, Buckley, and Mitra, from SIGIR '96. Before diving into the details, we review our classic vector space model (VSM) derivation for ad hoc information retrieval. The goal is to rank the documents in a corpus C by their relevance to a query q, which expresses the user's information need. The VSM represents each document d as a vector whose entries correspond to terms, each term being an element of the corpus vocabulary. The vocabulary might be the set of all words, phrases, or other units of observation occurring in the corpus, but sometimes the set of terms is more restricted (see Porter stemming for an example). The document vector's elements are term weights d[1], ..., d[m], with element d[j] giving the weight for the document's use of vocabulary term v_j.
By consensus, the document vector's term weights are formed from three components:

    d[j] = tf_d(j) · idf(j) / norm(d)

In the above equation:
• tf_d(j) is some function of the frequency of term v_j within document d
• idf(j) is inversely related to the number of documents in C that contain v_j
• norm(d) is a document-length normalization factor, discussed below

The term-weighting component is intended to measure whether a term is a good characterizer of the document, and it is used within our match (or scoring) function:

    match(q, d) = Σ_j q[j] · d[j]

Unfortunately, this term-weighting scheme can unfairly advantage long documents in two ways:
(1) term frequency (tf) counts are bigger in longer documents, because there is a larger pool of word positions to draw from;
(2) longer documents have more non-zero term frequencies, because the probability of any given vocabulary word appearing in a document increases with the document's length.

We wish to avoid a match function that is biased toward ranking long documents above shorter ones when the shorter documents are actually more relevant; normalization is therefore used within the match function to compensate for this bias.

In the last lecture we reviewed two normalization methods for correcting this bias, L1-normalization and L2-normalization, and L2-normalization was shown to be the more useful of the two. In this lecture we examine "pivoted document length normalization" from [SBM '96]. The paper is an interesting example of empirical research conducted by graduate students at Cornell: the researchers confronted the question of whether L2-normalization was the best engineering solution for achieving the user's information retrieval goal. Using empirical research methods, they conclude that "better retrieval effectiveness results when a normalization strategy retrieves documents with chances similar to their probability of relevance." [SBM '96]
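The term-weighting and match functions above, and the length bias they introduce, can be made concrete with a minimal Python sketch. All names and the toy corpus here are illustrative, and log(N/df) is just one common instantiation of idf; the notes do not fix a specific formula:

```python
import math
from collections import Counter

def idf(term, corpus):
    """Inversely related to the number of documents containing the term.
    log(N / df) is one common choice, used here for illustration."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df) if df else 0.0

def raw_weights(doc, corpus):
    """Unnormalized term weights: tf_d(j) * idf(j), before applying norm(d)."""
    tf = Counter(doc)
    return {t: tf[t] * idf(t, corpus) for t in tf}

def match(query, weights):
    """Dot-product match function: sum over j of q[j] * d[j]."""
    q = Counter(query)
    return sum(q[t] * weights.get(t, 0.0) for t in q)

# Toy corpus: the "long" document is just the short one repeated ten
# times, yet without normalization it scores ten times higher.
corpus = [["pivot", "norm"],
          ["pivot", "norm"] * 10,
          ["unrelated", "text"]]
short_w = raw_weights(corpus[0], corpus)
long_w = raw_weights(corpus[1], corpus)
# match(["pivot"], long_w) > match(["pivot"], short_w) -- pure repetition wins
```

This is exactly bias (1) at work: the repeated document adds no new content, but its larger tf counts inflate the unnormalized score.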
L2-Normalization Review

We (and [SBM '96]) note that the choice of normalization function is partly based on theory, but mostly attuned to achieving high performance on our retrieval goal of ranking the most relevant documents highest for the query. In this sense, prior to [SBM '96] the assumption was that L2-normalization was performing well. [SBM '96] investigates how well L2-normalization performed in practice on the TREC corpora, and it proposes a new normalization function, "pivoted document length normalization," which the authors demonstrate is better at achieving the stated user information need on a subset of the TREC corpora.

Recall that our normalization function, norm(d), can be considered the length penalty that addresses the two problems of long-document bias. In L2-normalization, norm(d) is set as follows:

    norm(d) = sqrt( Σ_j (tf_d(j) · idf(j))² )

Our norm(d) is applied to every term in the match function, but it is (A) term independent and (B) document dependent. This norm(d) corresponds to cosine scoring, which seems reasonable and is not obviously refutable. At the time of [SBM '96], L2 was a common normalization function in the information retrieval literature.

Empirically Validating the Performance of the L2-Normalization Function

The first task in [SBM '96] is to empirically check whether the performance of the norm function (within the context of the term-weighting function and match function) is well fit. More concretely: how does the length distribution of (truly) relevant documents compare to the length distribution of retrieved documents, with respect to L2-normalization? The plots in Figure 1, taken from [SBM '96], make this comparison. First, 741,856 documents from the TREC corpora were ranked in order of file byte size and then divided into 742 bins of 1,000 documents each (the final bin, with the largest file sizes, had 856 documents).
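The L2 length penalty reviewed above can be sketched in a few lines. The weight dictionaries are toy values, not data from the paper:

```python
import math

def l2_normalize(weights):
    """Apply norm(d) = sqrt(sum over j of raw_weight[j]^2):
    divide each raw tf*idf weight by the vector's Euclidean length."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm > 0 else dict(weights)

# A document concatenated with itself doubles every raw weight but has
# the identical L2-normalized vector: cosine scoring cancels pure
# repetition, which is the length penalty at work.
raw = {"pivot": 0.4, "norm": 0.4}
doubled = {t: 2.0 * w for t, w in raw.items()}
```

Because every document vector ends up with unit Euclidean length, the penalty is term independent and document dependent, exactly as noted above.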
For the plots in Figure 1, the median document length of each bin was used to generate a point for that bin. Next, 9,805 relevant query-document pairs (q, d) were generated by finding, for each of 50 TREC queries, the documents d judged relevant to query q among the 741,856 documents. Since the objective of normalization is to compensate for ranking bias caused by document length, Figure 1 helps us clarify the relationship between relevance and document length in real-world corpora.

Graph (a) shows that the probability that a bin contains a relevant document increases with the bin's file size: long documents do have higher probability of relevance than short documents, which can be explained by the fact that long documents usually cover more topics and have broader content. Graph (b) shows that the probability that a document is retrieved under L2-normalization follows a pattern similar to graph (a), but the probability of retrieval for larger documents does not increase as rapidly as the probability of relevance. Graph (c) shows the implication of graphs (a) and (b): the penalty imposed by L2-normalization on relevant documents longer than a pivot point p is greater than desired, while documents shorter than the pivot point p have a greater probability of retrieval than of relevance.
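The remedy [SBM '96] develops from graph (c) is to "tilt" the old normalization factor around the pivot. A sketch of that pivoted factor follows; the slope and pivot values used below are purely illustrative (in the paper the slope is tuned empirically, and the pivot can be set to the average old normalization factor over the collection):

```python
def pivoted_norm(old_norm, pivot, slope):
    """Pivoted normalization factor from [SBM '96]:
    (1 - slope) * pivot + slope * old_norm.
    At old_norm == pivot the penalty is unchanged; shorter documents
    (old_norm < pivot) are penalized more, longer ones less, pushing
    the retrieval curve of graph (b) toward the relevance curve of
    graph (a)."""
    return (1.0 - slope) * pivot + slope * old_norm

# With an illustrative slope of 0.75 and pivot of 10.0:
# pivoted_norm(10.0, 10.0, 0.75) == 10.0   (unchanged at the pivot)
# pivoted_norm(5.0, 10.0, 0.75)  == 6.25   (penalty raised for short docs)
# pivoted_norm(20.0, 10.0, 0.75) == 17.5   (penalty lowered for long docs)
```

Dividing raw term weights by this pivoted factor, instead of by the raw L2 norm, is what shifts retrieval probability toward the relevance distribution shown in Figure 1.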

