CS630 Representing and Accessing Digital Information
Lecture 2: Jan 31, 2006
Scribes: N. Sadat Shami, Gilly Leshed

Outline
1. An empirical approach to information retrieval (IR)
2. Reminder of the Vector Space Model (VSM)
3. Issues with document length normalization

1. An Empirical Approach to Information Retrieval (IR)

Many findings in IR were developed through empirical work rather than as a result of theory development. For example, the Vector Space Model (VSM) describes a way that a collection of documents can be represented for information retrieval, but initially this model had no theoretical underpinning beyond some intuitions. Traditional IR is guided by striving to improve and optimize performance (specifically precision) over previous work: demonstrating satisfactory performance results is sufficient for work to be accepted and applied in a working system.

2. Vector Space Model (VSM) Fundamentals

Recall from the previous lecture that the VSM represents each document d from a corpus as a vector

$$\vec{d} = [d_1, \ldots, d_j, \ldots, d_m]^T,$$

where d_j is the weight of term v(j) in document d and m is the number of terms in the vocabulary. It is generally accepted that the following three concepts should be incorporated into the term weights of a document: idf_j, tf_j(d), and norm_j(d).

a. idf_j

idf_j (inverse document frequency) is an attribute of a term in a corpus, representing the rareness of the term. Measures of this quantity decrease with the fraction of documents in the corpus that contain the term v(j). A high idf_j value indicates that the term is rare and therefore more useful for distinguishing between documents; the lower the value, the more widespread the term is across the corpus, and the weaker its power to discriminate between documents. For example, in a computer science document collection, the term "computer" may appear in many documents, so it has a low idf_j value and does not help discriminate between documents. On the other hand, if the term "parallel processing" appears in only some of the documents in the collection, it has a higher idf_j value and helps single out the subset of documents that discuss parallel processing. Note that idf_j depends on the entire corpus, not on a specific document.

b. tf_j(d)

tf_j(d) (term frequency) represents the occurrence of term v(j) in document d, increasing with the frequency of the term in the document. A high tf_j(d) value implies that the term is representative of the document. For instance, if a document contains the word "cat" many times, there is a good chance that the document is about cats.

c. norm_j(d)

The raw number of terms in a document creates a bias against short documents, since long documents simply contain more terms. Another motivation for normalization is spam: documents that "hide" repeated terms to inflate their term frequencies. A normalization function is therefore required; this concept is expanded in Section 3 below.

d. Document Relevance to Query

The rationale behind the VSM for IR is that both queries and documents are represented in the same vector space, and the higher the similarity between the query and the document, the higher the document is ranked as relevant for the query:

$$\vec{q} \cdot \vec{d} \;=\; \sum_j q_j \times \frac{tf_j(d) \times idf_j}{norm_j(d)} \;=\; \sum_j \underbrace{(q_j \times idf_j)}_{\text{same for all documents}} \times \underbrace{\frac{tf_j(d)}{norm_j(d)}}_{\text{specific to each document}}$$

In this representation, a document has a higher chance of being ranked relevant for a query if it contains a distinctive query term (i.e., high idf_j) that occurs many times (i.e., high tf_j(d)), although the normalization factor can alter this behavior.
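To make the scoring formula above concrete, here is a minimal Python sketch. It is an illustration under assumptions, not the lecture's specification: the idf formula log(N/df), the simple document-length normalization norm(d) = number of tokens, and all names and the toy corpus are hypothetical choices.

```python
import math
from collections import Counter

def idf(term, corpus):
    """Inverse document frequency (assumed form: log(N / df)), where df is
    the number of documents containing the term; rare terms score higher."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df) if df else 0.0

def tf(term, doc):
    """Raw term frequency: the number of occurrences of the term in doc."""
    return Counter(doc)[term]

def score(query, doc, corpus, norm):
    """Dot-product similarity: sum over query terms of
    q_j * tf_j(d) * idf_j / norm(d)."""
    return sum(tf(t, query) * tf(t, doc) * idf(t, corpus) / norm(doc)
               for t in set(query))

# Hypothetical toy corpus: each document is a list of tokens.
corpus = [["computer", "parallel", "processing"],
          ["computer", "cat", "cat", "cat"],
          ["computer", "graphics"]]

query = ["cat", "computer"]
# "computer" occurs in every document, so its idf (and contribution) is 0;
# the rare, repeated "cat" dominates, and the cat-heavy document ranks first.
# Here norm(d) = len(d), plain document length, just for illustration.
ranked = sorted(corpus, key=lambda d: score(query, d, corpus, norm=len),
                reverse=True)
print(ranked[0])  # ['computer', 'cat', 'cat', 'cat']
```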
3. Document Length Normalization Issues

This section discusses the SIGIR '96 paper by Singhal, Buckley, and Mitra, "Pivoted document length normalization".

a. Why Normalize Document Length

There are two main reasons to normalize term weights, both reducing the advantage long documents have simply because they contain more terms:

1. Higher term frequencies: long documents tend to repeat the same term over and over. As a result, the term frequencies of long documents tend to be higher, increasing the average contribution of their terms to the query-document similarity [1].

2. More terms: long documents are likely to contain a greater variety of terms than short documents. This increases the number of non-zero term frequency values for long documents, increasing the number of matches between a query and a long document and hence the query-document similarity [1].

b. Normalization Functions

Various normalization approaches can be used. One common normalization function is the maximum term frequency:

$$norm_j(d) = \max_{j'} tf_{j'}(d)$$

This function addresses the first reason for normalization (inflated tf values due to repetition) but does not address the second (many non-zero tf values), so it still tends to favor the retrieval of longer documents. It can also be problematic for the following reasons:

1. There can be outliers: a document with a very large number of occurrences of one specific term that is not representative of the rest of the document.

2. A document in which the most frequent term appears X times and the rest of the terms appear fewer than X times should not be treated the same as another document in which many terms appear X times.

Another widely used normalization function is cosine normalization, computed as the Euclidean length of the document's weight vector:

$$norm(d) = \lVert \vec{d} \rVert = \sqrt{\sum_j \big(tf_j(d) \times idf_j\big)^2}$$

The cosine normalization value grows both with higher term frequencies (reason 1 above) and with more non-zero tf values (reason 2). As a result, the query function becomes

$$\vec{q} \cdot \frac{\vec{d}}{\lVert \vec{d} \rVert},$$

so the rank function depends only on the direction of the document vector in the vector space; the length of the document vector is ignored. For example:

[Figure: a query vector q⃗ and two document vectors d⃗1, d⃗2 of different lengths; under cosine normalization, only the angle each document vector makes with q⃗ determines its rank.]

c. The Relevance/Retrieval Gap

Using normalization functions that ignore document length would be appropriate if length were not correlated with relevance to the query. This question was examined empirically by Singhal et al.
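As a small, self-contained check of the length-invariance claim in Section 3b, the sketch below (a toy illustration, not the paper's implementation; the weight vectors and function names are hypothetical) scores a document vector and a copy of it scaled by 10 under cosine normalization, and also contrasts cosine normalization with the max-tf variant on a document with extra distinct terms:

```python
import math

def cosine_norm(weights):
    """Cosine normalization factor: the Euclidean length of the
    document's weight vector, sqrt(sum of squared weights)."""
    return math.sqrt(sum(w * w for w in weights.values()))

def max_tf_norm(weights):
    """Maximum-weight normalization factor (the max-tf variant above)."""
    return max(weights.values())

def similarity(query, doc, norm_fn):
    """Dot product of the query weights with the normalized document weights."""
    n = norm_fn(doc)
    return sum(q * doc.get(t, 0.0) / n for t, q in query.items())

# Hypothetical tf*idf weight vectors, keyed by term.
query = {"cat": 1.0, "parallel": 1.0}
d1 = {"cat": 2.0, "dog": 1.0}
d2 = {t: 10 * w for t, w in d1.items()}       # same direction, 10x the length
d3 = dict(d1, fish=1.0, bird=1.0)             # more distinct terms, same max weight

print(similarity(query, d1, cosine_norm))     # 2/sqrt(5), about 0.894
print(similarity(query, d2, cosine_norm))     # identical: length is ignored

print(similarity(query, d1, max_tf_norm))     # 1.0
print(similarity(query, d3, max_tf_norm))     # also 1.0: max-tf misses reason 2
print(similarity(query, d3, cosine_norm))     # smaller: cosine penalizes extra terms
```

The first two calls print the same value, showing that cosine normalization makes rank depend only on the angle to the query. The max-tf comparison shows why that function favors long documents: adding distinct terms to d1 leaves its max-tf score unchanged, while cosine normalization accounts for them. Whether discarding length entirely is appropriate is exactly the question Section 3c raises.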

