
Representing and Accessing [Textual] Digital Information (COMS/INFO 630), Spring 2006
Lecture 17: The Singular Value Decomposition, 4/4/06
Lecturer: Lillian Lee
Scribes: Nicolas Hamatake (nh39), Min Park (mp257)

1 Introduction

1.1 Overview

In today's lecture we will finally state the Singular Value Decomposition "theorem". To build some intuition for it, we will continue exploring the underlying geometric interpretations of matrix-theoretic corpus characterizations. We wish to develop a general way of succinctly describing properties inherent in some corpora using matrices. Since we already have a vector-space representation for documents, we are interested in characterizing the spread of the document vectors in a geometric sense. We will revisit two possible characterizations that were discussed last time, and consider two more today (see Section 2.1 for an explanation of the notation):

1. Rank(D);
2. The convex hull of the column vectors of D;
3. The convex hull of the column vectors of D and -D;
4. The image of the unit (n-1)-sphere (⊂ R^n) under D, viewed as a linear operator.

Having spent nearly two lectures on algebra and geometry with only sketchy references to their application to information retrieval, the lecture concludes by initiating a discussion of the implications of the SVD from an IR perspective.

1.2 Inherent Properties of the Corpus—No Queries Involved

The conclusion of the lectures on relevance feedback raised the question: can corpora be characterized prior to having feedback information? Until the beginning of the last lecture, most of our references to corpora were made with respect to some sort of relevance information related to some query (including the query itself); the idf was an exception. Should we expect that there are properties "inherent" to a corpus? Is it reasonable to think that corpora have structure? It certainly seems plausible that the structure of a corpus would be formed and maintained naturally and/or intentionally.
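Characterization 4 in the list above can be previewed numerically: the largest stretch that D, viewed as a linear operator, applies to any unit vector is its largest singular value. A minimal sketch in numpy, with a made-up 2x3 matrix standing in for a term-document matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

# A made-up 2x3 term-document matrix: m = 2 terms, n = 3 documents.
D = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0]])

# Sample many unit vectors on the sphere in R^3 and map them through D.
alphas = rng.normal(size=(3, 10000))
alphas /= np.linalg.norm(alphas, axis=0)
stretch = np.linalg.norm(D @ alphas, axis=0)

# No sampled direction is stretched by more than the top singular value,
# and the sampled maximum approaches it.
sigma = np.linalg.svd(D, compute_uv=False)
print(sigma[0])        # top singular value; for this D, sqrt(6) ≈ 2.449
print(stretch.max())   # close to (and never above) sigma[0]
```

This is only a preview of the geometry developed below; the SVD statement itself comes later in the lecture.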
For instance, the creator(s) of a corpus might intend to perform some process (such as information retrieval) on it, or expect that others would. To study structural properties of corpora, we began with a search for a good representation of the corpus: a succinct and informative summary of it. Last time, some new and some familiar vector-space notation was discussed, along with two potential corpus characterizations.

2 Matrix-Theoretic Corpus Characterization

2.1 Review of Basic Notation

Assume there are n documents in the corpus with vector representations d^{(1)}, ..., d^{(n)}, and assume an m-term vocabulary. The feature term-document matrix is then the m-by-n matrix

    D = [ d^{(1)} | d^{(2)} | ... | d^{(n)} ],

whose i-th column is the vector for document i. Superscripts will be used to denote matrix columns; subscripts will be used to denote vector components.

2.2 Characterizations

The goal is to succinctly characterize the corpus in terms of the variability of its documents. In terms of our matrix framework and its notation, we are interested in the "spread" of the d^{(i)}'s. Two corpus characterizations introduced in the previous lecture are reviewed here, and two more are discussed. Each new consideration is intended to refine the previous corpus description into a more informative characterization. A guiding example, introduced with characterization 2 of Section 2.2.1, will be referenced throughout this section.

2.2.1 Two Characterizations from Last Lecture

1. The dimension of the set of all possible linear combinations of the d^{(i)}'s: Rank(D)

Rank(D) is the dimension of the span (the set of all possible linear combinations) of {d^{(i)}}_{i=1}^{n}; it can be thought of as a measure of the complexity of the corpus structure. Mathematically, the rank of a matrix is the number of its linearly independent columns; if the matrix is not of full rank, some columns are linear combinations of the others.
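The notation and the rank characterization can be made concrete in a short sketch; the three-document, four-term toy corpus below (vocabulary and counts) is invented purely for illustration:

```python
import numpy as np

# Hypothetical toy corpus: n = 3 documents over an m = 4 term vocabulary.
vocabulary = ["matrix", "rank", "corpus", "query"]
d1 = np.array([2.0, 1.0, 0.0, 0.0])   # d^(1): term counts for document 1
d2 = np.array([0.0, 1.0, 3.0, 0.0])   # d^(2)
d3 = np.array([2.0, 2.0, 3.0, 0.0])   # d^(3) = d^(1) + d^(2), deliberately

# The feature term-document matrix D: columns are the document vectors.
D = np.column_stack([d1, d2, d3])
print(D.shape)                   # (4, 3), i.e., m-by-n

# d^(3) is a linear combination of the other columns, so D is not of
# full rank: two columns suffice to describe all three documents.
print(np.linalg.matrix_rank(D))  # 2
```

Here numpy's `matrix_rank` plays the role of Rank(D); the drop from 3 to 2 reflects exactly the linear dependence built into d^(3).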
In the context of document vectors, the rank of D reveals how many of its columns suffice to describe all n columns representing the documents in the corpus. So if Rank(D) is "relatively" close to n, one might infer that the corpus is complex. Perhaps, then, we can consider the rank as a measure of variability among the documents, particularly since we are interested in developing geometric intuition.

Since this corpus characterization is a single real number, it is very succinct. However, desired information is lost. For instance, if the feature-document space represented by D is spanned by just two non-collinear document vectors d^{(1)} and d^{(2)}, then Rank(D) = 2 regardless of the configuration of the two vectors, and so the rank will not differentiate between the following two corpora:

[Figure: two two-document corpora, each of rank 2; the vectors d^{(1)} and d^{(2)} on the left are widely separated, while those on the right point in nearly the same direction.]

On the left-hand side of the diagram, the represented corpus is more spread out than the one represented on the right-hand side. In this sense the two corpora are different, and Rank(D) is not "fine-grained" enough to capture the distinction. Could there be a different characterization that is still succinct, yet retains more information about the corpus structure? Since Rank(D) indirectly considers the set of all possible linear combinations of a basis for the feature-document space (by describing its dimension), perhaps a subset of that set of linear combinations (henceforth denoted by a) could be more informative.

2. One Particular Set of Linear Combinations: a_0

To examine properties of a portion of span(D), we consider coefficient vectors

    α = (α_1, ..., α_n)^T ∈ R^n

associated with a restricted set of linear combinations of the d^{(i)}'s. Furthermore, we take the intuitively unusual perspective of D as an operator on these coefficient vectors.
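The operator perspective just described can be sketched directly: multiplying D by a coefficient vector α produces the linear combination Σ_i α_i d^{(i)} of the document columns. A minimal check, with an invented 2x3 matrix:

```python
import numpy as np

# An invented 2x3 term-document matrix: m = 2 terms, n = 3 documents.
D = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0]])

# An arbitrary coefficient vector alpha in R^3 (no restrictions yet).
alpha = np.array([2.0, -1.0, 0.5])

# D, viewed as an operator on alpha, yields the corresponding linear
# combination of its columns: sum_i alpha_i * d^(i).
combo = alpha[0] * D[:, 0] + alpha[1] * D[:, 1] + alpha[2] * D[:, 2]
assert np.allclose(D @ alpha, combo)
print(D @ alpha)   # [ 3.  -0.5]
```

As α ranges over all of R^n, D @ α ranges over span(D); the next step restricts α to a particular subset.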
Now we consider the special portion of R^n that corresponds to fractional assignments of weight to the columns of the feature term-document matrix. That is,

    a_0 = { α ∈ R^n : α_i ≥ 0 for all i, and Σ_{i=1}^{n} α_i = 1 }.

To guide our discussion, consider the case in which n = 3 and m = 2. Then

    a_0 = { α ∈ R^3 : α_i ≥ 0 for all i, and Σ_{i=1}^{3} α_i = 1 },

which is geometrically the triangle in R^3 with vertices (1,0,0), (0,1,0), and (0,0,1), as illustrated in the following diagram. Note: a_0 ⊂ a.

[Figure: the set a_0 for n = 3, the triangle spanned by the three standard basis vectors of R^3.]
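A sketch of membership in a_0 and of D acting on it, using an invented 2x3 matrix: D sends each vertex e_i of a_0 to the document vector d^{(i)}, so the image of a_0 under D is the set of convex combinations of the documents.

```python
import numpy as np

# An invented 2x3 term-document matrix: m = 2 terms, n = 3 documents.
D = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0]])

def in_a0(alpha: np.ndarray) -> bool:
    """Membership test for a_0: nonnegative entries summing to 1."""
    return bool(np.all(alpha >= 0) and np.isclose(alpha.sum(), 1.0))

# The vertices of a_0 are the standard basis vectors e_i, and D maps
# each vertex to the corresponding document column d^(i).
for i, e in enumerate(np.eye(3)):
    assert in_a0(e)
    assert np.allclose(D @ e, D[:, i])

# An interior point of a_0 maps to a convex combination of the documents:
# D @ alpha = 0.5*d^(1) + 0.25*d^(2) + 0.25*d^(3).
alpha = np.array([0.5, 0.25, 0.25])
assert in_a0(alpha)
print(D @ alpha)   # [1.   0.5]
```

This is why a_0 connects to characterization 2 in the overview: the image D(a_0) is exactly the convex hull of the column vectors of D.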

