CORNELL CS 630 - Matrix-Theoretic Corpus Characterizations

CS630 Representing and Accessing Digital Information
Lecture 16: March 30, 2006

Matrix-Theoretic Corpus Characterizations
Scribed by Gilly Leshed, Blazej Kot

This lecture is the start of a round-about introduction to Latent Semantic Indexing (LSI), via first explaining the ideas and the apparent magic underlying the Singular Value Decomposition (SVD).

1 Implicit Feedback and Support Vector Machines Recap

We addressed using implicit feedback in a framework in which the available information is represented as points in a high-dimensional space. The dimensions of the space were the attributes of the vectors, where each vector represents a document-query dyad:

• Axis in the space ≈ feature / variable
• Vector component ≈ value of the variable for the item under representation

Implicit feedback created specific constraints, for example, that document d is more relevant than document d' for a given query q. The "magic" weight vector $\vec{w}$ gives the preferred "direction" with respect to relevance to queries: vectors lying farther along its direction are more preferred.

[Figure: the points $\Phi(q,d)$ and $\Phi(q,d')$ and the weight vector $\vec{w}$ in feature space. In this example, the horizontal axis could be, say, (# Finnish terms - # English terms), so points can take negative values as well; the vertical axis could be the value of $\cos(\vec{q}, \vec{d})$.]

The points in the space do not depend on the feedback information implicitly extracted from the user, but they are query-dependent; thus, classification is based on knowing the user's query. However, finding the weight vector $\vec{w}$ does depend on the feedback information, as discussed in the previous lecture.

The question that therefore arises is whether, for general purposes, it might be useful to find a characterization of the corpus even without knowing the query. This could give us some indication of the structure of the corpus, or of some of its inherent properties. The idea is that we could then wisely select the dimensions of the vector space, i.e. the features along which we characterize the corpus, even without knowing the queries. We begin by representing the corpus as a feature-document matrix, presented as a term-document matrix in the next section.

2 Term-Document Matrix

Assume the corpus consists of n documents, each of which can be represented by an m-element feature vector, $\vec{d}^{(1)}, \ldots, \vec{d}^{(n)}$. We can combine these by writing them as the columns of a matrix:

    D = [ \vec{d}^{(1)} \;\; \cdots \;\; \vec{d}^{(n)} ]

We will refer to the elements of this matrix as $D_j^{(i)}$, where the superscript i refers to the matrix column, i.e. the document index, and the subscript j refers to the matrix row, i.e. the feature index.

Each element of a document vector corresponds to the value of one document feature. These features can be as simple as the number of times a vocabulary term occurs in the document; hence the label Term-Document Matrix.

3 Measuring Corpus Spread

Given this representation, we can now ask what the corpus looks like. In other words, how are the documents "spread" around the space? Are they evenly spread around, indicating that the corpus is less structured, or are they laid out in small clusters of related documents? This spread indicates the degree to which the corpus is varied in the vector space. Note that the spread depends greatly on the selected features, which could be other than term occurrences in documents; selecting certain features could increase or decrease the spread of the corpus.
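As a concrete illustration of the term-document matrix just defined, here is a minimal sketch in Python with NumPy. This is not from the lecture; the toy corpus, the vocabulary construction, and the use of raw term counts as features are all assumptions made for illustration.

    import numpy as np

    # Toy corpus: n = 3 documents (hypothetical).
    docs = ["the cat sat on the mat",
            "the dog sat",
            "cat and dog"]

    # Vocabulary: m features, one per distinct term.
    vocab = sorted({w for doc in docs for w in doc.split()})

    # D[j, i] = number of times term j occurs in document i,
    # so documents are the columns, matching D = [d^(1) ... d^(n)].
    D = np.array([[doc.split().count(term) for doc in docs]
                  for term in vocab])

    print(D.shape)  # (m, n)

Each column of D is one document vector $\vec{d}^{(i)}$; each row holds one feature (here, a term count) across all documents.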
One approach to this question is to look at the span of the document vectors:

    S = \mathrm{span}(\{\vec{d}^{(1)}, \ldots, \vec{d}^{(n)}\})

We can then consider the dimensionality of the span, $\dim(\mathrm{span}(\{\vec{d}^{(1)}, \ldots, \vec{d}^{(n)}\}))$, as a measure of the "complexity" of the corpus. Note that dim(S) is the rank of the term-document matrix, and recall from linear algebra that

    r = \dim(S) = \mathrm{rank}(D) = \mathrm{rank}(D^T).

It may be hard to see why the rank of D equals the rank of $D^T$, especially because the two matrices represent different things:

• D has n columns of documents, and shows how the values are distributed within each document across the features.
• $D^T$ has m columns of features, and shows how the values are distributed within each feature across the documents.

r = 1

To gain an understanding of what r = rank(D) measures, let us first consider the case r = 1. In this case there exists a basis vector $\vec{b} \in \mathbb{R}^m$ such that each document $\vec{d}^{(i)}$ in the corpus can be represented as a scalar multiple of this basis vector:

    \vec{d}^{(i)} = \alpha_i \vec{b}   for some \alpha_i \in \mathbb{R}.

This means that there is a single direction in the corpus along which all document vectors lie. Furthermore, if we require that $\|\vec{b}\|_2 = 1$, then the basis vector is unique, up to sign.[1]

[1] We saw earlier how features can have negative values, if we assume that they are not necessarily term counts.

1 ≤ r ≤ min(m, n)

In the general case, 1 ≤ r ≤ min(m, n); the upper limit holds because the rank of a matrix cannot exceed the smaller of its number of rows and columns. Here there exist r basis vectors $\vec{b}^{(1)}, \ldots, \vec{b}^{(r)} \in \mathbb{R}^m$ such that each document in the corpus can be represented as a linear combination of them:

    \vec{d}^{(i)} = \sum_{l=1}^{r} \alpha_l^{(i)} \vec{b}^{(l)}   for some \alpha_l^{(i)} \in \mathbb{R}.

The basis vectors $\vec{b}^{(1)}, \ldots, \vec{b}^{(r)} \in \mathbb{R}^m$ are orthonormal, i.e. each is unit-length in the 2-norm and they are mutually orthogonal. Notice that, given $\vec{b}^{(1)}, \ldots, \vec{b}^{(r)}$, each document can now be represented by r scalar values (the $\alpha_l^{(i)}$'s) that linearly combine the basis vectors to give rise to the document vector. So to represent a document, instead of a vector with m scalar elements we could in principle use a vector with r scalar elements:

    \vec{d}^{(i)} = \sum_{l=1}^{m} d_l^{(i)} \vec{e}^{(l)} = \sum_{l=1}^{r} \alpha_l^{(i)} \vec{b}^{(l)},

where $\vec{e}^{(l)}$ is the l-th elementary basis vector:

    \vec{e}^{(l)} = (0, \ldots, 0, 1, 0, \ldots, 0)^T   (m elements, with the l-th element equal to 1).

Thereby we see that we can reduce the dimensionality of the vector space chosen to represent the documents from m to r without loss of information. This is known as dimension reduction. Note that each vector in $[\vec{b}^{(1)}, \ldots, \vec{b}^{(r)}]$ is still m elements long; however, as indicated by the equality above, the number of basis vectors required to represent each $\vec{d}^{(i)}$ was reduced from m to r.
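The equality rank(D) = rank(D^T) is easy to check numerically. Below is a small sketch, not from the notes, that builds a matrix of known rank r = 2 as an explicit combination of two basis directions, in the spirit of the construction above, and verifies that D and D^T have the same rank. The matrix sizes and random entries are arbitrary choices for illustration.

    import numpy as np

    rng = np.random.default_rng(0)

    # Construct D = B A with r = 2: every document (column of D) is a
    # linear combination of the two columns of B, so rank(D) = 2
    # (almost surely, for random entries).
    B = rng.standard_normal((5, 2))   # m = 5 features, r = 2 directions
    A = rng.standard_normal((2, 7))   # coefficients for n = 7 documents
    D = B @ A

    print(np.linalg.matrix_rank(D))    # 2
    print(np.linalg.matrix_rank(D.T))  # 2, the same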
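The dimension reduction itself can also be sketched in code. One convenient way to obtain r orthonormal basis vectors for the span of the documents is the SVD, which foreshadows where the lecture is heading, though the notes above do not commit to any particular construction. Everything below, including the random rank-2 test matrix, is illustrative only.

    import numpy as np

    rng = np.random.default_rng(1)
    # A rank-2 term-document matrix with m = 5 features, n = 7 documents.
    D = rng.standard_normal((5, 2)) @ rng.standard_normal((2, 7))

    r = np.linalg.matrix_rank(D)
    U, s, Vt = np.linalg.svd(D)
    Basis = U[:, :r]           # r orthonormal vectors b^(1), ..., b^(r) in R^m

    # Each document is now r scalars alpha_l^(i) instead of m components.
    alpha = Basis.T @ D        # shape (r, n)

    # Reconstructing from the alphas loses no information.
    D_again = Basis @ alpha
    print(np.allclose(D, D_again))  # True

Here alpha[:, i] holds the r coefficients $\alpha_l^{(i)}$ of document i, and Basis @ alpha[:, i] recovers the full m-element vector $\vec{d}^{(i)}$ exactly, as in the equality above.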

