Latent Semantic Indexing
Lecture 19, November 6, 2007
CS 674/INFO 630 Fall 2007
Lecturer: Lillian Lee
Scribes: Vladimir Barash, Stephen Purpura, Shaomei Wu
December 10, 2007

Contents: Background; Improving the Representation; LSI; LSI and Computational Complexity; Observations About LSI; Few-factor Representations; Finger Exercise; Question 1; Question 2; Deeper Thought Exercise; Question; Answer

1 Background

In the previous lecture, we discussed the Singular Value Decomposition (SVD) of the term-document matrix $D \in \mathbb{R}^{m \times n}$, where $n$ is the number of documents in the corpus and $m$ is the number of terms in the vocabulary. With the help of SVD (which is unique up to sign if the singular values are distinct), we can decompose an $m \times n$ term-document matrix into three special smaller matrices. The result is frequently abbreviated $D = U \Sigma V^T$, where
\[
U = \underbrace{\begin{bmatrix} \vec{u}_1 & \cdots & \vec{u}_r \end{bmatrix}}_{m \times r\ \text{(left singular vectors)}}, \qquad
\Sigma = \underbrace{\begin{bmatrix} \sigma_1 & & \\ & \ddots & \\ & & \sigma_r \end{bmatrix}}_{r \times r\ \text{(singular values)}}, \qquad
V^T = \underbrace{\begin{bmatrix} \leftarrow \vec{v}_1 \rightarrow \\ \vdots \\ \leftarrow \vec{v}_r \rightarrow \end{bmatrix}}_{r \times n\ \text{(right singular vectors)}},
\]
$r$ is the rank of $D$; the columns of $U$ and $V$ each form an orthonormal basis for their span; and the singular values are ordered such that $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r > 0$.

For the $n$-dimensional hypersphere of linear-combination coefficients $A_2 = \{\vec{\alpha} \in \mathbb{R}^n \mid \|\vec{\alpha}\|_2 = 1\}$, the hyperellipse $DA_2$, which has semiaxes given by the $\sigma_i \vec{u}_i$, succinctly approximates the convex hull of the document vectors (the $\vec{d}_i$'s) and their negatives. $DA_2$ represents a summary of the overall shape of the corpus, and it is sensitive to the document distribution.

The advantages of SVD include (1) reduced storage space and (2) reduced computational complexity when performing calculations on the three component matrices as opposed to on $D$.

Assuming $\sigma_1 \neq \sigma_2$, of special interest is the first singular vector $\vec{u}_1$, where
\[
\vec{u}_1 = \operatorname*{argmax}_{\vec{u} : \|\vec{u}\|_2 = 1} \; \sum_{i=1}^{n} \bigl( \|\vec{d}_i\|_2 \cos(\angle(\vec{d}_i, \vec{u})) \bigr)^2 .
\]
That is, $\vec{u}_1$ represents a direction of best fit for the corpus, determined in part by considering common directions among the $\vec{d}_i$ document vectors and in part by emphasizing the longest $\vec{d}_i$'s.

Assuming all the singular values are distinct, each subsequent $\vec{u}_i$ represents the "next best" fit to the corpus, in the sense of the equation above, if we first subtract from each $\vec{d}_i$ its orthogonal projection onto $\mathrm{span}(\{\vec{u}_1, \ldots, \vec{u}_{i-1}\})$.

2 Improving the Representation

The representation of a document corpus via SVD is already a vast improvement over its representation via an $m \times n$ term-document matrix. But what if we could transform this representation into an even more compact (albeit lossy) one? Consider a fixed $k < \mathrm{rank}(D)$ and the corresponding rank-$k$ matrix $\hat{D} = \hat{U} \hat{\Sigma} \hat{V}^T$, where
\[
\hat{U} = \underbrace{\begin{bmatrix} \vec{u}_1 & \cdots & \vec{u}_k \end{bmatrix}}_{m \times k}, \qquad
\hat{\Sigma} = \underbrace{\begin{bmatrix} \sigma_1 & & \\ & \ddots & \\ & & \sigma_k \end{bmatrix}}_{k \times k}, \qquad
\hat{V}^T = \underbrace{\begin{bmatrix} \leftarrow \vec{v}_1 \rightarrow \\ \vdots \\ \leftarrow \vec{v}_k \rightarrow \end{bmatrix}}_{k \times n}.
\]
By the Eckart-Young theorem, this $\hat{D}$ is the best approximation to $D$, with respect to both the matrix 2-norm and the Frobenius norm, among all matrices of rank $k$.
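The following is a minimal numpy sketch, not from the original notes, illustrating the two constructions above on an invented $5 \times 4$ toy term-document matrix (the matrix entries, the choice $k = 2$, and all variable names are assumptions made for illustration). It checks that $\vec{u}_1$ maximizes the best-fit objective of Section 1, attaining the value $\sigma_1^2$, and that the rank-$k$ truncation of Section 2 attains the Eckart-Young errors in the 2-norm and Frobenius norm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented toy term-document matrix: m = 5 terms, n = 4 documents.
D = np.array([
    [2., 0., 1., 0.],
    [1., 1., 0., 0.],
    [0., 2., 0., 1.],
    [0., 0., 3., 1.],
    [1., 0., 1., 2.],
])

# D = U @ diag(sigma) @ Vt, with sigma_1 >= sigma_2 >= ... > 0 as in the notes.
U, sigma, Vt = np.linalg.svd(D, full_matrices=False)

# Best-fit objective from Section 1:
# sum_i (||d_i||_2 cos(angle(d_i, u)))^2 = sum_i (d_i . u)^2 = ||D^T u||_2^2.
def fit(u):
    return np.sum((D.T @ u) ** 2)

print(np.isclose(fit(U[:, 0]), sigma[0] ** 2))   # u_1 attains the maximum, sigma_1^2
for _ in range(1000):                            # random unit vectors never do better
    u = rng.normal(size=D.shape[0])
    u /= np.linalg.norm(u)
    assert fit(u) <= sigma[0] ** 2 + 1e-9

# Rank-k truncation from Section 2 (k = 2 chosen arbitrarily for illustration).
k = 2
D_hat = U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k, :]

# Eckart-Young: the optimal rank-k errors are sigma_{k+1} (2-norm) and
# sqrt(sigma_{k+1}^2 + ... + sigma_r^2) (Frobenius norm).
print(np.isclose(np.linalg.norm(D - D_hat, 2), sigma[k]))
print(np.isclose(np.linalg.norm(D - D_hat, 'fro'), np.sqrt(np.sum(sigma[k:] ** 2))))
```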
3 LSI

LSI (Latent Semantic Indexing) [Deerwester et al. 1990] is based on the principle of applying the Eckart-Young theorem to a term-document matrix (i.e., creating an "Eckart-Young projection" of a term-document matrix) in order to represent a corpus. Similarly, LSA (Latent Semantic Analysis) refers to applications of the Eckart-Young theorem to data matrices in situations other than information retrieval.

The underlying assumption of LSI is that small singular values (and the associated $\vec{u}_i$'s) represent noise in the corpus, so that the original individual document vectors (the $\vec{d}_i$'s) are located in too many dimensions. This is certainly plausible, as individual terms exhibit correlations, and it is quite likely that not all terms in any given document are necessary to provide an accurate (from the point of view of IR) representation thereof. Finally, proponents of LSI claim that the geometric relationships among the $\hat{d}_i$ vectors are potentially better than those among the original $\vec{d}_i$ document vectors.

To restate, the goal is to obtain a better representation of the elements and their relationships to one another. Since the terms exhibit correlations, there may exist better document vectors in a $k$-dimensional space, $k \ll r$. The directions of the hyperellipse $DA_2$ (the $\vec{u}_i$'s, assuming distinct singular values) that have small lengths (small singular values) might correspond to noise (noisy features). If the small singular values do represent noise, then $\hat{D}$ gives new (and hopefully better) geometric relationships among the document vectors.

3.1 LSI and Computational Complexity

It is very important to note that LSI does not require storage of, or operations upon, the full $\hat{D}$ matrix. Consider a common measure of similarity between two document vectors, $\hat{d}_i \cdot \hat{d}_j$. Let us first rewrite $\hat{D}$:
\[
\hat{D} = \underbrace{\begin{bmatrix} \vec{u}_1 & \cdots & \vec{u}_k \end{bmatrix}}_{m \times k} \; \underbrace{\begin{bmatrix} \vec{\beta}_1 & \cdots & \vec{\beta}_n \end{bmatrix}}_{k \times n} = \hat{U} B, \qquad \text{where } B = \hat{\Sigma} \hat{V}^T .
\]
Then $\vec{\beta}_i$ is the vector of coordinates of $\hat{d}_i$ in the basis $\vec{u}_1, \ldots, \vec{u}_k$, and
\[
\hat{d}_i \cdot \hat{d}_j = \vec{\beta}_i \cdot \vec{\beta}_j ,
\]
so the actual computation of $\hat{D}$ is unnecessary.

In a similar vein, computing similarity measures (for example, cosine similarity) between some query and the documents in the corpus is a relatively simple matter. Consider the query vector $\vec{q}$. The cosine similarity between $\vec{q}$ and some $\vec{d}_i$ in $D$, as measured between their vector representations in $\mathbb{R}^m$, is
\[
\cos(\vec{d}_i, \vec{q}) = \frac{\vec{d}_i \cdot \vec{q}}{\|\vec{d}_i\|_2 \, \|\vec{q}\|_2} ,
\]
where $\|\vec{q}\|_2$ is document-independent and would not actually be computed.

Now, what about computing cosine similarity in the new $k$-dimensional subspace $S_U$, namely, the subspace with orthonormal basis $B_U \stackrel{\text{def}}{=} \{\vec{u}_1, \ldots, \vec{u}_k\}$? The $k$ coordinates of the orthogonal projection of $\vec{q}$ onto $S_U$ with respect to $B_U$ are given by the vector $\hat{\vec{q}} = (\vec{q} \cdot \vec{u}_1, \ldots, \vec{q} \cdot \vec{u}_k)^T$. So, by analogy with the equation above, we can assign a score between the projections of $\vec{q}$ and $\vec{d}_i$ onto $S_U$ as
\[
\frac{\vec{\beta}_i \cdot \hat{\vec{q}}}{\|\vec{\beta}_i\|_2}
\]
(note that one can pre-compute the $L_2$ norms of the $n$ projected document vectors).
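As a self-contained illustration of these shortcuts, here is a small numpy sketch, reusing the invented toy matrix from the earlier snippet; the query vector and variable names are likewise assumptions for illustration, not from the notes. It computes $\hat{d}_i \cdot \hat{d}_j$ from the $\vec{\beta}$'s alone and scores a query after projecting it onto $S_U$.

```python
import numpy as np

# Same invented toy term-document matrix and k as in the previous sketch.
D = np.array([
    [2., 0., 1., 0.],
    [1., 1., 0., 0.],
    [0., 2., 0., 1.],
    [0., 0., 3., 1.],
    [1., 0., 1., 2.],
])
k = 2
U, sigma, Vt = np.linalg.svd(D, full_matrices=False)
U_k = U[:, :k]                          # orthonormal basis u_1, ..., u_k of S_U

# B = Sigma_hat @ V_hat^T; column i is beta_i, the coordinates of d_hat_i
# in the basis u_1, ..., u_k.
B = np.diag(sigma[:k]) @ Vt[:k, :]

# d_hat_i . d_hat_j == beta_i . beta_j, since U_k has orthonormal columns;
# D_hat is formed here only to verify the identity.
D_hat = U_k @ B
i, j = 0, 2
assert np.isclose(D_hat[:, i] @ D_hat[:, j], B[:, i] @ B[:, j])

# Query scoring in S_U: q_hat = U_k^T q gives the k coordinates of the
# projection of q; document i is scored by (beta_i . q_hat) / ||beta_i||_2.
# The ||q_hat||_2 factor is the same for every document, so it is omitted,
# and the beta norms can be pre-computed once per corpus.
q = np.array([1., 0., 0., 1., 1.])      # invented query over the m = 5 terms
q_hat = U_k.T @ q
beta_norms = np.linalg.norm(B, axis=0)  # n pre-computable norms
scores = (B.T @ q_hat) / beta_norms
print(np.argsort(-scores))              # documents ranked by LSI score
```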
3.2 Observations About LSI

In general, LSI is a technique for (hopefully/partially) capturing term co-occurrence patterns within a corpus. These co-occurrences can correspond to conceptual similarity, e.g. between the terms "stochastic" and "probabilistic", or to collocations, e.g. the terms "humpty" and "dumpty". It is not clear whether either of these co-occurrence types is preferable
