# Wright CS 707 - Applications of linear algebra in information retrieval and hypertext analysis (9 pages)

## Applications of linear algebra in information retrieval and hypertext analysis


- Pages: 9
- School: Wright State University
- Course: CS 707 - Information Retrieval


Jon Kleinberg\* and Andrew Tomkins†

## 1 Overview

Information retrieval is concerned with representing content in a form that can be easily accessed by users with information needs [61, 65]. A definition at this level of generality applies equally well to any index-based retrieval system or database application, so let us focus the topic a little more carefully. Information retrieval as a field works primarily with highly unstructured content, such as text documents written in natural language; it deals with information needs that are generally not formulated according to precise specifications; and its criteria for success are based in large part on the demands of a diverse set of human users.

Our purpose in this short article is not to provide a survey of the field of information retrieval; for this we refer the reader to texts and surveys such as [25, 29, 51, 60, 61, 62, 63, 65, 70]. Rather, we wish to discuss some specific applications of techniques from linear algebra to information retrieval and hypertext analysis. In particular, we focus on spectral methods (the use of eigenvectors and singular vectors of matrices) and their role in these areas. After briefly introducing the use of vector space models in information retrieval [52, 65], we describe the application of the singular value decomposition to dimension reduction through the Latent Semantic Indexing technique [14]. We contrast this with several other approaches to clustering and dimension reduction based on vector space models.

We then turn to hyperlinked corpora: collections of documents with an underlying link structure. The emergence of the World Wide Web [2] has led to a surge of interest in the problem of information retrieval in such domains; we describe some approaches that apply spectral methods to link structures for information discovery tasks (e.g., [8, 43]). There are connections between this work and earlier work in sociology and citation analysis [24], and we discuss this as well.
## 2 Information retrieval and the vector space model

The language of linear algebra made its appearance quite early in information retrieval, through the use of vector space models [52, 65]; these models have formed the basis for information retrieval frameworks such as Salton's SMART system (see, e.g., [10, 65]).

We begin with a set of $d$ documents and a set of $t$ terms. We model each document as a vector $x$ in the $t$-dimensional space $\mathbf{R}^t$: it has one coordinate for each term. The $j$th coordinate of $x$ is a number that measures the association of the $j$th term with the given document; it is generally defined to be 0 if the document does not contain the term, and non-zero otherwise. The problem of how to define the non-zero entries in such a vector is known as *term weighting*, and it has been the subject of a large amount of work (see, e.g., [17, 64, 65, 68]). Perhaps the simplest formulation is to set $x_j = 1$ if the $j$th term occurs at all in the document. More general approaches, based on term frequency and inverse document frequency, take into account the number of times the term occurs in the document and the total number of documents in the corpus in which the term occurs.

The representation of documents by vectors in Euclidean space allows one to bring geometric methods to bear in analyzing them. At the simplest level, the representation naturally suggests numerical similarity metrics for documents, based on the Euclidean distance or the inner product. Again, many related metrics have been proposed; we discuss one representative, the cosine measure [65], that will help motivate some of the developments to follow.

\* Department of Computer Science, Cornell University, Ithaca, NY 14853. Email: kleinber@cs.cornell.edu. Supported in part by an Alfred P. Sloan Research Fellowship and by NSF Faculty Early Career Development Award CCR-9701399.

† IBM Almaden Research Center, San Jose, CA 95120. Email: tomkins@almaden.ibm.com.
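The term-weighting schemes and the cosine measure just mentioned can be sketched in a few lines of Python. This is a minimal illustration, not code from the article: the three-document corpus, the whitespace tokenization, and the particular tf-idf formula `tf * log(N / df)` are all illustrative assumptions.

```python
import math

# Hypothetical toy corpus: d = 3 documents over a small term set.
docs = [
    "linear algebra in information retrieval",
    "spectral methods for link analysis",
    "linear algebra and spectral methods",
]

# One coordinate per term, so each document maps to a vector in R^t.
terms = sorted({w for d in docs for w in d.split()})

def tfidf_vector(doc):
    """tf-idf term weighting: 0 for absent terms, tf * log(N/df) otherwise."""
    words = doc.split()
    vec = []
    for term in terms:
        tf = words.count(term)                         # term frequency
        df = sum(term in d.split() for d in docs)      # document frequency
        vec.append(tf * math.log(len(docs) / df) if tf else 0.0)
    return vec

def cosine(x, y):
    """Cosine measure: inner product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = lambda v: math.sqrt(sum(a * a for a in v))
    return dot / (norm(x) * norm(y))

vecs = [tfidf_vector(d) for d in docs]
```

On this toy corpus, documents 0 and 2 share the terms "linear" and "algebra", so their cosine similarity is positive, while documents 0 and 1 share no terms and have cosine similarity 0.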
Let $x$ and $y$ be two document vectors. We define their cosine similarity by the equation

$$\frac{x \cdot y}{\|x\| \, \|y\|},$$

where the inner product $x \cdot y$ is the standard vector dot product, defined as $\sum_i x_i y_i$, and the norm in the denominator is defined as $\|x\| = \sqrt{x \cdot x}$. We term this the cosine similarity because, for any two unit vectors, it is simply the cosine of the angle between them.

Numerical similarity metrics on documents suggest natural approaches for similarity-based indexing (e.g., [66]), by representing textual queries as vectors and searching for their nearest neighbors in a collection of documents, as well as for clustering (e.g., [40]). Of course, in any application with a large number of underlying terms, these vector operations are being carried out in a huge number of dimensions. Very high dimensionality can be a problem not only from the point of view of computational efficiency, but also because the large number of terms leads to sets of vectors with very sparse patterns of non-zeroes, in which relationships among terms (e.g., synonymy) can be difficult to detect or exploit. An effective method for reducing the dimension of the set of vectors, without seriously distorting their metric structure, offers the possibility of alleviating both these problems. We turn to this issue next, beginning with some fundamental background material from linear algebra.

## 3 Linear algebra: eigenvalues and the singular value decomposition

A symmetric $n \times n$ matrix $M$ has $n$ real eigenvalues $\lambda_1(M), \ldots, \lambda_n(M)$, and one can choose corresponding eigenvectors $w_1(M), \ldots, w_n(M)$ so that the set of vectors $\{w_i(M)\}$ forms an orthonormal basis of $\mathbf{R}^n$: each is a unit vector, and each pair of them is orthogonal. We say that a matrix $Q$ is *orthogonal* if $Q^T Q = I$, where $Q^T$ denotes the transpose of $Q$ and $I$ represents the identity matrix, the diagonal matrix with all diagonal entries equal to 1. If $M$ is a symmetric $n \times n$ matrix, $\Lambda$ is the diagonal matrix with diagonal entries $\lambda_1(M), \lambda_2(M), \ldots, \lambda_n(M)$, and $Q$ is the matrix with columns equal to $w_1(M), \ldots, w_n(M)$, then it is easy to verify that $Q$ is an orthogonal matrix and $Q \Lambda Q^T = M$. Thus the eigenvalues and eigenvectors provide a useful *normal form*: a representation of symmetric square matrices in terms of orthogonal and diagonal matrices. In fact, there is a way to extend this type of normal form to matrices that are neither symmetric nor square, as we now discuss.
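The normal form $Q \Lambda Q^T = M$ for symmetric matrices can be checked numerically. A small sketch using NumPy; the particular $3 \times 3$ matrix is an arbitrary example, not one from the article:

```python
import numpy as np

# A small symmetric matrix M (arbitrary illustrative example).
M = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])

# eigh is specialized to symmetric (Hermitian) matrices: it returns real
# eigenvalues and a matrix Q whose columns are orthonormal eigenvectors.
eigenvalues, Q = np.linalg.eigh(M)
Lam = np.diag(eigenvalues)

# Q is orthogonal: Q^T Q = I ...
assert np.allclose(Q.T @ Q, np.eye(3))
# ... and Q Lam Q^T recovers M, the normal form described above.
assert np.allclose(Q @ Lam @ Q.T, M)
```

Since `M` is symmetric, `eigh` is the appropriate routine here; the general-purpose `np.linalg.eig` would not guarantee an orthonormal set of eigenvectors.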
**Theorem 1 (Singular Value Decomposition).** *Every $m \times n$ matrix $A$ can be written as $A = U \Sigma V^T$, where $U$ and $V$ are orthogonal and $\Sigma$ is diagonal.*

We refer to the diagonal entries of $\Sigma$ as the *singular values* of $A$. Using the SVD directly, we can write

$$A^T A = (U \Sigma V^T)^T (U \Sigma V^T) = V \Sigma^2 V^T.$$

Likewise, $A A^T = U \Sigma^2 U^T$. It follows that the columns of $U$ and $V$ represent the eigenvectors of $A A^T$ and $A^T A$, respectively.
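The identities $A^T A = V \Sigma^2 V^T$ and $A A^T = U \Sigma^2 U^T$ can likewise be verified numerically. A sketch using NumPy's `svd`; the random $4 \times 3$ matrix is an arbitrary example:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))   # a rectangular (non-symmetric) matrix

# With full_matrices=False, U is m x r and Vt is r x n (r = min(m, n)),
# and the singular values s are returned in non-increasing order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# The factorization reconstructs A.
assert np.allclose(U @ np.diag(s) @ Vt, A)

# A^T A = V Sigma^2 V^T, so the eigenvalues of A^T A are the squared
# singular values of A.
eigvals, _ = np.linalg.eigh(A.T @ A)
assert np.allclose(np.sort(eigvals), np.sort(s ** 2))

# Likewise each column of U is an eigenvector of A A^T with
# eigenvalue s[i] ** 2.
for i in range(len(s)):
    assert np.allclose((A @ A.T) @ U[:, i], (s[i] ** 2) * U[:, i])
```

Truncating $U$, $\Sigma$, and $V$ to the largest $k$ singular values is exactly the dimension-reduction step that Latent Semantic Indexing, discussed earlier, builds on.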
