Wright CS 707 - Applications of linear algebra in information retrieval and hypertext analysis - D1377494

Home> Schools> Wright State University> Counseling (CS) > CS 707> Applications of linear algebra in information retrieval and hypertext analysis

Wright CS 707 - Applications of linear algebra in information retrieval and hypertext analysis

School name Wright State University

Course Cs 707- Information Retrieval

Pages 9

Download Save

Unformatted text preview:

Applications of linear algebra in information retrieval and hypertext analysis Jon Kleinberg * Andrew Tomkinst 1 Overview Information retrieval is concerned with representing con- tent in a form that can be easily accessed by users with information needs [61, 651. A definition at this level of generality applies equally well to any index-based re- trieval system or database application; so let us focus the topic a little more carefully. Information retrieval, as a field, works primarily with highly unstructured con- tent, such as text documents written in natural lan- guage; it deals with information needs that are gener- ally not formulated according to precise specifications; and its criteria for success are based in large part on the demands of a diverse set of human users. Our purpose in this short article is not to provide a survey of the field of information retrieval - for this we refer the reader to texts and surveys such as [25, 29, 51, 60, 61, 62, 63, 65, 701. Rather, we wish to discuss some specific applications of techniques from linear al- gebra to information retrieval and hypertext analysis. In particular, we focus on spectral methods - the use of eigenvectors and singular vectors of matrices - and their role in these areas. After briefly introducing the use of vector-space mod- els in information retrieval [52, 651, we describe the ap- plication of the singular value decomposition to dimension- reduction, through the Latent Semantic Indexing tech- nique [14]. We contrast this with several other ap- proaches to clustering and dimension-reduction based on vector-space models. *Department of Computer Science, Cornell University, Ithaca NY 14853. Email: kleinberOcs.cornell.edu. Supported in part by an Alfred P. Sloan Research Fellowship and by NSF Faculty Early Career Development Award CCR-9701399. tIBM Almaden Research Center, San Jose CA 95120. Email: tomkinsOalmaden.ibm.com. We then turn to hyperlinked corpora - collections of documents with an underlying link structure. The emergence of the World Wide Web [2] has led to a surge of interest in the problem of information retrieval in such domains; we describe some approaches that apply spec- tral methods to link structures for information discovery tasks (e.g. [8, 431). Th ere are connections between this work and earlier work in sociology and citation analysis [24], and we discuss this as well. 2 Information retrieval and the vector space model The language of linear algebra made its appearance quite early in information retrieval, through the use of vector- space models [52, 651; these models have formed the ba- sis for information retrieval frameworks such as Salton’s SMART system (see e.g. [lo, 651). We begin with a set of d documents and a set of t terms. We model each document as a vector x in the t-dimensional space Rt - it has one coordinate for each term. The jth coor- dinate of x is a number that measures the association of the jth term with respect to the given document - it is generally defined to be 0 if the document does not contain the term, and non-zero otherwise. The problem of how to define the non-zero entries in such a vector is known as term-weighting, and it has been the subject of a large amount of work; see e.g. [17, 64, 65, 681. Perhaps the simplest formulation is to set xj = 1 if the jth term occurs at all in the docu- ment. More general approaches based on’term frequency and inverse document frequency take into account the number of times the term occurs in the document, and the total number of documents in the corpus in which the term occurs. The representation of documents by vectors in Eu- clidean space allows one to bring geometric methods to bear in analyzing them. At the simplest level, the representation naturally suggests numerical similarity metrics for documents, based on the Euclidean distance or the inner product. Again, many related metrics have been proposed, and we discuss one representative -the cosine measure [65] - that will help motivate some of 18.5‘the developments to follow. Let :C and y be two docu- ment vectors. We define their cosine similarity by the equation where the inner product x y is the standard vector dot product, defined as xi=, xiyi, and the norm in the denominator is defined as 1x1 = fi. We term this the cosine similarity because for any two unit vectors, it is simply the cosine of the angle between them. Numerical similarity metrics on documents suggest natural approaches for similarity-based indexing (e.g. [66]) - by representing textual queries as vectors and search- ing for their nearest neighbors in a collection of docu- ments - as well as for clustering (e.g. [40]). Of course, in any application with a large number of underlying terms, these vector operations are being carried out in a huge number of dimensions. Very high dimensionality can be a problem not only from the point of view of com- putational efficiency, but also because the large number of terms leads to sets of vectors with very sparse pat- terns of non-zeroes, in which relationships among terms (e.g. synonymy) can be difficult to detect or exploit. An effective method for reducing the dimension of the set of vect,ors, without seriously distorting their metric structure, offers the possibility of alleviating both these problems. We turn to this issue next, beginning with some fundamental background material from linear al- gebra. 3 Linear algebra, eigenvalues, and the singular value decomposition Given a set of d vectors representing a collection of d documents, we can construct a t x d matrix in which each document vector constitutes one of the columns. Our interest will be in transforming this matrix to one that has low rank; this will

View Full Document