Indexing by Latent Semantic Analysis

Scott Deerwester
Center for Information and Language Studies, University of Chicago, Chicago, IL 60637

Susan T. Dumais, George W. Furnas, and Thomas K. Landauer
Bell Communications Research, 445 South St., Morristown, NJ 07960

Richard Harshman
University of Western Ontario, London, Ontario, Canada

To whom all correspondence should be addressed.
Received August 26, 1987; revised April 4, 1988; accepted April 5, 1988.
© 1990 by John Wiley & Sons, Inc.
JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE. 41(6):391-407, 1990. CCC 0002-6231/90/060391-17$04.00

A new method for automatic indexing and retrieval is described. The approach is to take advantage of implicit higher-order structure in the association of terms with documents ("semantic structure") in order to improve the detection of relevant documents on the basis of terms found in queries. The particular technique used is singular value decomposition, in which a large term by document matrix is decomposed into a set of ca. 100 orthogonal factors from which the original matrix can be approximated by linear combination. Documents are represented by ca. 100 item vectors of factor weights. Queries are represented as pseudo-document vectors formed from weighted combinations of terms, and documents with supra-threshold cosine values are returned. Initial tests find this completely automatic method for retrieval to be promising.

By "semantic structure" we mean here only the correlation structure in the way in which individual words appear in documents; "semantic" implies only the fact that terms in a document may be taken as referents to the document itself or to its topic.

Introduction

We describe here a new approach to automatic indexing and retrieval. It is designed to overcome a fundamental problem that plagues existing retrieval techniques that try to match words of queries with words of documents. The problem is that users want to retrieve on the basis of conceptual content, and individual words provide unreliable evidence about the conceptual topic or meaning of a document. There are usually many ways to express a given concept, so the literal terms in a user's query may not match those of a relevant document. In addition, most words have multiple meanings, so terms in a user's query will literally match terms in documents that are not of interest to the user.

The proposed approach tries to overcome the deficiencies of term-matching retrieval by treating the unreliability of observed term-document association data as a statistical problem. We assume there is some underlying latent semantic structure in the data that is partially obscured by the randomness of word choice with respect to retrieval. We use statistical techniques to estimate this latent structure and get rid of the obscuring noise. A description of terms and documents based on the latent semantic structure is used for indexing and retrieval.

The particular latent semantic indexing (LSI) analysis that we have tried uses singular value decomposition. We take a large matrix of term-document association data and construct a semantic space wherein terms and documents that are closely associated are placed near one another. Singular value decomposition allows the arrangement of the space to reflect the major associative patterns in the data and ignore the smaller, less important influences. As a result, terms that did not actually appear in a document may still end up close to the document, if that is consistent with the major patterns of association in the data. Position in the space then serves as the new kind of semantic indexing. Retrieval proceeds by using the terms in a query to identify a point in the space, and documents in its neighborhood are returned to the user.

Deficiencies of Current Automatic Indexing and Retrieval Methods

A fundamental deficiency of current information retrieval methods is that the words searchers use often are not the same as those by which the information they seek has been indexed. There are actually two sides to the issue; we will call them broadly synonymy and polysemy. We use synonymy in a very general sense to describe the fact that there are many ways to refer to the same object. Users in different contexts, or with different needs, knowledge, or linguistic habits, will describe the same information using different terms. Indeed, we have found that the degree of variability in descriptive term usage is much greater than is commonly suspected. For example, two people choose the same main key word for a single well-known object less than 20% of the time (Furnas, Landauer, Gomez, & Dumais, 1987). Comparably poor agreement has been reported in studies of interindexer consistency (Tarr & Borko, 1974) and in the generation of search terms by either expert intermediaries (Fidel, 1985) or less experienced searchers (Liley, 1954; Bates, 1986). The prevalence of synonyms tends to decrease the recall performance of retrieval systems. By polysemy we refer to the general fact that most words ...

... remains to be determined. Not only is there a potential issue of ambiguity and lack of precision, but the problem of identifying index terms that are not in the text of documents grows cumbersome. This was one of the motives for the approach to be described here.

The second factor is the lack of an adequate automatic method for dealing with polysemy. One common approach is the use of controlled vocabularies and human intermediaries to act as translators. Not only is this solution extremely expensive, but it is not necessarily effective. Another approach is to allow Boolean intersection or coordination with other terms to disambiguate meaning. Success is severely hampered by users' inability to think of appropriate limiting terms if they do exist, and by the fact that such terms may not occur in the documents or may not have been included in the indexing.

The third factor is somewhat more technical, having to do with the way in which current automatic indexing and retrieval systems actually work. In such systems each word type is treated as independent of any other (see, for example, van Rijsbergen, 1977). Thus matching (or not) both of two terms that almost always occur together is counted as heavily as matching two that are rarely found in the same document. Thus the scoring of success, in either straight Boolean or coordination level searches, fails to take redundancy into account, and as a result may distort results to an unknown degree. This problem exacerbates a user's difficulty in using compound-term queries effectively to expand or limit a search.
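The retrieval procedure summarized in the abstract (decompose a term by document matrix, keep a small number of factors, fold the query in as a pseudo-document, and return documents whose cosine exceeds a threshold) can be sketched in a few lines of Python with NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the five-term, four-document matrix, the function names lsi_fit and lsi_query, the choice of k = 2 (the paper speaks of ca. 100 factors), the threshold of 0.5, and the decision to scale both documents and the query by the singular values before taking cosines are all choices made for the example.

```python
import numpy as np

def lsi_fit(X, k):
    """Truncated SVD of a term-by-document matrix X (rows: terms, cols: docs).

    Returns the k leading term factors and one k-dimensional vector per
    document. Scaling document vectors by the singular values is an
    assumption of this sketch.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    U_k = U[:, :k]
    doc_vecs = Vt[:k, :].T * s[:k]
    return U_k, doc_vecs

def lsi_query(q, U_k, doc_vecs, threshold=0.5):
    """Fold a query term vector into the factor space and rank by cosine.

    q @ U_k places the query in the same scaled coordinates as doc_vecs,
    i.e. it is treated as a pseudo-document.
    """
    q_vec = q @ U_k
    norms = np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec)
    cosines = (doc_vecs @ q_vec) / np.where(norms == 0, 1.0, norms)
    return np.flatnonzero(cosines >= threshold), cosines

# Hypothetical toy corpus (terms x documents), raw term counts.
terms = ["car", "automobile", "engine", "flower", "petal"]
X = np.array([
    [1, 0, 0, 0],   # car
    [0, 1, 0, 0],   # automobile
    [1, 1, 0, 0],   # engine
    [0, 0, 1, 1],   # flower
    [0, 0, 1, 0],   # petal
], dtype=float)

U_k, doc_vecs = lsi_fit(X, k=2)

# Query containing only the term "car".
q = np.zeros(len(terms))
q[terms.index("car")] = 1.0

retrieved, cosines = lsi_query(q, U_k, doc_vecs)
print("retrieved documents:", retrieved.tolist())
print("cosines:", np.round(cosines, 3).tolist())
```

In this toy corpus document 1 contains "automobile" and "engine" but not "car", so literal term matching would miss it. Because "car" and "automobile" both co-occur with "engine", the reduced space places document 1 next to document 0, and the query retrieves both. This is the synonymy effect the paper sets out to address, shown here only on made-up data.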