Stanford LING 289 - Indexing by Latent Semantic Analysis

Contents
  Introduction
  Deficiencies of Current Automatic Indexing and Retrieval Methods
  Rationale of the Latent Semantic Indexing (LSI) Method
  TABLE 1
  SVD or Two-Mode Factor Analysis
  TABLE 2
  FIG. 1
  FIG. 2
  Tests of the SVD Latent Semantic Indexing (LSI) Method
  FIG. 3
  FIG. 4
  FIG. 5
  FIG. 6
  Summary of Results from LSI Analyses
  Conclusions and Discussion
  Appendix. SVD Numerical Example
  Acknowledgments
  References

Indexing by Latent Semantic Analysis

Scott Deerwester
Center for Information and Language Studies, University of Chicago, Chicago, IL 60637

Susan T. Dumais*, George W. Furnas, and Thomas K. Landauer
Bell Communications Research, 445 South St., Morristown, NJ 07960

Richard Harshman
University of Western Ontario, London, Ontario, Canada

*To whom all correspondence should be addressed.
Received August 26, 1987; revised April 4, 1988; accepted April 5, 1988.
© 1990 by John Wiley & Sons, Inc.
Journal of the American Society for Information Science, 41(6):391-407, 1990.

A new method for automatic indexing and retrieval is described. The approach is to take advantage of implicit higher-order structure in the association of terms with documents (“semantic structure”) in order to improve the detection of relevant documents on the basis of terms found in queries. The particular technique used is singular-value decomposition, in which a large term by document matrix is decomposed into a set of ca. 100 orthogonal factors from which the original matrix can be approximated by linear combination. Documents are represented by ca. 100 item vectors of factor weights. Queries are represented as pseudo-document vectors formed from weighted combinations of terms, and documents with supra-threshold cosine values are returned. Initial tests find this completely automatic method for retrieval to be promising.
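As a concrete (and deliberately tiny) illustration of the decomposition the abstract describes, the NumPy sketch below builds a term-by-document count matrix, takes its singular-value decomposition, and keeps only the k largest factors, so that each document becomes a k-item vector of factor weights. The matrix, the vocabulary, and k = 2 are invented for the example; this is a sketch of the general technique, not the authors' data or implementation.

    import numpy as np

    # Toy term-by-document count matrix X: rows are terms, columns are documents.
    # The vocabulary and counts below are invented purely for illustration.
    X = np.array([
        [1, 0, 0, 1, 0, 0, 0],   # "human"
        [1, 0, 1, 0, 0, 0, 0],   # "interface"
        [1, 1, 0, 0, 0, 0, 0],   # "computer"
        [0, 1, 1, 0, 1, 0, 0],   # "user"
        [0, 1, 1, 2, 0, 0, 0],   # "system"
        [0, 1, 0, 0, 0, 1, 0],   # "survey"
        [0, 0, 0, 0, 0, 1, 1],   # "trees"
        [0, 0, 0, 0, 0, 1, 1],   # "graph"
    ], dtype=float)

    # Singular-value decomposition: X = U @ diag(s) @ Vt.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)

    # Keep only the k largest factors. The abstract speaks of roughly 100
    # factors for a real collection; k = 2 is enough to show the idea here.
    k = 2
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

    # Each document is now a k-item vector of factor weights.
    doc_vectors = (np.diag(s_k) @ Vt_k).T        # shape: (n_documents, k)

    # The original matrix is approximated by the rank-k linear combination.
    X_hat = U_k @ np.diag(s_k) @ Vt_k
    print("rank-%d approximation error (Frobenius): %.3f"
          % (k, np.linalg.norm(X - X_hat)))

Discarding the smaller factors is what lets the rank-k approximation smooth over incidental differences in word choice while preserving the major patterns of term-document association.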
Introduction

We describe here a new approach to automatic indexing and retrieval. It is designed to overcome a fundamental problem that plagues existing retrieval techniques that try to match words of queries with words of documents. The problem is that users want to retrieve on the basis of conceptual content, and individual words provide unreliable evidence about the conceptual topic or meaning of a document. There are usually many ways to express a given concept, so the literal terms in a user’s query may not match those of a relevant document. In addition, most words have multiple meanings, so terms in a user’s query will literally match terms in documents that are not of interest to the user.

The proposed approach tries to overcome the deficiencies of term-matching retrieval by treating the unreliability of observed term-document association data as a statistical problem. We assume there is some underlying latent semantic structure in the data that is partially obscured by the randomness of word choice with respect to retrieval. We use statistical techniques to estimate this latent structure, and get rid of the obscuring “noise.” A description of terms and documents based on the latent semantic structure is used for indexing and retrieval.(1)

The particular “latent semantic indexing” (LSI) analysis that we have tried uses singular-value decomposition. We take a large matrix of term-document association data and construct a “semantic” space wherein terms and documents that are closely associated are placed near one another. Singular-value decomposition allows the arrangement of the space to reflect the major associative patterns in the data, and ignore the smaller, less important influences. As a result, terms that did not actually appear in a document may still end up close to the document, if that is consistent with the major patterns of association in the data. Position in the space then serves as the new kind of semantic indexing. Retrieval proceeds by using the terms in a query to identify a point in the space, and documents in its neighborhood are returned to the user.

Deficiencies of Current Automatic Indexing and Retrieval Methods

A fundamental deficiency of current information retrieval methods is that the words searchers use often are not the same as those by which the information they seek has been indexed. There are actually two sides to the issue; we will call them broadly synonymy and polysemy. We use synonymy in a very general sense to describe the fact that there are many ways to refer to the same object. Users in different contexts, or with different needs, knowledge, or linguistic habits will describe the same information using different terms. Indeed, we have found that the degree of variability in descriptive term usage is much greater than is commonly suspected. For example, two people choose the same main key word for a single well-known object less than 20% of the time (Furnas, Landauer, Gomez, & Dumais, 1987). Comparably poor agreement has been reported in studies of interindexer consistency (Tarr & Borko, 1974) and in the generation of search terms by either expert intermediaries (Fidel, 1985) or less experienced searchers (Liley, 1954; Bates, 1986). The prevalence of synonyms tends to decrease the “recall” performance of retrieval systems.

By polysemy we refer to the general fact that most words have more than one distinct meaning (homography). In different contexts or when used by different people the same term (e.g., “chip”) takes on varying referential significance. Thus the use of a term in a search query does not necessarily mean that a document containing or labeled by the same term is of interest. Polysemy is one factor underlying poor “precision.”

The failure of current automatic indexing to overcome these problems can be

(1) By “semantic structure” we mean here only the correlation structure in the way in which individual words appear in documents; “semantic” implies only the fact that terms in a document may be taken as referents to the document itself or to its topic.
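The Introduction above describes retrieval as using a query's terms to identify a point in the factor space and returning documents in its neighborhood. Continuing the sketch placed after the abstract, the fragment below shows one way that step can look: the query is folded into the factor space as a pseudo-document and documents above a cosine threshold are returned. The folding-in weighting (q times U_k times the inverse singular values) and the threshold value are illustrative assumptions of this sketch, not details given in this excerpt.

    import numpy as np

    def fold_in_query(query_counts, U_k, s_k):
        # Map a raw vector of term counts into the k-factor space as a
        # pseudo-document. This particular weighting is an assumption of
        # the sketch, chosen to be consistent with the abstract's
        # description of queries as weighted combinations of terms.
        return query_counts @ U_k @ np.diag(1.0 / s_k)

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    def retrieve(query_counts, U_k, s_k, doc_vectors, threshold=0.5):
        # Return indices of documents whose cosine with the query
        # pseudo-document exceeds the threshold (the "supra-threshold
        # cosine" rule of the abstract), ranked by similarity. The
        # threshold value here is arbitrary.
        q_k = fold_in_query(query_counts, U_k, s_k)
        scores = [cosine(q_k, d) for d in doc_vectors]
        return sorted((i for i, sc in enumerate(scores) if sc > threshold),
                      key=lambda i: -scores[i])

    # Example, reusing U_k, s_k, and doc_vectors from the earlier sketch:
    # a query containing the illustrative terms "human" and "computer".
    # q = np.zeros(X.shape[0]); q[0] = 1; q[2] = 1
    # print(retrieve(q, U_k, s_k, doc_vectors))

Because the query is compared to documents in the reduced factor space rather than by literal term overlap, a document can score highly even if it shares no terms with the query, provided their patterns of association are similar.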

