UMD AMSC 663 - Information Retrieval Through Various Approximate Matrix Decompositions


Information Retrieval Through Various Approximate Matrix Decompositions

Kathryn Linehan, [email protected]
Advisor: Dr. Dianne O'Leary, [email protected]

Abstract

In short, information retrieval is extracting certain information from databases. The purpose of this project is to explore querying a document database. We investigate and analyze the use of various approximate matrix decompositions to accomplish this task.

1 Background

The world is full of information, which would not be useful to us if we did not have a way of organizing and extracting it. In short, information retrieval is extracting certain information from databases. One example of information retrieval is the web search engine Google. The main issues associated with information retrieval are storage space, speed, and how "good" the results are, where "good" is a task-specific measure. The objective of this project is to investigate querying a document database. A secondary, time-permitting objective is to investigate forming a multidocument summary.

1.1 Querying a Document Database

Given a document database, we want to be able to return documents that are relevant to given query terms. There are existing solutions to this information retrieval problem, such as literal term matching and latent semantic indexing (LSI). Background information and examples presented in this section are taken from [5].

1.1.1 Problem Formulation

In real systems, such as Google, this problem is formulated in terms of matrices. An m × n term-document matrix, A, is created "where entry a_ij represents the weight of term i in document j". Thus, each row of A represents a term and each column of A represents a document. Also, an m × 1 query vector, q, is created "where q_i represents the weight of term i in the query". Different weighting schemes for A and q are discussed in [5].

1.1.2 Literal Term Matching

In literal term matching, a relevance score is computed for each document as the inner product between q^T and the column of A that represents that document.
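As a concrete sketch (ours, not from the original report), this scoring step is just the matrix-vector product q^T A. Using the Mark Twain example data from [5]:

```python
import numpy as np

# Term-document matrix from the Mark Twain example in [5]:
# rows = terms (Mark, Twain, Samuel, Clemens, Purple, Fairy),
# columns = documents 1-4.
A = np.array([
    [15,  0,  0,  0],
    [15,  0, 20,  0],
    [ 0, 10,  5,  0],
    [ 0, 20, 10,  0],
    [ 0,  0,  0, 20],
    [ 0,  0,  0, 15],
])

# Query vector for the terms "Mark Twain".
q = np.array([1, 1, 0, 0, 0, 0])

# Literal term matching: one inner product per document, i.e. q^T A.
scores = q @ A
print(scores)  # [30  0 20  0], matching the Score row of the example
```

The highest entries of `scores` identify the documents to return.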
The highest scoring documents are then returned.

The original term-document matrix is generally sparse; thus, unfortunately, literal term matching may not return relevant documents that contain synonyms of the query terms but not the actual query terms. We present an example of this below.

Example: Literal Term Matching

                 Document
Term          1     2     3     4   Query
Mark         15     0     0     0     1
Twain        15     0    20     0     1
Samuel        0    10     5     0     0
Clemens       0    20    10     0     0
Purple        0     0     0    20     0
Fairy         0     0     0    15     0
Score        30     0    20     0

As seen in the above example, we have queried for the terms "Mark Twain". We notice that document 2 does not contain the terms "Mark Twain", but does contain the terms "Samuel Clemens" (who is the same person as Mark Twain) and is therefore relevant to the query. However, it has a relevance score of zero and thus is not returned as relevant. We would like to have a document querying system in which this problem is solved.

1.1.3 Latent Semantic Indexing

In latent semantic indexing (LSI), an approximation to the term-document matrix is used to compute document relevance scores. Using a matrix approximation can introduce nonzero entries, possibly revealing relationships between synonyms, and thus relevant documents that may not contain the exact query terms may be returned.

A commonly used approximate matrix decomposition in LSI is a rank-k singular value decomposition (rank-k SVD). Below, we present an example of LSI using a rank-2 approximation to the term-document matrix presented in the literal term matching example.

Example: Latent Semantic Indexing

                 Document
Term          1     2     3     4   Query
Mark        3.7   3.5   5.5     0     1
Twain      11.0  10.3  16.1     0     1
Samuel      4.1   3.9   6.1     0     0
Clemens     8.3   7.8  12.2     0     0
Purple        0     0     0    20     0
Fairy         0     0     0    15     0
Score      14.7  13.8  21.6     0

We see that in this example, by using LSI, document 2 now has a score of 13.8 and will therefore be returned as the third most relevant document, even though it does not contain the exact query terms "Mark Twain".
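As an illustrative sketch (ours, not part of the original report), a rank-2 approximation like the one in the LSI example can be formed by truncating NumPy's SVD to the two largest singular values and scoring documents against the approximated matrix:

```python
import numpy as np

# Same term-document matrix and query as in the literal term matching example.
A = np.array([
    [15.0,  0.0,  0.0,  0.0],
    [15.0,  0.0, 20.0,  0.0],
    [ 0.0, 10.0,  5.0,  0.0],
    [ 0.0, 20.0, 10.0,  0.0],
    [ 0.0,  0.0,  0.0, 20.0],
    [ 0.0,  0.0,  0.0, 15.0],
])
q = np.array([1.0, 1.0, 0.0, 0.0, 0.0, 0.0])

# Rank-2 SVD: keep only the two largest singular values and their vectors.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
A2 = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Score against the approximation instead of A itself.
scores = q @ A2
# Document 2 (index 1) now receives a nonzero score (about 13.8 in the
# example from [5]), even though it contains none of the query terms.
print(np.round(scores, 1))
```

Truncating the SVD mixes the "Mark Twain" and "Samuel Clemens" columns, which is exactly the synonym effect the section describes.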
This is an improvement over the relevance results of the literal term matching example.

In [5], a rank-k semidiscrete decomposition (SDD) is used in LSI. The rank-k SDD of an m × n matrix A is A_k = X_k D_k Y_k^T, where X_k is m × k, D_k is k × k, and Y_k is n × k. Each entry of X_k and Y_k is one of {−1, 0, 1}, and D_k is diagonal with positive entries. The SDD is discussed in detail in [5]. In [5], an implementation of the SDD written by Tamara G. Kolda and Dianne P. O'Leary is used to determine that LSI using the SDD performs as well as LSI using the SVD, but does so using less storage space and query time.

An objective of this project is to test the performance of other approximate matrix decompositions when used in LSI.

1.1.4 Performance Measurement

To determine how well a document retrieval system performs, we use two measurements of performance: precision and recall. We define the following variables:

    Retrieved = number of documents retrieved
    Relevant  = total number of documents relevant to the query
    RetRel    = number of retrieved documents that are relevant.

Precision is defined as

    P(Retrieved) = RetRel / Retrieved

and recall is defined as

    R(Retrieved) = RetRel / Relevant.

We see that precision is the percentage of retrieved documents that are relevant to the query, and recall is the percentage of relevant documents that have been retrieved.

1.2 Forming a Multidocument Summary

Another information retrieval problem is forming a multidocument summary. The formulation of this problem is similar to that of querying a document database; however, instead of a term-document matrix, we use a term-sentence matrix. In this case, each column of the term-sentence matrix corresponds to a sentence from a document included in the set to summarize.

One solution to this problem is to compute a dot product query score for each sentence in the term-sentence matrix and return the highest scoring sentences as those to be included in the multidocument summary.
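As a quick illustration (our own, not from [5]), the precision and recall measures defined in Section 1.1.4 can be computed directly from sets of retrieved and relevant document IDs:

```python
def precision_recall(retrieved, relevant):
    """Precision = RetRel / Retrieved, Recall = RetRel / Relevant,
    where RetRel is the number of retrieved documents that are relevant."""
    retrel = len(set(retrieved) & set(relevant))
    return retrel / len(retrieved), retrel / len(relevant)

# Hypothetical example: documents 1 and 3 are retrieved, while documents
# 1, 2, and 3 are the ones actually relevant to the query.
p, r = precision_recall([1, 3], [1, 2, 3])
print(p, r)  # precision 1.0 (every retrieved document is relevant),
             # recall about 0.667 (one relevant document was missed)
```

The trade-off is visible even in this toy case: retrieving fewer documents tends to raise precision at the cost of recall.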
A secondary, time-permitting objective of this project is to test the performance of approximate matrix decompositions in place of the term-sentence matrix in this process, where performance is measured in terms of how similar the computed summary is to human summaries.

Another existing solution to this problem is presented in [4]. In this solution, a multidocument summary is built by first performing single-document summaries using a hidden Markov model (HMM). The highest scoring sentences chosen by the HMM (for all documents in the set) are processed into a term-sentence matrix, which is then scaled and used in a pivoted QR

