CR MATH 45 - Matrices, Vector Spaces, and Information Retrieval
Steve Richards and Azuree Lovely
College of the Redwoods
http://online.redwoods.cc.ca.us/instruct/darnold/laproj/

Purpose

Classical methods of information storage and retrieval are inconsistent, and they lack the capacity to handle the volume of information that has arrived with digital libraries and the internet. The goal of this paper is to show how linear algebra, in particular the vector space model, can be used to retrieve information more efficiently.

The Need for Automated IR

In the past, documents were indexed by authors, titles, abstracts, key words, and subject classifications. Retrieving any one of those documents meant searching through a card catalogue by hand, a process colored by the opinions of the user. And if an abstract or key-word list was not provided, a professional indexer or cataloguer may have written one, introducing still more uncertainty. But today:

- 60,000 new books are printed annually in the United States.
- The Library of Congress maintains a collection of more than 17 million books and receives 7,000 new ones daily.
- There are currently 300 million web pages on the internet, and the average search engine acquires pointers to about 10 million of them daily.

Automated IR can handle much larger databases, and it does so without prejudice.

Complications with IR

- Language disparities between programmers and users.
- Complexities of language itself, such as polysemy and synonymy.
- Accuracy and inclusivity.
- Term or phrase weighting.

The Vector Space Model

We represent each document as a vector recording the relative frequency with which each term is used in that document. So the document "The Chevy Automobile: A Mechanical Marvel" will be indexed by the terms Chevy, auto, and mechanic(s). Terms are identified by their roots, and any derivation of a root will be matched. The document vector would be

$$v = (1, 1, 0, 0, 1)^T.$$

[Figure: the document vector v plotted in three-dimensional term space with axes x, y, z.]

An Example

Terms:

- T1: auto(mobile, motive)
- T2: Chevy
- T3: Ford
- T4: motor(s)
- T5: mechanic(s, al)

Documents:

- D1: The Chevy Automobile: A Mechanical Overview
- D2: Automobiles Inside and Out
- D3: The Ford Auto that Rivaled Chevy's Chevelle
- D4: A Mechanical Comparison of the Motors of Chevy and Ford
- D5: A Mechanical Look at the Motors in Chevy and Ford Automobiles

Now we describe our database by compiling the document vectors into the columns of a term-by-document matrix A, in which row i is the vector of term $T_i$ across the documents $D_1, \dots, D_5$:

$$A = \begin{bmatrix} 1 & 1 & 1 & 0 & 1 \\ 0 & 0 & 1 & 1 & 1 \\ 0 & 0 & 1 & 1 & 1 \\ 0 & 0 & 0 & 1 & 1 \\ 1 & 0 & 0 & 1 & 1 \end{bmatrix}.$$

To weight each term by its relevance to each document, and to prepare for query comparison, we normalize each column of the matrix to unit Euclidean length:

$$A = \begin{bmatrix} .7071 & 1 & .5774 & 0 & .4472 \\ 0 & 0 & .5774 & .5 & .4472 \\ 0 & 0 & .5774 & .5 & .4472 \\ 0 & 0 & 0 & .5 & .4472 \\ .7071 & 0 & 0 & .5 & .4472 \end{bmatrix}.$$
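The paper itself contains no code; the following is a minimal NumPy sketch of the construction just described, with illustrative variable names (A, A_norm) of our own. It builds the raw term-by-document matrix of the example and normalizes each document column to unit Euclidean length, reproducing the weighted matrix above.

```python
import numpy as np

# Raw term-by-document matrix from the example.
# Rows: T1 auto(mobile, motive), T2 Chevy, T3 Ford,
#       T4 motor(s), T5 mechanic(s, al).
# Columns: documents D1..D5.
A = np.array([
    [1, 1, 1, 0, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1],
    [1, 0, 0, 1, 1],
], dtype=float)

# Normalize each document column to unit Euclidean length,
# so raw counts become relative weights.
A_norm = A / np.linalg.norm(A, axis=0)
print(np.round(A_norm, 4))
```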
Query Comparison

A query by a user is represented as a vector in the same space. A user may query the database for "Chevy motors," in which case the query vector would be

$$q = (0, 1, 0, 1, 0)^T.$$

The vectors in the database closest to the query vector are returned as relevant, and closeness is measured by the cosine of the angle between each document vector $a_j$ and the query:

$$\cos\theta_j = \frac{a_j^T q}{\|a_j\|\,\|q\|},$$

where $\|a_j\|$ is the Euclidean norm $\sqrt{a_j^T a_j}$.

[Figure: query vector comparison, showing the query vector q against a document vector v in three-dimensional term space with axes x, y, z.]

A threshold must be set for the minimum acceptable value of $\cos\theta$ among the documents returned to the user. The cosines of the angles between the document vectors in the database and the query vector are 0, 0, .4083, .7071, and .6324. This query would return the fourth and fifth documents, but the second may be the best resource, and it is not returned. The rest of this paper is devoted to resolving this problem.
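Continuing the sketch (again NumPy with our own names; the 0.5 cutoff is an assumed threshold, since the paper does not fix one), the snippet below scores every document against the query for "Chevy motors" and keeps the documents that clear the threshold.

```python
import numpy as np

A = np.array([[1, 1, 1, 0, 1], [0, 0, 1, 1, 1], [0, 0, 1, 1, 1],
              [0, 0, 0, 1, 1], [1, 0, 0, 1, 1]], dtype=float)
A_norm = A / np.linalg.norm(A, axis=0)        # unit document columns

q = np.array([0, 1, 0, 1, 0], dtype=float)    # query: "Chevy motors"

# cos(theta_j) = a_j^T q / (||a_j|| ||q||) for each document column a_j.
cosines = (A_norm.T @ q) / (np.linalg.norm(A_norm, axis=0) * np.linalg.norm(q))
print(np.round(cosines, 4))    # approx [0, 0, 0.4083, 0.7071, 0.6325]

threshold = 0.5                # assumed relevance cutoff
print(np.flatnonzero(cosines > threshold) + 1)   # -> documents 4 and 5
```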
Rank Reduction Using QR Factorization

The next step is to make our system more efficient at handling large amounts of information. The first step in doing so is to remove excess information from the column space of A, information that adds no new insight to the database. We can do this by identifying and ignoring dependencies. Reducing the rank of our term-by-document matrix accomplishes this, and one method for doing so is QR factorization:

$$A = QR, \qquad Q:\ t \times t\ \text{orthogonal}, \qquad R:\ t \times d\ \text{upper triangular}.$$

The relationship A = QR says that the columns of A are linear combinations of the columns of Q; therefore the columns of Q form a basis for the column space of A. Returning to our example, the factors are

$$Q = \begin{bmatrix} .7071 & .7071 & 0 & 0 & 0 \\ 0 & 0 & .7071 & 0 & .7071 \\ 0 & 0 & .7071 & 0 & -.7071 \\ 0 & 0 & 0 & 1 & 0 \\ .7071 & -.7071 & 0 & 0 & 0 \end{bmatrix},
\qquad
R = \begin{bmatrix} 1 & .7071 & .4083 & .3536 & .6324 \\ 0 & .7071 & .4083 & -.3536 & 0 \\ 0 & 0 & .8166 & .7071 & .6324 \\ 0 & 0 & 0 & .5 & .4472 \\ 0 & 0 & 0 & 0 & 0 \end{bmatrix}.$$

The zero row in R says that the last column of Q contributes nothing to the column space of A and can be ignored. To reduce the rank of R further, we partition R into blocks,

$$R = \begin{bmatrix} R_{11} & R_{12} \\ 0 & R_{22} \end{bmatrix}, \qquad R_{22} = \begin{bmatrix} .5 & .4472 \\ 0 & 0 \end{bmatrix},$$

and set the trailing block $R_{22}$ to zero. Because doing so changes R by only about 30% in the Frobenius norm ($\|R_{22}\|_F / \|R\|_F = .6708 / 2.2361 \approx .30$), the new reduced-rank matrix $\tilde R$ could be a good approximation to R. By $\tilde A = Q\tilde R$, the new matrix is

$$\tilde A = \begin{bmatrix} .7071 & 1 & .5774 & 0 & .4472 \\ 0 & 0 & .5774 & .5 & .4472 \\ 0 & 0 & .5774 & .5 & .4472 \\ 0 & 0 & 0 & 0 & 0 \\ .7071 & 0 & 0 & .5 & .4472 \end{bmatrix}.$$

Calculating $\cos\theta$ between the query and the new matrix $\tilde A$ returns values of 0, 0, .4083, .4083, and .3953. The change in A was therefore too large. Sometimes this will be the case, which is why we need a better means of obtaining a low-rank approximation to the matrix A.
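The blocking step can also be sketched with NumPy's QR routine (an illustration, not the authors' code). Note that numpy.linalg.qr may choose different signs in Q and R than those printed above, but the product $Q\tilde R$, the 30% ratio, and the rebuilt matrix are unaffected; the cosines it prints for the reduced matrix may differ in the final digits from the values reported above, but the conclusion, that the D4 and D5 scores drop sharply, is the same.

```python
import numpy as np

A = np.array([[1, 1, 1, 0, 1], [0, 0, 1, 1, 1], [0, 0, 1, 1, 1],
              [0, 0, 0, 1, 1], [1, 0, 0, 1, 1]], dtype=float)
A_norm = A / np.linalg.norm(A, axis=0)

Q, R = np.linalg.qr(A_norm)                 # A_norm = Q @ R

# Zero the trailing 2x2 block R22 (rows/columns 4-5) to drop the rank.
R_tilde = R.copy()
R_tilde[3:, 3:] = 0.0

# Size of the discarded block relative to R (Frobenius norms): about 0.30.
print(np.linalg.norm(R[3:, 3:]) / np.linalg.norm(R))

# Rebuild the reduced-rank matrix and re-score the query.
A_tilde = Q @ R_tilde
q = np.array([0, 1, 0, 1, 0], dtype=float)
cosines = (A_tilde.T @ q) / (np.linalg.norm(A_tilde, axis=0) * np.linalg.norm(q))
print(np.round(cosines, 4))                 # D4 and D5 scores drop sharply
```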
Rank Reduction Using Singular Value Decomposition

QR factorization identifies dependencies among the columns of the matrix A, removing excess information from the system. However, dependencies in the row space must also be addressed. The singular value decomposition (SVD) is one method for (1) removing those dependencies, (2) producing a low-rank approximation to A, and (3) comparing terms to terms in the database:

$$A = U \Sigma V^T, \qquad U:\ t \times t\ \text{orthogonal}, \quad \Sigma:\ t \times d\ \text{diagonal}, \quad V:\ d \times d\ \text{orthogonal},$$

where the columns of U span the column space of A, the columns of V span the row space of A, and $\Sigma$ carries the singular values of A on its diagonal. We can now reduce the rank of A to

$$A_k = U_k \Sigma_k V_k^T$$

by setting all but the k largest singular values of A equal to zero. Returning to …
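As before, a minimal NumPy sketch with our own variable names (k = 3 is an assumed choice, matching the rank-3 reduction attempted with QR): it computes the full SVD and rebuilds the rank-k approximation from the k largest singular values. A standard property of the truncated SVD, which is what makes it the better low-rank approximation the paper is after, is that $A_k$ is the closest rank-k matrix to A, with spectral-norm error equal to the (k+1)-th singular value.

```python
import numpy as np

A = np.array([[1, 1, 1, 0, 1], [0, 0, 1, 1, 1], [0, 0, 1, 1, 1],
              [0, 0, 0, 1, 1], [1, 0, 0, 1, 1]], dtype=float)
A_norm = A / np.linalg.norm(A, axis=0)

# Full SVD: A_norm = U @ diag(s) @ Vt, singular values sorted descending.
U, s, Vt = np.linalg.svd(A_norm)

# Rank-k approximation A_k = U_k Sigma_k V_k^T: keep the k largest
# singular values and their singular vectors, discard the rest.
k = 3
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The spectral-norm error of the best rank-k approximation equals
# the (k+1)-th singular value (Eckart-Young).
print(np.round(s, 4))
print(np.isclose(np.linalg.norm(A_norm - A_k, 2), s[k]))   # True
```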