Unformatted text preview:

CS 502 Spring 2001 Guest Lecture Automated Information Retrieval William Y Arms 1 Types of Information Discovery media type text image video audio etc linking searching CS 502 natural language processing CS 474 2 automatic browsing catalogs indexes metadata user in loop This Lecture media type text image video audio etc linking searching CS 502 natural language processing CS 474 3 automatic browsing catalogs indexes metadata user in loop Automated Information Discovery Creating metadata records manually is labor intensive and hence expensive The aim of automated information discovery is for users to discover information without using skilled human effort to build indexes 4 Resources for Automated Information Discovery Computer power brute force computing ranking methods automatic generation of metadata The intelligence of the user browsing relevance feedback information visualization 5 Brute Force Computing Few people really understand Moore s Law Computing power doubles every 18 months Increases 100 times in 12 years Increases 10 000 times in 25 years Simple algorithms immense computing power may outperform human intelligence 6 Contrast with Old Fashioned Boolean Searching Traditional information retrieval uses Boolean retrieval where a document either matches a query exactly or not at all Encourages short queries Requires precise choice of index terms Requires precise formulation of queries professional training Modern retrieval systems rank documents depending on the likelihood that they satisfy the user s needs Encourages long queries to have as many dimensions as possible Benefits from large numbers of index terms Permits queries with many terms not all of which need match the document 7 Similarity Ranking Ranking methods using similarity Measure the degree of similarity between a query and a document or between two documents Basic technique is the vector space model with term weighting Similar Requests Documents Similar How similar is document to a request 8 Document Ranking Methods using document ranking Rank the documents using some algorithm to rank them in order of importance Best known technique is the Google PageRank algorithm 9 Vector Space Methods Concept n dimensional space where n is the total number of different terms words in a set of documents Each document is represented by a vector with magnitude in each dimension equal to the weighted number of times that the corresponding term appears in the document Similarity between two documents is the angle between their vectors Much of this work was carried out by Gerald Salton and colleagues in Cornell s computer science department 10 Example 1 Incidence Array terms in d1 ant ant bee terms in d2 bee hog ant dog terms in d3 cat gnu dog eel fox terms ant bee cat dog eel fox gnu hog d1 1 1 d2 1 1 d3 1 1 1 1 1 1 1 Weights tij 1 if document i contains term j and zero otherwise 11 Three Terms Represented in 3 Dimensions t3 d1 d2 t2 t1 12 Vector Space Revision x x1 x2 x3 xn is a vector in an n dimensional vector space Length of x is given by extension of Pythagoras s theorem x 2 x12 x22 x32 xn2 If x1 and x2 are vectors Inner product or dot product is given by x1 x2 x11x21 x12x22 x13x23 x1nx1n Cosine of the angle xbetween x the vectors x1 and x2 1 2 cos x1 x2 13 Example 1 continued Similarity of documents in example d1 14 d2 d3 d1 1 0 71 0 d2 0 71 1 0 22 d3 0 0 22 1 Reasons for Term Weighting Similarity using an incidence matrix measures the occurrences of terms but no other characteristics of the documents Terms are more useful for information retrieval if they appear several times in one document weighting by term frequency only appear in some documents weighting by document frequency appear in short document weighting by document length 15 Weighting Very simple approach basic tf idf wij fij dj Where wij is the weight given to term j in document i fij is the frequency with which term j appears in document i dj is the number of documents that contain term j 16 Term Frequency Concept A term that appears many times within a document is likely to be more important than a term that appears only once 17 Term Frequency Simple weighting to reflect term frequency Suppose term j appears fij times in document i i Let mi max fij i e mi is the maximum frequency of any term in document i Term frequency tf tfij fij mi 18 Example 2 Frequency Array terms in d1 ant ant bee terms in d2 bee hog ant dog terms in d3 cat gnu dog eel fox ant bee cat dog eel fox gnu hog d1 2 1 d2 1 1 d3 1 1 1 1 1 1 1 Weights tij frequency that term j occurs in document i 19 Example 2 continued Similarity of documents in example d1 d2 d3 d1 1 0 67 0 d2 0 67 1 0 22 d3 0 0 22 1 Similarity depends upon the weights given to the terms 20 Inverse Document Frequency Concept A term that occurs in a few documents is likely to be a better discriminator that a term that appears in most or all documents 21 Inverse Document Frequency For term j number of documents n document frequency number of documents in which term j occurs dj One possible measure is n dj But this over emphasizes small differences Therefore a more useful definition is Inverse document frequency IDF idfj log2 n dj 1 22 Weighting Weights of the following form perform well in a wide variety of circumstances Weight of term j in document i Term frequency Inverse document frequency The standard weighting scheme is wij tfij idfj fij mi log2 n dj 1 Experience shows that the effectiveness depends on characteristics of the collection There are few general improvements beyond this simple weighting scheme 23 PageRank Algorithm Google Concept The rank of a web page is higher if many pages link to it Links from highly ranked pages are given greater weight than links from less highly ranked pages 24 Page Ranks Citing page P1 P1 P2 P3 1 1 P2 Cited page 25 P5 P6 1 1 P3 1 P4 1 P5 1 1 2 1 1 1 P6 Number P4 1 1 4 2 2 1 2 Normalize by Number of Links from Page Citing page P1 P1 P2 P3 1 0 25 P2 Cited page 26 P5 P6 0 5 0 25 P3 0 5 P4 0 5 P5 0 5 0 25 2 0 5 0 5 0 5 P6 Number P4 1 0 25 4 2 2 0 5 2 B Calculating the Weights Initially all pages have weight 1 1 w1 1 1 1 1 1 27 Recalculate Iterate weights wk 1 Bwk 1 75 0 25 w2 Bw1 0 50 1 75 1 00 0 75 until iteration convergences Page Ranks Basic Algorithm Iteration converges when w Bw This w is the high order eigenvector of B It ranks the pages by links to them normalized by the number of citations from each page and weighted by the ranking of the cited pages 28 …


View Full Document

CORNELL CS 502 - Automated Information Retrieval

Loading Unlocking...
Login

Join to view Automated Information Retrieval and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Automated Information Retrieval and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?