U of I CS 511 - Text Databases & Information Retrieval: Part I - D2205926

Home> Schools> University of Illinois> Computer Science (CS) > CS 511> Text Databases & Information Retrieval: Part I

DOC PREVIEW

U of I CS 511 - Text Databases & Information Retrieval: Part I

School name University of Illinois

Course Cs 511- Artificial Intelligence II

Pages 34

This preview shows page 1-2-16-17-18-33-34 out of 34 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 34 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

Lecture 15 Text Databases Information Retrieval Part I Oct 18 2006 ChengXiang Zhai CS511 Advanced Database Management Systems 1 The Special Role of Textual Information The most natural way of encoding knowledge Think about scientific literature The most common type of information How much textual information do you produce and consume every day The most basic form of information It can be used to describe other media of information Most natural to query CS511 Advanced Database Management Systems 2 What is Text Retrieval TR There exists a collection of text documents User gives a query to express the information need A retrieval system returns relevant documents to users More often called information retrieval or IR But IR tends to be much broader May include non textual information May include text categorization or summarization CS511 Advanced Database Management Systems 3 TR vs Database Retrieval Information Unstructured free text vs structured data Ambiguous vs well defined semantics Query Ambiguous vs well defined semantics Incomplete vs complete specification Answers Relevant documents vs matched records TR is an empirically defined problem CS511 Advanced Database Management Systems 4 History of TR on One Slide Birth of TR 1945 V Bush s article As we may think 1957 H P Luhn s idea of word counting and matching Indexing Evaluation Methodology 1960 s Smart system G Salton s group Cranfield test collection C Cleverdon s group Indexing automatic can be as good as manual TR Models 1970 s 1980 s Large scale Evaluation Applications 1990 s TREC D Harman E Voorhees NIST Google CS511 Advanced Database Management Systems 5 Formal Formulation of TR Vocabulary V w w w of language Query q q q where q V Document d d d where d V Collection C d d Set of relevant documents R q C 1 1 i 2 N m i i1 imi 1 k ij Generally unknown and user dependent Query is a hint on which doc is in R q Task compute R q an approximate R q CS511 Advanced Database Management Systems 6 Computing R q Strategy 1 Document selection R q d C f d q 1 where f d q 0 1 is an indicator function or classifier System must decide if a doc is relevant or not absolute relevance Strategy 2 Document ranking R q d C f d q where f d q is a relevance measure function is a cutoff System must decide if one doc is more likely to be relevant than another relative relevance CS511 Advanced Database Management Systems 7 Document Selection vs Ranking True R q CS511 Advanced Database Management Systems 1 Doc Selection f d q Doc Ranking f d q 0 R q 0 98 d1 0 95 d2 0 83 d3 0 80 d4 0 76 d5 0 56 d6 0 34 d7 0 21 d8 0 21 d9 R q 8 Problems of Doc Selection The classifier is unlikely accurate Over constrained query terms are too specific no relevant documents found Under constrained query terms are too general over delivery It is extremely hard to find the right position between these two extremes Even if it is accurate all relevant documents are not equally relevant CS511 Advanced Database Management Systems 9 Ranking is often preferred Relevance is a matter of degree A user can stop browsing anywhere so the boundary is controlled by the user High recall users would view more items High precision users would view only a few justification Probability Ranking Principle Theoretical Robertson 77 CS511 Advanced Database Management Systems 10 Probability Ranking Principle Robertson 77 As stated by Cooper If a reference retrieval system s response to each request is a ranking of the documents in the collections in order of decreasing probability of usefulness to the user who submitted the request where the probabilities are estimated as accurately a possible on the basis of whatever data made available to the system for this purpose then the overall effectiveness of the system to its users will be the best that is obtainable on the basis of that data Robertson provides two formal justifications Assumptions Independent relevance and sequential browsing CS511 Advanced Database Management Systems 11 According to the PRP all we need is A relevance measure function f which satisfies For all q d1 d2 f q d1 f q d2 iff p Rel q d1 p Rel q d2 Most IR research is centered on finding a good f CS511 Advanced Database Management Systems 12 The Notion of Relevance Relevance Rep q Rep d Similarity Different rep similarity P d q or P q d Probabilistic inference P r 1 q d r 0 1 Probability of Relevance Regression Model Fox 83 Generative Model Doc generation Vector space Prob distr model model Salton et al 75 Wong Yao 89 Query generation Different inference system Prob concept space model Wong Yao 95 Classical LM prob Model approach Robertson Ponte Croft 98 Sparck Jones 76 Lafferty Zhai 01a CS511 Advanced Database Management Systems Inference network model Turtle Croft 91 13 Model 1 Relevance Similarity Assumptions Query and document are represented similarly A query can be regarded as a document Relevance d q similarity d q R q d C f d q f q d Rep q Rep d Key issues How to represent query document How to define the similarity measure CS511 Advanced Database Management Systems 14 Vector Space Model Represent a doc query by a term vector Term basic concept e g word or phrase Each term defines one dimension N terms define a high dimensional space Element of vector corresponds to term weight E g d x1 xN xi is importance of term i Measure relevance by the distance between the query vector and document vector in the vector space CS511 Advanced Database Management Systems 15 VS Model illustration Starbucks D9 D11 D2 D5 D3 D10 D4 D6 Query D7 D8 Microsoft CS511 Advanced Database Management Systems Java D1 16 What the VS model doesn t say How to define select the basic concept Concepts are assumed to be orthogonal How to assign weights Weight in query indicates importance of term Weight in doc indicates how well the term characterizes the doc How to define the similarity distance measure CS511 Advanced Database Management Systems 17 What s a good basic concept Orthogonal Linearly independent basis vectors Non overlapping in meaning No ambiguity Weights can be assigned automatically and hopefully accurately Many possibilities Words stemmed words phrases latent concept CS511 Advanced Database Management Systems 18 How to Assign Weights Very very important Why weighting Query side Not all terms are equally important Doc side Some terms carry more contents How Two basic heuristics TF Term Frequency Within doc frequency IDF Inverse Document Frequency TF normalization CS511 Advanced Database

View Full Document