Lecture 13 Text Databases Information Retrieval Part I Oct 17 2007 ChengXiang Zhai CS511 Advanced Database Management Systems 1 The Special Role of Textual Information The most natural way of encoding knowledge Think about scientific literature The most common type of information How much textual information do you produce and consume every day 80 unstructured data vs 20 structured data The most basic form of information It can be used to describe other media of information Natural language is more natural to use for querying CS511 Advanced Database Management Systems 2 What is Text Retrieval TR There exists a collection of text documents User gives a query to express the information need A retrieval system returns relevant documents to users More often called information retrieval or IR But IR tends to be much broader May include non textual information May include text categorization or summarization In industry it s search engine technologies CS511 Advanced Database Management Systems 3 TR vs Database Retrieval Information Unstructured free text vs structured data Ambiguous vs well defined semantics Query Ambiguous vs well defined semantics Incomplete vs complete specification Answers Relevant documents vs matched records TR is an empirically defined problem CS511 Advanced Database Management Systems 4 Landscape of Data Management Query capability Inferences Mining Inexact Matching Exact Matching ed r u ct a u t r St Da d re tu c tru ata s D an iU d e m i t l ta Mu Da Data complexity CS511 Advanced Database Management Systems RDMS Scale 5 Landscape of Data Management Query capability Inferences Mining Inexact Matching TextRetrievalSystem d re tu c tru ata s D an iU d e m i t l ta Mu Da Data complexity CS511 Advanced Database Management Systems Scale 6 History of TR on One Slide Birth of TR 1945 V Bush s article As we may think 1957 H P Luhn s idea of word counting and matching Indexing Evaluation Methodology 1960 s Smart system G Salton s group Cranfield test collection C Cleverdon s group Indexing automatic can be as good as manual TR Models 1970 s 1980 s Large scale Evaluation Applications 1990 s TREC D Harman E Voorhees NIST Google CS511 Advanced Database Management Systems 7 Formal Formulation of TR Vocabulary V w w w of language Query q q q where q V Document d d d where d V Collection C d d Set of relevant documents R q C 1 1 i 2 N m i i1 imi 1 k ij Generally unknown and user dependent Query is a hint on which doc is in R q Task compute R q an approximate R q CS511 Advanced Database Management Systems 8 Computing R q Strategy 1 Document selection R q d C f d q 1 where f d q 0 1 is an indicator function or classifier System must decide if a doc is relevant or not absolute relevance Strategy 2 Document ranking R q d C f d q where f d q is a relevance measure function is a cutoff System must decide if one doc is more likely to be relevant than another relative relevance CS511 Advanced Database Management Systems 9 Document Selection vs Ranking True R q CS511 Advanced Database Management Systems 1 Doc Selection f d q Doc Ranking f d q 0 R q 0 98 d1 0 95 d2 0 83 d3 0 80 d4 0 76 d5 0 56 d6 0 34 d7 0 21 d8 0 21 d9 R q 10 Problems of Doc Selection The classifier is unlikely accurate Over constrained query terms are too specific no relevant documents found Under constrained query terms are too general over delivery It is extremely hard to find the right position between these two extremes Even if it is accurate all relevant documents are not equally relevant CS511 Advanced Database Management Systems 11 Ranking is often preferred Relevance is a matter of degree A user can stop browsing anywhere so the boundary is controlled by the user High recall users would view more items High precision users would view only a few justification Probability Ranking Principle Theoretical Robertson 77 CS511 Advanced Database Management Systems 12 Probability Ranking Principle Robertson 77 As stated by Cooper If a reference retrieval system s response to each request is a ranking of the documents in the collections in order of decreasing probability of usefulness to the user who submitted the request where the probabilities are estimated as accurately a possible on the basis of whatever data made available to the system for this purpose then the overall effectiveness of the system to its users will be the best that is obtainable on the basis of that data Robertson provides two formal justifications Assumptions Independent relevance and sequential browsing CS511 Advanced Database Management Systems 13 According to the PRP all we need is A relevance measure function f which satisfies For all q d1 d2 f q d1 f q d2 iff p Rel q d1 p Rel q d2 Most IR research is centered on finding a good f CS511 Advanced Database Management Systems 14 The Notion of Relevance Relevance Rep q Rep d Similarity Different rep similarity P d q or P q d Probabilistic inference P r 1 q d r 0 1 Probability of Relevance Regression Model Fox 83 Generative Model Doc generation Vector space Prob distr model model Salton et al 75 Wong Yao 89 Query generation Different inference system Prob concept space model Wong Yao 95 Classical LM prob Model approach Robertson Ponte Croft 98 Sparck Jones 76 Lafferty Zhai 01a CS511 Advanced Database Management Systems Inference network model Turtle Croft 91 15 Model 1 Relevance Similarity Assumptions Query and document are represented similarly A query can be regarded as a document Relevance d q similarity d q R q d C f d q f q d Rep q Rep d Key issues How to represent query document How to define the similarity measure CS511 Advanced Database Management Systems 16 Vector Space Model Represent a doc query by a term vector Term basic concept e g word or phrase Each term defines one dimension N terms define a high dimensional space Element of vector corresponds to term weight E g d x1 xN xi is importance of term i Measure relevance by the distance between the query vector and document vector in the vector space CS511 Advanced Database Management Systems 17 VS Model illustration Starbucks D9 D11 D2 D5 D3 D10 D4 D6 Query D7 D8 Microsoft CS511 Advanced Database Management Systems Java D1 18 What the VS model doesn t say How to define select the basic concept Concepts are assumed to be orthogonal How to assign weights Weight in query indicates importance of term Weight in doc indicates how well the term characterizes the doc How to define the similarity distance measure CS511 Advanced Database
View Full Document