Unformatted text preview:

Today s topics Generalized query operators CS347 Evidence accumulation and structured queries Basics of Bayesian networks Bayesian nets for Text Retrieval Lecture 5 April 23 2001 Structured unstructured queries Adding database like queries Prabhakar Raghavan A step back Models of query doc proximity Translation depends on IR system s syntax user s sophistication User has an information need Translates need to query Boolean queries Doc is either in or out for query Vector spaces Doc has non negative proximity to query Review Browse Navigate Update query System presents docs meeting query Key system s model of querydoc proximity Evidence accumulation Combine score from multiple sources Bayesian nets and probabilistic methods Infer probability that doc meets user s information need 1 Evidence accumulation View each term in the query as providing partial evidence of match tf idf vector space retrieval is one example Corpus dependent idf depends on corpus In some situations corpus dependent evidence is undesirable Corpus independent evidence When is corpus independent scoring useful When corpus statistics are hard to maintain Distributed indices more later Rapidly changing corpora When stable scores are desired Users get used to issuing a search and seeing a doc with a score of say 0 9303 User subsequently filters by score Show me only docs with score at least 0 9 Corpus independent scoring Document routing is a key application There is a list of standing queries e g bounced check in a bank s email customer service department Each incoming doc email scored against all standing queries Routed to destination customer specialist based on best scoring standing query Typical corpus independent score Use a convex function of tfij e g Score i j 1 exp a tfij a is a tuning constant gives a contribution of query term i for doc j Given a multi term query compute the average contribution over all query terms More on this with automatic classification 2 Bayesian Networks for Text Retrieval What is a Bayesian network Text retrieval Find the best set of documents that satisfies a user s information need Bayesian Network Links model dependencies between nodes Toy Example f 0 3 0 7 Finals f Project Due d Events or Variables Assume values For our purposes all Boolean Model causal relationship between events Infer the belief that an event holds based on observations of other events f Is a directed acyclic graph Nodes Links as dependencies d d 0 4 0 6 Link Matrix Attached to each node n f f 0 9 0 3 n 0 1 0 7 t t No Sleep n g 0 99 0 01 g 0 1 0 9 fd fd Gloom g 0 99 0 9 g g 0 01 0 1 Triple Latte t f d f d 0 8 0 3 0 2 0 7 Give influences of parents on that node Nodes with no parent get a prior probability e g f d interior node conditional probability of all combinations of values of its parents e g n g t 3 Independence Assumption Variables not connected by a link no direct conditioning Joint probability obtained from link matrices Independence Assumption Project Due d Finals f No Sleep n Independence assumption P t g f P t g Gloom g Joint probability See examples on next slide P f d n g t P f P d P n f P g f d P t g Triple Latte t Chained inference Evidence a node takes on some value Inference Compute belief probabilities of other nodes conditioned on the known evidence Two kinds of inference Diagnostic and Predictive Computational complexity Key inference tool Bayes theorem for any two events a c P a c P a c P c P c a P a Implies for instance P a c P c a P a P c General network NP hard polytree networks tractable 4 Diagnostic Inference Diagnostic inference Propagate beliefs through parents of a node Inference rule P a c a P a P c bi P bi a bi P c bi c Rule 0 27 P f P n f P f n P n P n 0 21 P f P n f P f n P n P n Normalize P f n P f n 1 P n 0 48 Beliefs P f n 0 56 P f n 0 44 0 3 0 7 n Project Due d Finals f f f 0 9 0 3 No Sleep n n 0 1 0 7 Diagnostic inference Inference f f Evidence n true Gloom g Triple Latte t Belief P f n Predictive Inference Compute belief of child nodes of evidence Inference rule P c a P c bi P bi a b a bi c 5 Model for Text Retrieval Goal Given a user s information need evidence find probability a doc satisfies need Retrieval model Model docs in a document network Model information need in a query network Bayesian nets for text retrieval Documents d1 r1 c1 c2 within document term frequency tf idf based P c r 1 to 1 thesaurus P q c canonical forms of query operators Concepts c3 Query Network q2 Query operators AND OR NOT i Link matrices and probabilities Document Network r3 Terms r2 q1 Prior doc probability P d 1 n P r d d2 Information need Example Macbeth Hamlet reason reason trouble trouble OR Document Network double two Query Network NOT User query 6 Extensions Prior probs don t have to be 1 n User information need doesn t have to be a query can be words typed in docs read any combination Link matrices can be modified over time Computational details Document network built at indexing time Query network built scored at query time Representation Link matrices from docs to any single term are like the postings entry for that term User feedback The promise of personalization Exercise Consider ranking docs for a 1 term query What is the difference between A cosine based vector space ranking where each doc has tf idf components normalized A Bayesian net in which the link matrices on the docs to term links are normalized tf idf Semi structured search Structured search search by restricting on attribute values as in databases Unstructured search search in unstructured files as in text Semi structured search combine both 7 Terminology Each document has structured fields aka attributes columns free form text Each field assumes one of several possible values e g language French Japanese etc price for products date Queries A query is any combination of text query field query A field query specifies one or more values for one or more fields for numerical values ranges possible e g price 5000 Fields can be ordered price speed or unordered language color Example Indexing structured portion Find all docs in corpus with For each fields order docs by values for that field Price 10000 Year 1996 Model Toyota and text matches excellent OR good NEAR condition Don t want to hit underlying database e g sorted by authors names language Maintain range indices in memory for each value of each attribute like a postings entry counts …


View Full Document

Stanford CS 347 - Lecture Notes

Download Lecture Notes
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Lecture Notes and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Lecture Notes and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?