CS347 Lecture 5
April 23, 2001
Prabhakar Raghavan

Today's topics
- Generalized query operators: evidence accumulation and structured queries
- Basics of Bayesian networks
- Bayesian nets for text retrieval
- Structured + unstructured queries: adding database-like queries

A step back
- The user has an information need
- The user translates the need into a query (the translation depends on the IR system's syntax and on the user's sophistication)
- The user reviews, browses, navigates, and updates the query
- The system presents docs meeting the query
- Key: the system's model of query/doc proximity

Models of query-doc proximity
- Boolean queries: a doc is either in or out for a query
- Vector spaces: a doc has a non-negative proximity to a query
- Evidence accumulation: combine scores from multiple sources
- Bayesian nets and probabilistic methods: infer the probability that a doc meets the user's information need

Evidence accumulation
- View each term in the query as providing partial evidence of a match
- tf.idf vector space retrieval is one example
- It is corpus dependent: idf depends on the corpus
- In some situations, corpus-dependent evidence is undesirable

Corpus-independent evidence: when is corpus-independent scoring useful?
- When corpus statistics are hard to maintain
  - Distributed indices (more later)
  - Rapidly changing corpora
- When stable scores are desired
  - Users get used to issuing a search and seeing a doc with a score of, say, 0.9303
  - The user subsequently filters by score: "show me only docs with score at least 0.9"

Corpus-independent scoring
- Document routing is a key application
- There is a list of standing queries, e.g., "bounced check" in a bank's email customer service department
- Each incoming doc (email) is scored against all standing queries
- The doc is routed to a destination (customer specialist) based on the best-scoring standing query
- More on this with automatic classification

Typical corpus-independent score
- Use a convex function of tf_ij, e.g., Score(i,j) = 1 - exp(-a * tf_ij)
  - a is a tuning constant
  - This gives the contribution of query term i for doc j
- Given a multi-term query, compute the average contribution over all query terms
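The convex tf score above can be sketched as follows; the tuning constant `a` and the toy query/document term frequencies are illustrative values, not from the lecture.

```python
import math

def term_score(tf, a=0.5):
    """Corpus-independent contribution of one query term:
    a convex function of term frequency, 1 - exp(-a * tf).
    No idf, so the score does not depend on corpus statistics."""
    return 1.0 - math.exp(-a * tf)

def doc_score(query_terms, doc_tf, a=0.5):
    """Average the per-term contributions over all query terms.
    Terms absent from the doc have tf = 0 and contribute 0."""
    return sum(term_score(doc_tf.get(t, 0), a) for t in query_terms) / len(query_terms)

# Illustrative document term frequencies (hypothetical):
doc = {"bounced": 3, "check": 1}
print(doc_score(["bounced", "check"], doc))          # both terms present
print(doc_score(["bounced", "check", "fraud"], doc)) # 'fraud' contributes 0
```

Because the score depends only on the doc's own tf values, it stays stable as the corpus changes, which is exactly the property routing applications need.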
Bayesian Networks for Text Retrieval

- Text retrieval: find the best set of documents that satisfies a user's information need
- Bayesian network: model causal relationships between events; infer the belief that an event holds based on observations of other events

What is a Bayesian network?
- A directed acyclic graph
- Nodes: events or variables, which assume values (for our purposes, all Boolean)
- Links: model dependencies between nodes

Toy example
Finals (f) and Project Due (d) are parentless nodes; No Sleep (n) depends on f; Gloom (g) depends on f and d; Triple Latte (t) depends on g. The link matrices, reassembled from the slide:

  P(f) = 0.3, P(~f) = 0.7        P(d) = 0.4, P(~d) = 0.6
  P(n|f) = 0.9,  P(n|~f) = 0.3
  P(g|f,d) = 0.99,  P(g|f,~d) = 0.9,  P(g|~f,d) = 0.8,  P(g|~f,~d) = 0.3
  P(t|g) = 0.99,  P(t|~g) = 0.1

Links as dependencies
- A link matrix is attached to each node; it gives the influences of the parents on that node
- A node with no parent gets a prior probability, e.g., f, d
- An interior node has a conditional probability for every combination of values of its parents, e.g., n, g, t

Independence assumption
- Variables not connected by a link: no direct conditioning; e.g., P(t|g,f) = P(t|g)
- The joint probability is obtained from the link matrices:
    P(f, d, n, g, t) = P(f) P(d) P(n|f) P(g|f,d) P(t|g)

Chained inference
- Evidence: a node takes on some value
- Inference: compute the beliefs (probabilities) of other nodes, conditioned on the known evidence
- Two kinds of inference: diagnostic and predictive
- Computational complexity: inference in general networks is NP-hard; polytree networks are tractable
- Key inference tool: Bayes' theorem. For any two events a and c:
    P(a, c) = P(a|c) P(c) = P(c|a) P(a)
  which implies, for instance, P(c|a) = P(a|c) P(c) / P(a)

Diagnostic inference
- Propagate beliefs through the parents of a node
- Inference rule: P(a|c) = P(c|a) P(a) / P(c), where P(c) = sum_i P(c|b_i) P(b_i) over the values b_i of c's parent

Diagnostic inference example
- Evidence: n = true. Belief: P(f|n)?
- P(f|n)  = P(n|f) P(f) / P(n)  = 0.27 / P(n)
- P(~f|n) = P(n|~f) P(~f) / P(n) = 0.21 / P(n)
- Normalize: P(f|n) + P(~f|n) = 1, so P(n) = 0.48
- Beliefs: P(f|n) = 0.56, P(~f|n) = 0.44
Predictive inference
- Compute the beliefs of the child nodes of an evidence node
- Inference rule: P(c|a) = sum_i P(c|b_i) P(b_i|a), where the b_i range over the values of the intermediate node b on the path a -> b -> c

Model for text retrieval
- Goal: given a user's information need (the evidence), find the probability that a doc satisfies the need
- Retrieval model: model the docs in a document network; model the information need in a query network

Bayesian nets for text retrieval
- Document network: document nodes (d1, d2) linked to term/concept nodes (r1, r2, r3)
- Query network: concept nodes (c1, c2, c3), query operator nodes (q1, q2: AND, OR, NOT), and the information need node (i)

Link matrices and probabilities
- P(d): prior doc probability, 1/n
- P(r|d): within-document term frequency; tf.idf based
- P(c|r): 1-to-1, or thesaurus
- P(q|c): canonical forms of query operators

Example
- Document network: the docs Macbeth and Hamlet, linked to the terms "reason", "trouble", "double"
- Query network: the concepts "reason", "trouble", "two", combined through OR and NOT operators into the user query

Extensions
- Prior probabilities don't have to be 1/n
- The user's information need doesn't have to be a query: it can be words typed in, docs read, or any combination
- Link matrices can be modified over time: user feedback; the promise of personalization

Computational details
- The document network is built at indexing time
- The query network is built and scored at query time
- Representation: the link matrices from docs to any single term are like the postings entry for that term

Exercise
Consider ranking docs for a 1-term query. What is the difference between
- a cosine-based vector space ranking, where each doc has tf.idf components, normalized, and
- a Bayesian net in which the link matrices on the doc-to-term links are normalized tf.idf?

Semi-structured search
- Structured search: search by restricting on attribute values, as in databases
- Unstructured search: search in unstructured (text) files
- Semi-structured search: combine both

Terminology
- Each document has structured fields (aka attributes, columns) plus free-form text
- Each field assumes one of several possible values, e.g., language (French, Japanese, etc.), price (for products), date
- Fields can be ordered (price, speed) or unordered (language, color)

Queries
- A query is any combination of a text query and a field query
- A field query specifies one or more values for one or more fields
- For numerical values, ranges are possible, e.g., price < 5000

Example
- Find all docs in the corpus with Price < 10000, Year > 1996, Model Toyota, and whose text matches "excellent OR good NEAR condition"
- We don't want to hit the underlying database for each such query

Demo

Indexing the structured portion
- For each field, order the docs by their values for that field, e.g., sorted by authors' names, or by language
- Maintain range indices in memory for each value of each attribute, like a postings entry; the counts are like freq in postings …
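The field-plus-text query above can be sketched as follows. The mini-corpus, field names, and the simple "any of these words" text match are illustrative assumptions; the NEAR proximity operator and the in-memory range indices are not modeled here.

```python
# Hypothetical mini-corpus: structured fields plus free-form text.
docs = {
    1: {"price": 9500, "year": 1998, "model": "Toyota",
        "text": "excellent condition, one owner"},
    2: {"price": 12000, "year": 1997, "model": "Toyota",
        "text": "good condition"},
    3: {"price": 8000, "year": 1995, "model": "Honda",
        "text": "excellent condition"},
}

def field_match(doc, price_max=None, year_min=None, model=None):
    """Structured part: restrict on attribute values, as in a database."""
    if price_max is not None and doc["price"] >= price_max:
        return False
    if year_min is not None and doc["year"] <= year_min:
        return False
    if model is not None and doc["model"] != model:
        return False
    return True

def text_match(doc, any_of):
    """Unstructured part: doc text must contain at least one query term
    (a stand-in for 'excellent OR good'; NEAR is not modeled)."""
    words = set(doc["text"].replace(",", " ").split())
    return any(term in words for term in any_of)

def semi_structured(price_max, year_min, model, any_of):
    """Combine both restrictions, never touching an underlying database."""
    return [doc_id for doc_id, d in docs.items()
            if field_match(d, price_max, year_min, model) and text_match(d, any_of)]

print(semi_structured(10000, 1996, "Toyota", {"excellent", "good"}))  # -> [1]
```

A real implementation would answer the range predicates from the sorted per-field indices described above, then intersect the resulting doc-id sets with the text postings, rather than scanning every document.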


Stanford CS 347 - Lecture Notes
