CS276A Text Retrieval and Mining
Lecture 10

Recap of the last lecture

Improving search results, especially for high recall.
- E.g., searching for "aircraft" so it matches "plane"; "thermodynamic" so it matches "heat".
Options for improving results:
- Global methods
  - Query expansion
    - Thesauri
    - Automatic thesaurus generation
  - Global indirect relevance feedback
- Local methods
  - Relevance feedback
  - Pseudo relevance feedback

Probabilistic relevance feedback

Rather than reweighting in a vector space: if the user has told us some relevant and some irrelevant documents, we can build a probabilistic classifier, such as a Naive Bayes model:
  P(tk|R) = |Drk| / |Dr|
  P(tk|NR) = |Dnrk| / |Dnr|
where tk is a term; Dr is the set of known relevant documents; Drk is the subset of Dr that contains tk; Dnr is the set of known irrelevant documents; Dnrk is the subset of Dnr that contains tk.

Why probabilities in IR?

In traditional IR systems, matching between each document and the query is attempted in a semantically imprecise space of index terms. (The slide's diagram shows the user information need mapped to a query representation, the documents mapped to document representations, and asks: how to match?) Two sources of uncertainty:
- Our understanding of the user's information need is uncertain.
- Whether a document has relevant content is an uncertain guess.
Probabilities provide a principled foundation for uncertain reasoning. Can we use probabilities to quantify our uncertainties?

Probabilistic IR topics

- Classical probabilistic retrieval model
  - Probability ranking principle, etc.
- (Naïve) Bayesian text categorization
- Bayesian networks for text retrieval
- Language model approach to IR
  - An important emphasis in recent work
Probabilistic methods are one of the oldest but also one of the currently hottest topics in IR. Traditionally: neat ideas, but they've never won on performance.
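As a concrete illustration, the relevance-feedback estimates P(tk|R) = |Drk| / |Dr| and P(tk|NR) = |Dnrk| / |Dnr| can be computed directly from the two judged document sets. A minimal sketch (the function name `feedback_estimates` and the toy documents are illustrative, not from the lecture):

```python
def feedback_estimates(term, relevant_docs, irrelevant_docs):
    """Maximum-likelihood estimates of P(term|R) and P(term|NR).

    relevant_docs / irrelevant_docs: lists of judged documents, each
    represented as a set of terms (Dr and Dnr in the lecture's notation).
    """
    drk = sum(1 for d in relevant_docs if term in d)     # |Drk|
    dnrk = sum(1 for d in irrelevant_docs if term in d)  # |Dnrk|
    return drk / len(relevant_docs), dnrk / len(irrelevant_docs)

# Hypothetical feedback sets:
Dr = [{"aircraft", "plane"}, {"plane", "engine"}]       # judged relevant
Dnr = [{"plane", "ticket"}, {"heat", "thermodynamic"}]  # judged irrelevant

p_rel, p_nonrel = feedback_estimates("plane", Dr, Dnr)
# "plane" occurs in 2 of 2 relevant docs and 1 of 2 irrelevant docs,
# so P(plane|R) = 1.0 and P(plane|NR) = 0.5
```

In practice these raw ratios are smoothed (e.g., add 0.5 to each count) so that unseen terms do not get probability zero.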
It may be different now.

The document ranking problem

- We have a collection of documents.
- A user issues a query.
- A list of documents needs to be returned.
- The ranking method is the core of an IR system: in what order do we present documents to the user? We want the "best" document first, the second best second, etc.
- Idea: rank by probability of relevance of the document w.r.t. the information need:
  P(relevant | document_i, query)

Recall a few probability basics

For events a and b:
  p(a, b) = p(a AND b) = p(a|b) p(b) = p(b|a) p(a)
Bayes' Rule (posterior = likelihood x prior / evidence):
  p(a|b) = p(b|a) p(a) / p(b),  where  p(b) = p(b|a) p(a) + p(b|~a) p(~a)
  (the sum is over x in {a, ~a}; ~a denotes the complement of a)
Odds:
  O(a) = p(a) / p(~a) = p(a) / (1 - p(a))

The Probability Ranking Principle

"If a reference retrieval system's response to each request is a ranking of the documents in the collection in order of decreasing probability of relevance to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data have been made available to the system for this purpose, the overall effectiveness of the system to its user will be the best that is obtainable on the basis of those data."
[1960s/1970s] S. Robertson, W.S. Cooper, M.E. Maron; van Rijsbergen (1979:113); Manning & Schütze (1999:538)

Probability Ranking Principle

Let x be a document in the collection. Let R represent relevance of a document w.r.t. a given (fixed) query, and let NR represent non-relevance. (Equivalently, R = {0,1} vs. the NR/R notation.)
We need to find p(R|x), the probability that a document x is relevant. By Bayes' Rule:
  p(R|x) = p(x|R) p(R) / p(x)
  p(NR|x) = p(x|NR) p(NR) / p(x)
where:
- p(x|R), p(x|NR): the probability that if a relevant (non-relevant) document is retrieved, it is x;
- p(R), p(NR): the prior probability of retrieving a (non-)relevant document;
- p(R|x) + p(NR|x) = 1.

Probability Ranking Principle (PRP)

Simple case: no selection costs or other utility concerns that would differentially weight errors.
Bayes' Optimal Decision Rule: x is relevant iff p(R|x) > p(NR|x).
PRP in action: rank all documents by p(R|x).
Theorem: using the PRP is optimal, in that it minimizes the loss (Bayes risk) under 1/0 loss. Provable if all probabilities are correct, etc. [e.g., Ripley 1996]

Probability Ranking Principle

More complex case: retrieval costs. Let d be a document, C the cost of retrieval of a relevant document, and C' the cost of retrieval of a non-relevant document. The Probability Ranking Principle then says: if
  C p(R|d) + C' (1 - p(R|d))  <=  C p(R|d') + C' (1 - p(R|d'))
for all documents d' not yet retrieved, then d is the next document to be retrieved. We won't further consider loss/utility from now on.

Probability Ranking Principle

How do we compute all those probabilities? We do not know the exact probabilities and have to use estimates. Binary Independence Retrieval (BIR), which we discuss later today, is the simplest model.
Questionable assumptions:
- "Relevance" of each document is independent of the relevance of other documents. (Really, it's bad to keep on returning duplicates.)
- A Boolean model of relevance.
- The user has a single-step information need. (Seeing a range of results might let the user refine the query.)

Probabilistic Retrieval Strategy

- Estimate how terms contribute to relevance: how do things like tf, df, and document length influence your judgments about document relevance? One answer is the Okapi formulae (S. Robertson).
- Combine these to find the document relevance probability.
- Order documents by decreasing probability.

Probabilistic Ranking

Basic concept: "For a given query, if we know some documents that are relevant, terms that
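The ranking strategy above can be sketched end to end: assuming we already have term estimates p_t = P(t|R) and r_t = P(t|NR) (e.g., from the relevance-feedback counts earlier in the lecture), documents are ordered by the binary-independence log-odds score, summing over query terms present in each document. The variable names and toy estimates below are illustrative, not the lecture's:

```python
import math

def rsv(doc_terms, p, r):
    """Binary-independence retrieval status value: the sum, over query
    terms t present in the document, of log[p_t (1 - r_t) / (r_t (1 - p_t))].
    p and r map each query term t to P(t|R) and P(t|NR) respectively."""
    score = 0.0
    for t, p_t in p.items():
        if t in doc_terms:
            score += math.log(p_t * (1 - r[t]) / (r[t] * (1 - p_t)))
    return score

def rank(docs, p, r):
    """Rank documents (name -> set of terms) by decreasing RSV, i.e.,
    by decreasing estimated probability of relevance, as the PRP says."""
    return sorted(docs, key=lambda name: rsv(docs[name], p, r), reverse=True)

# Hypothetical estimates for a two-term query:
p = {"aircraft": 0.8, "engine": 0.6}   # P(t|R)
r = {"aircraft": 0.3, "engine": 0.3}   # P(t|NR)
docs = {"d1": {"aircraft", "engine"},
        "d2": {"engine", "ticket"},
        "d3": {"ticket"}}
# rank(docs, p, r) places d1 (both terms) ahead of d2 (one term) ahead of d3
```

A document containing no query terms scores 0, and each matching term with p_t > r_t adds a positive log-odds contribution, so ranking by RSV implements "rank by probability of relevance" under the binary independence assumptions.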