Brandeis CS 101A - Bayesian Reasoning

Foundations of Artificial Intelligence (CS101, Fall 2007)
Bayesian Reasoning

Axioms of Probability Theory
• All probabilities lie between 0 and 1: 0 ≤ P(A) ≤ 1
• A true proposition has probability 1 and a false proposition has probability 0: P(true) = 1, P(false) = 0
• The probability of a disjunction is: P(A ∨ B) = P(A) + P(B) − P(A ∧ B)

Conditional Probability
• P(A | B) is the probability of A given B.
• Assumes that B is all and only the information known.
• Defined by: P(A | B) = P(A ∧ B) / P(B)

Independence
• A and B are independent iff: P(A | B) = P(A) and P(B | A) = P(B)
• Therefore, if A and B are independent: P(A | B) = P(A ∧ B) / P(B) = P(A), so P(A ∧ B) = P(A) P(B)
• These two constraints are logically equivalent.

Joint Distribution
• The joint probability distribution for a set of random variables X1, …, Xn gives the probability of every combination of values: P(X1, …, Xn). If all variables are discrete with v values each, this is an n-dimensional array with v^n entries, and all v^n entries must sum to 1.
• The probability of any conjunction (an assignment of values to some subset of the variables) can be calculated by summing the appropriate subset of values from the joint distribution.
• Therefore, all conditional probabilities can also be calculated.

  positive     circle   square
  red          0.20     0.02
  blue         0.02     0.01

  negative     circle   square
  red          0.05     0.30
  blue         0.20     0.20

  P(red ∧ circle) = 0.20 + 0.05 = 0.25
  P(positive | red ∧ circle) = P(positive ∧ red ∧ circle) / P(red ∧ circle) = 0.20 / 0.25 = 0.80
  P(red) = 0.20 + 0.02 + 0.05 + 0.30 = 0.57

Probabilistic Classification
• Let Y be the random variable for the class, which takes values {y1, y2, …, ym}.
• Let X be the random variable describing an instance, a vector of values for n features <X1, X2, …, Xn>; let xk be a possible value for X and xij a possible value for Xi.
• For classification, we need to compute P(Y = yi | X = xk) for i = 1…m.
• However, given no other assumptions, this requires a table giving the probability of each category for every possible instance in the instance space, which is impossible to estimate accurately from a reasonably sized training set.
  – Assuming Y and all Xi are binary, we need 2^n entries to specify P(Y = pos | X = xk) for each of the 2^n possible xk's, since P(Y = neg | X = xk) = 1 − P(Y = pos | X = xk).
  – Compare this to the 2^(n+1) − 1 entries needed for the joint distribution P(Y, X1, X2, …, Xn).

Bayes Theorem
  P(H | E) = P(E | H) P(H) / P(E)
Simple proof from the definition of conditional probability:
  P(H | E) = P(H ∧ E) / P(E)        (def. of conditional probability)
  P(E | H) = P(H ∧ E) / P(H)        (def. of conditional probability)
  Therefore P(H ∧ E) = P(E | H) P(H), and so
  P(H | E) = P(E | H) P(H) / P(E)   QED

Bayesian Categorization
• Determine the category of xk by computing, for each yi:
  P(Y = yi | X = xk) = P(Y = yi) P(X = xk | Y = yi) / P(X = xk)
• P(X = xk) can be determined because the categories are complete and disjoint:
  Σ_{i=1..m} P(Y = yi | X = xk) = Σ_{i=1..m} P(Y = yi) P(X = xk | Y = yi) / P(X = xk) = 1
  P(X = xk) = Σ_{i=1..m} P(Y = yi) P(X = xk | Y = yi)

Bayesian Categorization (cont.)
• Need to know:
  – Priors: P(Y = yi)
  – Conditionals: P(X = xk | Y = yi)
• P(Y = yi) is easily estimated from data.
  – If ni of the examples in D are in category yi, then P(Y = yi) = ni / |D|.
• There are too many possible instances (e.g. 2^n for binary features) to estimate all P(X = xk | Y = yi).
• We still need to make some sort of independence assumption about the features to make learning tractable.
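As a concrete check of the joint-distribution calculations above, here is a minimal Python sketch (added to these notes, not part of the original slides) that stores the positive/negative joint table as a dictionary and recomputes the marginal and conditional probabilities worked out above (0.25, 0.80, 0.57).

```python
# Minimal sketch (not from the original slides): reproduce the joint-distribution
# calculations above by summing entries of P(category, color, shape).
joint = {
    ("positive", "red",  "circle"): 0.20, ("positive", "red",  "square"): 0.02,
    ("positive", "blue", "circle"): 0.02, ("positive", "blue", "square"): 0.01,
    ("negative", "red",  "circle"): 0.05, ("negative", "red",  "square"): 0.30,
    ("negative", "blue", "circle"): 0.20, ("negative", "blue", "square"): 0.20,
}

def prob(**conditions):
    """Marginal probability of any conjunction, e.g. prob(color='red', shape='circle')."""
    names = ("category", "color", "shape")
    return sum(p for key, p in joint.items()
               if all(key[names.index(n)] == v for n, v in conditions.items()))

p_red_circle = prob(color="red", shape="circle")                   # 0.20 + 0.05 = 0.25
p_pos_given  = prob(category="positive", color="red",
                    shape="circle") / p_red_circle                 # 0.20 / 0.25 = 0.80
p_red        = prob(color="red")                                   # 0.57

print(p_red_circle, p_pos_given, p_red)
```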
Generative Probabilistic Models
• Assume a simple (usually unrealistic) probabilistic method by which the data was generated.
• For categorization, each category has a different parameterized generative model that characterizes that category.
• Training: use the data for each category to estimate the parameters of that category's generative model.
  – Maximum Likelihood Estimation (MLE): set the parameters to maximize the probability that the model produced the given training data.
  – If M_θ denotes a model with parameter values θ and Dk is the training data for the kth class, find the model parameters for class k (θk) that maximize the likelihood of Dk:
    θk = argmax_θ P(Dk | M_θ)
• Testing: use Bayesian analysis to determine the category model that most likely generated a specific test instance.

Naïve Bayes Generative Model
[Figure: one "bag" of training examples per category (Positive, Negative); each example is generated by drawing a Size (sm, med, lg), a Color (red, blue, grn), and a Shape (circ, sqr, tri) from that category's distributions.]

Naïve Bayes Inference Problem
[Figure: the same generative model, now asked which category most likely generated the test instance <lg, red, circ>.]

Naïve Bayesian Categorization
• If we assume the features of an instance are independent given the category (conditionally independent), then:
  P(X | Y) = P(X1, X2, …, Xn | Y) = Π_{i=1..n} P(Xi | Y)
• Therefore, we only need to know P(Xi | Y) for each possible feature-value/category pair.
• If Y and all Xi are binary, this requires specifying only 2n parameters:
  – P(Xi = true | Y = true) and P(Xi = true | Y = false) for each Xi
  – P(Xi = false | Y) = 1 − P(Xi = true | Y)
• Compare this to specifying 2^n parameters without any independence assumptions.

Naïve Bayes Example
Test instance: <medium, red, circle>

  Probability       positive   negative
  P(Y)              0.5        0.5
  P(small | Y)      0.4        0.4
  P(medium | Y)     0.1        0.2
  P(large | Y)      0.5        0.4
  P(red | Y)        0.9        0.3
  P(blue | Y)       0.05       0.3
  P(green | Y)      0.05       0.4
  P(square | Y)     0.05       0.4
  P(triangle | Y)   0.05       0.3
  P(circle | Y)     0.9        0.3

Naïve Bayes Example (cont.)
Using only the rows needed for the test instance <medium, red, circle>:

  Probability       positive   negative
  P(Y)              0.5        0.5
  P(medium | Y)     0.1        0.2
  P(red | Y)        0.9        0.3
  P(circle | Y)     0.9        0.3

  P(positive | X) = P(positive) P(medium | positive) P(red | positive) P(circle | positive) / P(X)
                  = 0.5 × 0.1 × 0.9 × 0.9 / P(X) = 0.0405 / P(X)
  P(negative | X) = P(negative) P(medium | negative) P(red | negative) P(circle | negative) / P(X)
                  = 0.5 × 0.2 × 0.3 × 0.3 / P(X) = 0.009 / P(X)
  Since P(positive | X) + P(negative | X) = 0.0405 / P(X) + 0.009 / P(X) = 1,
  P(X) = 0.0405 + 0.009 = 0.0495
  P(positive | X) = 0.0405 / 0.0495 ≈ 0.818
  P(negative | X) = 0.009 / 0.0495 ≈ 0.182

Estimating Probabilities
• Normally, probabilities are estimated from observed frequencies in the training data.
• If D contains nk examples in category yk, and nijk of these nk examples have the jth value xij for feature Xi, then:
  P(Xi = xij | Y = yk) = nijk / nk
• However, estimating such probabilities from small training sets is error-prone.
• If, due only to chance, a rare feature Xi is always false in the training data, then ∀yk: P(Xi = true | Y = yk) = 0.
• If …
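The Naïve Bayes worked example above can be reproduced mechanically. The following Python sketch (added to these notes; the parameter values are copied from the example table) multiplies the prior by the per-feature conditionals for the test instance <medium, red, circle> and normalizes over the two categories.

```python
# Minimal sketch (not part of the original slides): Naive Bayes inference using
# the parameter table from the example above.
priors = {"positive": 0.5, "negative": 0.5}
conditionals = {                       # P(feature value | Y)
    "positive": {"small": 0.4, "medium": 0.1, "large": 0.5,
                 "red": 0.9, "blue": 0.05, "green": 0.05,
                 "square": 0.05, "triangle": 0.05, "circle": 0.9},
    "negative": {"small": 0.4, "medium": 0.2, "large": 0.4,
                 "red": 0.3, "blue": 0.3, "green": 0.4,
                 "square": 0.4, "triangle": 0.3, "circle": 0.3},
}

def classify(instance):
    # Unnormalized posterior for each class: P(Y) * prod_i P(X_i | Y)
    scores = {}
    for y, prior in priors.items():
        score = prior
        for value in instance:
            score *= conditionals[y][value]
        scores[y] = score
    p_x = sum(scores.values())         # P(X), since the categories are complete and disjoint
    return {y: s / p_x for y, s in scores.items()}

print(classify(["medium", "red", "circle"]))
# {'positive': 0.818..., 'negative': 0.181...}  (0.0405 and 0.009 before normalizing)
```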


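Finally, the frequency estimate P(Xi = xij | Y = yk) = nijk / nk from the Estimating Probabilities slide amounts to simple counting. The sketch below uses a small made-up training set (the labels and feature values are illustrative, not taken from the slides) and shows how a value that never co-occurs with a class gets probability 0, which is exactly the small-sample problem the slide points out.

```python
# Minimal sketch (illustrative data, not from the slides): estimate P(X_i = x_ij | Y = y_k)
# as n_ijk / n_k by counting feature values within each class.
from collections import Counter, defaultdict

# Hypothetical training examples: (class label, {feature: value})
data = [
    ("positive", {"size": "small", "color": "red"}),
    ("positive", {"size": "large", "color": "red"}),
    ("negative", {"size": "small", "color": "blue"}),
    ("negative", {"size": "large", "color": "blue"}),
]

n_k = Counter(y for y, _ in data)        # n_k: examples per class
n_ijk = defaultdict(Counter)             # (class, feature) -> counts of each value
for y, features in data:
    for feature, value in features.items():
        n_ijk[(y, feature)][value] += 1

def p(feature, value, y):
    """Frequency estimate of P(feature = value | Y = y)."""
    return n_ijk[(y, feature)][value] / n_k[y]

print(p("color", "red", "positive"))     # 2/2 = 1.0
print(p("color", "red", "negative"))     # 0/2 = 0.0  <- zero count, the problem noted above
```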