MLE's, Bayesian Classifiers and Naïve Bayes

Machine Learning 10-601
Tom M. Mitchell
Machine Learning Department
Carnegie Mellon University
January 30, 2008

Required reading:
• Mitchell draft chapter, sections 1 and 2 (available on class website)

Naïve Bayes in a Nutshell

Bayes rule:
  P(Y=yk | X1…Xn) = P(Y=yk) P(X1…Xn | Y=yk) / Σj P(Y=yj) P(X1…Xn | Y=yj)

Assuming conditional independence among the Xi's:
  P(Y=yk | X1…Xn) = P(Y=yk) Πi P(Xi | Y=yk) / Σj P(Y=yj) Πi P(Xi | Y=yj)

So, the classification rule for Xnew = <X1, …, Xn> is:
  Ynew ← arg max_yk  P(Y=yk) Πi P(Xi_new | Y=yk)

Naïve Bayes Algorithm – discrete Xi

• Train Naïve Bayes (examples):
  for each value yk, estimate* πk ≡ P(Y=yk)
  for each value xij of each attribute Xi, estimate* θijk ≡ P(Xi=xij | Y=yk)
• Classify (Xnew):
  Ynew ← arg max_yk  πk Πi P(Xi = xi_new | Y=yk)

* probabilities must sum to 1, so only n-1 of each set of n parameters need to be estimated

Estimating Parameters: Y, Xi discrete-valued

Maximum likelihood estimates (MLE's):
  πk = P(Y=yk) = #D{Y=yk} / |D|
  θijk = P(Xi=xij | Y=yk) = #D{Xi=xij ∧ Y=yk} / #D{Y=yk}
where #D{Y=yk} is the number of items in training set D for which Y=yk, and similarly for the other counts.

Example: Live in Sq Hill? P(S | G, D, M)
• S=1 iff live in Squirrel Hill
• G=1 iff shop at Giant Eagle
• D=1 iff drive to CMU
• M=1 iff Dave Matthews fan

Naïve Bayes: Subtlety #1

If unlucky, our MLE estimate for P(Xi | Y) may be zero (e.g., X373 = Birthday_Is_January30).
• Why worry about just one parameter out of many?
• What can be done to avoid this?

Estimating Parameters: Y, Xi discrete-valued

Maximum likelihood estimates:
  πk = #D{Y=yk} / |D|
  θijk = #D{Xi=xij ∧ Y=yk} / #D{Y=yk}

MAP estimates (Dirichlet priors), adding l "imaginary" examples of each value:
  πk = (#D{Y=yk} + l) / (|D| + l·K), where K is the number of values of Y
  θijk = (#D{Xi=xij ∧ Y=yk} + l) / (#D{Y=yk} + l·J), where J is the number of values of Xi

Only difference: the "imaginary" examples.

Naïve Bayes: Subtlety #2

Often the Xi are not really conditionally independent.
• We use Naïve Bayes in many cases anyway, and it often works pretty well
  – often the right classification, even when not the right probability (see [Domingos & Pazzani, 1996])
• What is the effect on the estimated P(Y|X)?
  – Special case: what if we add two copies of a feature, Xi = Xk?

Learning to classify text documents
• Classify which emails are spam
• Classify which emails are meeting invites
• Classify which web pages are student home pages

How shall we represent text documents for Naïve Bayes?

Baseline: Bag of Words Approach

Represent each document by the count of each vocabulary word it contains, ignoring word order:
  aardvark  0
  about     2
  all       2
  Africa    1
  apple     0
  anxious   0
  ...
  gas       1
  ...
  oil       1
  ...
  Zaire     0

For code and data, see www.cs.cmu.edu/~tom/mlbook.html (click on "Software and Data").

What you should know:
• Training and using classifiers based on Bayes rule
• Conditional independence
  – What it is
  – Why it's important
• Naïve Bayes
  – What it is
  – Why we use it so much
  – Training using MLE and MAP estimates
  – Discrete variables (Bernoulli) and continuous variables (Gaussian)

Questions:
• Can you use Naïve Bayes for a combination of discrete and real-valued Xi?
• How can we easily model just 2 of n attributes as dependent?
• What does the decision surface of a Naïve Bayes classifier look like?
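To make the discrete Naïve Bayes algorithm above concrete, here is a minimal Python sketch. The names (train_nb, classify_nb, alpha) are illustrative, not from the lecture. Setting alpha = 0 gives the MLE estimates; alpha > 0 gives the MAP estimates with Dirichlet priors, i.e. alpha "imaginary" examples per attribute value, which avoids the zero-probability problem of Subtlety #1.

```python
from collections import defaultdict
import math

def train_nb(examples, alpha=1.0):
    """Estimate P(Y=yk) and P(Xi=xij | Y=yk) from (x, y) pairs of discrete attributes.

    alpha = 0  -> maximum likelihood estimates (MLE)
    alpha > 0  -> MAP estimates with a Dirichlet prior: alpha "imaginary"
                  examples per attribute value, so no estimate is ever zero.
    """
    class_counts = defaultdict(int)                      # #D{Y = yk}
    attr_counts = defaultdict(lambda: defaultdict(int))  # #D{Xi = xij and Y = yk}
    attr_values = defaultdict(set)                       # observed values of each Xi

    for x, y in examples:
        class_counts[y] += 1
        for i, xi in enumerate(x):
            attr_counts[(i, y)][xi] += 1
            attr_values[i].add(xi)

    n = len(examples)
    priors = {y: c / n for y, c in class_counts.items()}  # pi_k
    cond = {}                                             # theta_ijk
    for (i, y), counts in attr_counts.items():
        total = class_counts[y] + alpha * len(attr_values[i])
        cond[(i, y)] = {v: (counts[v] + alpha) / total for v in attr_values[i]}
    return priors, cond

def classify_nb(x_new, priors, cond):
    """Return argmax_yk P(Y=yk) * prod_i P(Xi = xi_new | Y=yk), computed in log space."""
    best_y, best_score = None, float("-inf")
    for y, p_y in priors.items():
        score = math.log(p_y)
        for i, xi in enumerate(x_new):
            p = cond[(i, y)].get(xi, 0.0)
            if p == 0.0:  # an unseen value (or an unsmoothed zero) wipes out the product
                score = float("-inf")
                break
            score += math.log(p)
        if score > best_score:
            best_y, best_score = y, score
    return best_y
```

For the Squirrel Hill example, each training pair would be of the form ((G, D, M), S), e.g. ((1, 0, 1), 1); classify_nb((1, 1, 0), *train_nb(data)) would then return the more probable value of S.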
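The bag-of-words representation can likewise be sketched in a few lines. The vocabulary and example document below are hypothetical, chosen only to reproduce counts like those on the slide; each word's count (or its presence/absence, in a Bernoulli-style model) then becomes one attribute Xi for Naïve Bayes.

```python
import re
from collections import Counter

def bag_of_words(text, vocabulary):
    """Map a document to per-word counts over a fixed vocabulary, ignoring word order."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    return {word: counts[word] for word in vocabulary}

# Hypothetical vocabulary and document, chosen to mirror the slide's counts.
vocab = ["aardvark", "about", "all", "africa", "apple", "anxious", "gas", "oil", "zaire"]
doc = "All about Africa: talk about oil and gas, all of it."
print(bag_of_words(doc, vocab))
# {'aardvark': 0, 'about': 2, 'all': 2, 'africa': 1, 'apple': 0,
#  'anxious': 0, 'gas': 1, 'oil': 1, 'zaire': 0}
```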
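On the closing question of combining discrete and real-valued Xi: because the conditional-independence assumption factorizes the model into a separate P(Xi | Y) per attribute, each attribute can use its own form. Below is a minimal sketch assuming Gaussian class-conditional densities for the continuous attributes and probability tables for the discrete ones; the function names and data structures are illustrative, not part of the lecture.

```python
import math

def gaussian_log_pdf(x, mu, sigma):
    """log N(x; mu, sigma^2): class-conditional density for a continuous attribute."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def log_score(y, x_discrete, x_continuous, priors, tables, gaussians):
    """log P(Y=y) + sum_i log P(Xi | Y=y), mixing discrete tables with Gaussians.

    tables[(i, y)][value] : P(Xi = value | Y = y) for the i-th discrete attribute
    gaussians[(j, y)]     : (mu, sigma) for the j-th continuous attribute, per class
    """
    score = math.log(priors[y])
    for i, v in enumerate(x_discrete):
        score += math.log(tables[(i, y)][v])
    for j, v in enumerate(x_continuous):
        mu, sigma = gaussians[(j, y)]
        score += gaussian_log_pdf(v, mu, sigma)
    return score  # classify Xnew by taking argmax of this score over the values of y
```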