Machine Learning 10-701
Tom M. Mitchell
Machine Learning Department, Carnegie Mellon University
January 25, 2011

Today:
• Naïve Bayes
  – discrete-valued Xi's
  – document classification
• Gaussian Naïve Bayes
  – real-valued Xi's
  – brain image classification
• Form of decision surfaces

Readings:
Required:
• Mitchell: "Naïve Bayes and Logistic Regression" (available on the class website)
Optional:
• Bishop 1.2.4
• Bishop 4.2

Naïve Bayes in a Nutshell

Bayes rule:

  P(Y=yk | X1…Xn) = P(X1…Xn | Y=yk) P(Y=yk) / Σ_j P(X1…Xn | Y=yj) P(Y=yj)

Assuming conditional independence among the Xi's:

  P(Y=yk | X1…Xn) = P(Y=yk) Π_i P(Xi | Y=yk) / Σ_j P(Y=yj) Π_i P(Xi | Y=yj)

So, the classification rule for Xnew = <X1, …, Xn> is:

  Ynew ← argmax_yk P(Y=yk) Π_i P(Xi^new | Y=yk)

Another way to view Naïve Bayes (Boolean Y):

  P(Y=1 | X1…Xn) / P(Y=0 | X1…Xn) = [P(Y=1) / P(Y=0)] Π_i [P(Xi | Y=1) / P(Xi | Y=0)]

Decision rule: is this quantity greater or less than 1?

Naïve Bayes: classifying text documents
• Classify which emails are spam?
• Classify which emails promise an attachment?
How shall we represent text documents for Naïve Bayes?

Learning to classify documents: P(Y|X)
• Y discrete valued, e.g., spam or not
• X = <X1, X2, …, Xn> = document
• Xi is a random variable describing…

Answer 1: Xi is boolean: 1 if word i is in the document, else 0
  e.g., Xpleased = 1
Issues? (one boolean variable per vocabulary word; word order and word counts are lost)

Answer 2:
• Xi represents the word in the ith position of the document
• X1 = "I", X2 = "am", X3 = "pleased"
• and let's assume the Xi are iid (independent, identically distributed)

Learning to classify documents: P(Y|X) — the "Bag of Words" model
• Y discrete valued, e.g., spam or not
• X = <X1, X2, …, Xn> = document
• Xi are iid random variables; each represents the word at its position i in the document
• Generating a document according to this distribution = rolling a 50,000-sided die, once for each word position in the document
• The observed counts for each word follow a ???
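To make the classification rule above concrete, here is a minimal sketch (illustrative code, not the course's implementation) of naïve Bayes with the "Answer 1" boolean word features. The function names `train_nb`/`classify_nb` and the smoothing pseudo-count `beta` are my own choices, not from the slides.

```python
import math

def train_nb(docs, labels, vocab, beta=1.0):
    """Estimate P(Y) and P(word present | Y) for boolean word features.
    beta is an assumed smoothing pseudo-count to avoid zero probabilities."""
    classes = set(labels)
    prior = {y: labels.count(y) / len(labels) for y in classes}
    cond = {}
    for y in classes:
        docs_y = [d for d, l in zip(docs, labels) if l == y]
        for w in vocab:
            present = sum(1 for d in docs_y if w in d)
            # smoothed estimate of P(X_w = 1 | Y = y)
            cond[(w, y)] = (present + beta) / (len(docs_y) + 2 * beta)
    return prior, cond

def classify_nb(doc, prior, cond, vocab):
    """Return argmax_y P(Y=y) * prod_i P(X_i | Y=y), computed in log space
    so products of many small probabilities don't underflow."""
    def log_post(y):
        s = math.log(prior[y])
        for w in vocab:
            p = cond[(w, y)]
            s += math.log(p if w in doc else 1.0 - p)
        return s
    return max(prior, key=log_post)

# toy usage with a hypothetical 4-word vocabulary
docs = [{"cheap", "pills"}, {"meeting", "agenda"},
        {"cheap", "meeting"}, {"pills", "cheap"}]
labels = ["spam", "ham", "ham", "spam"]
vocab = {"cheap", "pills", "meeting", "agenda"}
prior, cond = train_nb(docs, labels, vocab)
print(classify_nb({"cheap", "pills"}, prior, cond, vocab))  # prints: spam
```

Working in log space matters in practice: with a 50,000-word vocabulary, the product of per-word probabilities underflows double precision almost immediately.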
distribution?
(Answer: a multinomial distribution.)

Multinomial Bag of Words — example word counts for one document:
  aardvark 0, about 2, all 2, Africa 1, apple 0, anxious 0, …, gas 1, …, oil 1, …, Zaire 0

MAP estimates for bag of words

MAP estimate for the multinomial, with Dirichlet prior parameters β_w (pseudo-counts):

  θ̂_w ≡ P(word = w) = (count_w + β_w − 1) / Σ_w' (count_w' + β_w' − 1)

What β's should we choose? (e.g., uniform β_w = 2 gives the familiar "add-one" Laplace smoothing)

Naïve Bayes Algorithm — discrete Xi

• Train_Naïve_Bayes(examples):
    for each value yk:
      estimate π_k ≡ P(Y = yk)
      for each value xij of each attribute Xi:
        estimate θ_ijk ≡ P(Xi = xij | Y = yk)
        (for text: the probability that word xij appears in position i, given Y = yk)
• Classify(Xnew):
    Ynew ← argmax_yk π_k Π_i P(Xi^new | Y = yk)

* Additional assumption for text: word probabilities are position independent.

For code and data, see www.cs.cmu.edu/~tom/mlbook.html — click on "Software and Data".

What if we have continuous Xi?
E.g., image classification: Xi is the real-valued ith pixel.

Naïve Bayes requires P(Xi | Y=yk), but Xi is real-valued (continuous).
Common approach: assume P(Xi | Y=yk) follows a Normal (Gaussian) distribution.

Gaussian Distribution (also called "Normal"):

  p(x) = (1 / (σ √(2π))) exp( −(x − µ)² / (2σ²) )

p(x) is a probability density function, whose integral (not sum) is 1.

Gaussian Naïve Bayes (GNB): assume

  P(Xi = x | Y = yk) = (1 / (σ_ik √(2π))) exp( −(x − µ_ik)² / (2σ_ik²) )

Sometimes we additionally assume the variance
• is independent of Y (i.e., σ_i),
• or independent of Xi (i.e., σ_k),
• or both (i.e., σ).

Gaussian Naïve Bayes Algorithm — continuous Xi (but still discrete Y)

• Train_Naïve_Bayes(examples):
    for each value yk:
      estimate* π_k ≡ P(Y = yk)
      for each attribute Xi, estimate the class-conditional mean µ_ik and variance σ_ik
• Classify(Xnew):
    Ynew ← argmax_yk π_k Π_i N(Xi^new; µ_ik, σ_ik)

* probabilities must sum to 1, so we need to estimate only n − 1 of the prior parameters…

Estimating Parameters: Y discrete, Xi continuous

Maximum likelihood estimates (j indexes the training example, i the feature, k the class; δ(expr) = 1 if expr is true, else 0):

  µ̂_ik = Σ_j X_i^j δ(Y^j = yk) / Σ_j δ(Y^j = yk)

  σ̂²_ik = Σ_j (X_i^j − µ̂_ik)² δ(Y^j = yk) / Σ_j δ(Y^j = yk)

How many parameters must we estimate for Gaussian Naïve Bayes if Y has k possible values and X = <X1, …, Xn>?

What is the form of the decision surface for a Gaussian Naïve Bayes classifier?
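The GNB train/classify steps above can be sketched as follows (illustrative code, not the course's; the names `train_gnb`/`classify_gnb` are my own). Training computes exactly the MLE mean and variance given above; classification takes the argmax of the log posterior.

```python
import math
from collections import defaultdict

def train_gnb(X, y):
    """Estimate class priors pi_k and per-class, per-feature MLE mean/variance."""
    by_class = defaultdict(list)
    for xi, yi in zip(X, y):
        by_class[yi].append(xi)
    model = {}
    for yk, rows in by_class.items():
        n = len(rows)
        d = len(rows[0])
        prior = n / len(X)                                   # pi_k
        mu = [sum(r[i] for r in rows) / n for i in range(d)]  # mu_ik
        var = [sum((r[i] - mu[i]) ** 2 for r in rows) / n for i in range(d)]  # sigma^2_ik
        model[yk] = (prior, mu, var)
    return model

def classify_gnb(model, x):
    """argmax_k  log pi_k + sum_i log N(x_i; mu_ik, sigma_ik)."""
    def score(yk):
        prior, mu, var = model[yk]
        s = math.log(prior)
        for i, xi in enumerate(x):
            s += -0.5 * math.log(2 * math.pi * var[i]) \
                 - (xi - mu[i]) ** 2 / (2 * var[i])
        return s
    return max(model, key=score)

# toy usage: two well-separated 2-feature classes
X = [[1.0, 2.0], [1.2, 1.8], [3.0, 4.0], [3.2, 4.2]]
y = [0, 0, 1, 1]
model = train_gnb(X, y)
print(classify_gnb(model, [1.1, 1.9]))  # prints: 0
```

This also answers the parameter-counting question above: with k classes and n real-valued features, the model stores k·n means, k·n variances, and k class priors (of which only k − 1 are free), i.e., 2kn + k − 1 parameters.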
e.g., what if we assume the attributes have the same variance, independent of Y (σ_ik = σ_i)?

GNB Example: classify a person's cognitive state, based on a brain image
• reading a sentence or viewing a picture?
• reading the word describing a "Tool" or a "Building"?
• answering the question, or getting confused?

[Figure: mean fMRI activations over all training examples for Y = "bottle". Y is the mental state (reading "house" or "bottle"); the Xi are the voxel activities; the plot shows the µ's defining P(Xi | Y = "bottle"), on an activation scale from below average to high.]

[Figure: classification accuracies for the task "is the person viewing a 'tool' or a 'building'?", with statistically significant results at p < 0.05.]

Where is information encoded in the brain?
[Figure: accuracies (in the range 0.7–0.8) of cubical 27-voxel classifiers centered at each statistically significant voxel.]

Naïve Bayes: What you should know
• Designing classifiers based on Bayes rule
• Conditional independence
  – what it is
  – why it's important
• The Naïve Bayes assumption and its consequences
  – which (and how many) parameters must be estimated under different generative models (different forms for P(X|Y)), and why this matters
• How to train Naïve Bayes classifiers
  – MLE and MAP estimates
  – with discrete and/or continuous inputs Xi

Questions to think about:
• Can you use Naïve Bayes for a combination of discrete and real-valued Xi?
• How can we easily model just 2 of the n attributes as dependent?
• What does the decision surface of a Naïve Bayes classifier look like?
• How would you select a subset of the Xi?
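One way to answer the decision-surface question, sketched here following the optional Mitchell reading (assuming Boolean Y and a shared per-feature variance σ_i, independent of the class): the log-odds turns out to be linear in X, so the decision surface is a hyperplane.

```latex
% Log-odds of the two classes, using the naive Bayes factorization:
\ln\frac{P(Y=1\mid X)}{P(Y=0\mid X)}
  = \ln\frac{P(Y=1)}{P(Y=0)}
  + \sum_i \ln\frac{P(X_i\mid Y=1)}{P(X_i\mid Y=0)}
% With P(X_i \mid Y=k) = \mathcal{N}(\mu_{ik}, \sigma_i^2) (same variance
% in both classes), the quadratic terms in X_i cancel and each summand is
\ln\frac{\mathcal{N}(X_i;\mu_{i1},\sigma_i)}{\mathcal{N}(X_i;\mu_{i0},\sigma_i)}
  = \frac{\mu_{i1}-\mu_{i0}}{\sigma_i^2}\,X_i
  + \frac{\mu_{i0}^2-\mu_{i1}^2}{2\sigma_i^2}
% So the log-odds has the form w_0 + \sum_i w_i X_i, and the decision
% surface \{x : w_0 + \sum_i w_i x_i = 0\} is a hyperplane (linear).
```

If the class-conditional variances differ, the quadratic terms no longer cancel and the decision surface becomes quadratic rather than linear.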