Slide 1: Machine Learning (CS6375, Spring 2015)
Bayesian Learning (II)
Instructor: Yang Liu
Slides modified from Dr. Vincent Ng and Tom Mitchell.

Slide 2: Problem Example
• Three variables:
  – Hair = {blond, dark}
  – Height = {tall, short}
  – Country = {G, P}
• Training data: values of (Hair, Height, Country) collected over a population

Slide 3: Learn Joint Probabilities
• Same three variables and training data as above
• From the training data we estimate the joint distribution table over (Hair, Height, Country)
  (table shown on the slide; not reproduced here)

Slide 4: Compute Other Joint or Conditional Distributions
• From the joint distribution table we can compute any other joint or conditional distribution

Slide 5: Bayes Classifier Example
• If I observe a new individual, tall with blond hair, what is the most likely country of origin?
• We are interested in comparing $P(C = G \mid B, T)$ and $P(C = P \mid B, T)$

Slide 6: Bayes Classifier
• We want to find the value of Y that is most probable given the observations $X_1, \ldots, X_n$
• Find the y that maximizes $P(y \mid x_1, \ldots, x_n)$
• The maximizer is called the Maximum A Posteriori (MAP) estimator:
  $y_{MAP} = \arg\max_y P(y \mid x_1, \ldots, x_n)$

Slide 7: Bayes Classifier
• By Bayes' rule,
  $P(y \mid x_1, \ldots, x_n) = \dfrac{P(x_1, \ldots, x_n \mid y)\, P(y)}{P(x_1, \ldots, x_n)}$
• The denominator $P(x_1, \ldots, x_n)$ does not depend on y

Slide 8: Bayes Classifier
• So it suffices to maximize the numerator:
  $y_{MAP} = \arg\max_y P(x_1, \ldots, x_n \mid y)\, P(y)$

Slide 9: Bayes Classifier
• Classification: given a new input $(x_1, \ldots, x_n)$, compute the best class
  $y = \arg\max_y P(x_1, \ldots, x_n \mid y)\, P(y)$
• Learning: collect all the observations $(x_1, \ldots, x_n)$ for each class y and estimate $P(x_1, \ldots, x_n \mid y)$ and $P(y)$

Slides 10-11: Classifier Example
• Same three variables and training data as before
• If I observe a new individual, tall with blond hair, what is the most likely country of origin?
  (worked through on the slides using the joint distribution table)
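The full Bayes (MAP) decision rule on these slides can be sketched directly from joint counts. A minimal Python sketch follows; the training tuples below are hypothetical, since the slide's actual joint distribution table is not reproduced in this text.

```python
from collections import Counter

def map_classify(data, observed):
    """MAP rule without the naive assumption: pick the class c maximizing
    P(c | hair, height), which is proportional to the joint count."""
    joint = Counter(data)                        # counts of (hair, height, country)
    countries = sorted({c for _, _, c in data})
    hair, height = observed
    return max(countries, key=lambda c: joint[(hair, height, c)])

# Hypothetical sample of (Hair, Height, Country) observations
data = ([("blond", "tall", "G")] * 4 + [("blond", "tall", "P")] * 1 +
        [("dark", "short", "P")] * 3 + [("dark", "tall", "G")] * 2)

print(map_classify(data, ("blond", "tall")))     # → G (4 of the 5 tall blonds are from G)
```

Note that this full Bayes classifier needs a count (parameter) for every attribute-value combination per class, which is what motivates the parameter-count question and the naive assumption on the following slides.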
Slide 12: Naïve Bayes Assumption
• To make the problem tractable, we often need to make the following conditional independence assumption:
  $P(x_1, x_2, \ldots, x_n \mid y) = P(x_1 \mid y)\, P(x_2 \mid y) \cdots P(x_n \mid y) = \prod_i P(x_i \mid y)$
• This allows us to define the Naïve Bayes classifier:
  $y_{NB} = \arg\max_{y \in C} P(y) \prod_i P(x_i \mid y)$

Slides 13-14: Naïve Bayes Classifier
• Learning: collect all the observations $(x_1, \ldots, x_n)$ for each class y and estimate $P(y)$ and each $P(x_i \mid y)$
• Classification: apply the $y_{NB}$ rule above to a new input
• How many parameters do we need for the two classifiers, Bayes and Naïve Bayes?

Slide 15: Naïve Bayes Implementation
• Small (but important) implementation detail: if n is large, we may be taking the product of a large number of small floating-point values. Underflow is avoided by taking logs.
• Take the max over y of: $\log P(y) + \sum_i \log P(x_i \mid y)$

Slides 16-17: Same Example, the Naïve Bayes Way
• Same three variables and training data as before
• The variables are not independent, so Naïve Bayes is only an approximation
• The values are of course different, but the conclusion remains the same:
  – 0.17 vs. 0.2 for Country = G
  – 0.125 vs. 0.1 for Country = P
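The learning and classification steps above, including the log-space trick for avoiding underflow, can be sketched as follows. This is a minimal sketch: the class name and training tuples are hypothetical, and add-one smoothing (introduced later in the deck) is used for the conditional estimates.

```python
import math
from collections import Counter

class NaiveBayes:
    """Categorical naive Bayes trained in log space:
    y_NB = argmax_y  log P(y) + sum_i log P(x_i | y)."""

    def fit(self, X, y):
        self.class_counts = Counter(y)
        self.n = len(y)
        self.counts = Counter()        # (class, feature index, value) -> count
        self.values = {}               # feature index -> set of observed values
        for xs, label in zip(X, y):
            for i, v in enumerate(xs):
                self.counts[(label, i, v)] += 1
                self.values.setdefault(i, set()).add(v)
        return self

    def predict(self, xs):
        def log_posterior(c):
            # summing logs instead of multiplying probabilities avoids underflow
            lp = math.log(self.class_counts[c] / self.n)
            for i, v in enumerate(xs):
                lp += math.log((self.counts[(c, i, v)] + 1) /
                               (self.class_counts[c] + len(self.values[i])))  # add-one smoothing
            return lp
        return max(self.class_counts, key=log_posterior)

# Hypothetical (Hair, Height) -> Country training tuples
X = [("blond", "tall"), ("blond", "tall"), ("blond", "short"),
     ("dark", "short"), ("dark", "short"), ("dark", "tall")]
y = ["G", "G", "G", "P", "P", "P"]
nb = NaiveBayes().fit(X, y)
print(nb.predict(("blond", "tall")))   # → G
```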
Slide 18: Naïve Bayes Classifier
• Yet another classifier. When to use?
  – A moderate or large training set is available
  – The attributes that describe instances are conditionally independent given the class
• Successful applications:
  – Diagnosis
  – Classifying text documents

Slide 19: Naïve Bayes: Subtleties
• The conditional independence assumption
  $P(x_1, x_2, \ldots, x_n \mid y) = \prod_i P(x_i \mid y)$
  is often violated, but it works surprisingly well anyway.
• A plausible reason: to make correct predictions, we
  – don't need the probabilities to be estimated correctly
  – only need the posterior of the correct class to be the largest among the class posteriors
• Posteriors are often unrealistically close to 0 or 1

Slide 20: Naïve Bayes: Subtleties
• What if none of the training instances with target value $v_j$ have attribute value $a_i$? Then the estimate $\hat{P}(a_i \mid v_j)$ is 0 and the whole product vanishes.
• Add-one smoothing:
  $\hat{P}(a_i \mid v_j) = \dfrac{n_c + 1}{n + |M|}$
  where $|M|$ is the number of possible values of $a_i$ (with $n$ and $n_c$ as defined on the next slide)

Slide 21: Naïve Bayes: Subtleties
• The general solution is a Bayesian estimate (m-estimate smoothing):
  $\hat{P}(a_i \mid v_j) = \dfrac{n_c + m\,p}{n + m}$
  where:
  – $n$ is the number of training examples for which $v = v_j$
  – $n_c$ is the number of examples for which $v = v_j$ and $a = a_i$
  – $p$ is the prior estimate for $P(a_i \mid v_j)$
  – $m$ is the weight given to the prior (i.e., the number of "virtual" examples)

Slide 22: Naïve Bayes in Text Classification
• Classes can be:
  – topics (politics, business, entertainment, sports, etc.)
  – spam vs. non-spam email
  – positive vs. negative opinion
  – many others
• Naïve Bayes is among the most effective algorithms for this task
• What attributes shall we use to represent text documents?

Slide 23: Text Classification
• Represent each document by its vector of words
• Make the Naïve Bayes conditional independence assumption, plus one more: position in the document doesn't matter
• This gives the bag-of-words model, i.e., the multinomial Naïve Bayes classifier:
  $P(doc \mid v_j) = \prod_{i=1}^{\mathrm{len}(doc)} P(a_i = w_k \mid v_j)$
  (a multinomial distribution)

Slide 24: Multinomial Naïve Bayes: Learning
• From the training corpus, extract the Vocabulary $V$
• Calculate $P(c_j)$ for each class
• Calculate $P(w_k \mid c_j)$ from word counts, with $\alpha = 1$ add-one smoothing:
  $\hat{P}(w_k \mid c_j) = \dfrac{\mathrm{count}(w_k, c_j) + \alpha}{\left(\sum_{w \in V} \mathrm{count}(w, c_j)\right) + \alpha\,|V|}$

Slide 25: Multinomial Naïve Bayes: Testing
• Return $c_{NB}$, where
  $c_{NB} = \arg\max_{c_j \in C} P(c_j) \prod_{i \in \mathrm{positions}} P(w_i \mid c_j)$
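The multinomial learning and testing steps above can be sketched as follows. The toy documents and class names are hypothetical; the smoothing is the add-one (α = 1) scheme named on the slide, and classification sums logs to avoid underflow.

```python
import math
from collections import Counter

def train_multinomial_nb(docs):
    """docs: list of (word_list, class). Returns priors, smoothed likelihoods, vocab."""
    vocab = {w for words, _ in docs for w in words}
    class_counts = Counter(c for _, c in docs)
    priors = {c: n / len(docs) for c, n in class_counts.items()}
    word_counts = {c: Counter() for c in class_counts}
    for words, c in docs:
        word_counts[c].update(words)
    likelihood = {}
    for c in class_counts:
        total = sum(word_counts[c].values())
        for w in vocab:
            # add-one (alpha = 1) smoothing over the vocabulary
            likelihood[(w, c)] = (word_counts[c][w] + 1) / (total + len(vocab))
    return priors, likelihood, vocab

def classify(words, priors, likelihood, vocab):
    # sum of log probabilities instead of their product, to avoid underflow
    return max(priors, key=lambda c: math.log(priors[c]) +
               sum(math.log(likelihood[(w, c)]) for w in words if w in vocab))

# Hypothetical toy corpus
docs = [(["win", "cash", "now"], "spam"), (["cash", "prize"], "spam"),
        (["meeting", "notes"], "ham")]
priors, likelihood, vocab = train_multinomial_nb(docs)
print(classify(["cash", "now"], priors, likelihood, vocab))   # → spam
```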
Slide 26: Generative vs. Discriminative Models
• Given training examples $(x_1, y_1), \ldots, (x_n, y_n)$:
• Discriminative models:
  – Select a hypothesis space H to consider
  – Find the h from H with the lowest training error
  – Argument: low training error leads to low prediction error
  – Examples: decision trees, perceptrons, SVMs
• Generative models:
  – Select a set of distributions to consider for modeling P(X, Y)
  – Find the distribution that best matches P(X, Y) on the training data
  – Argument: if the match is close enough, we can use the Bayes decision rule
  – Examples: Naïve Bayes, HMMs

Slide 27: Generative Model for Multinomial Naïve Bayes
(figure shown on the slide; not reproduced here)

Slide 28: Text Classification Example
(worked example from a slide by Dan Jurafsky; not reproduced here)

Slide 29: So Far
• Bayes classifier and Naïve Bayes classifier
• Applications
• Next: Bayes' rule in choosing a hypothesis

Slide 30: Hypothesis Selection: An Example
• I have three identical boxes labeled H1, H2, and H3.
  – Into H1 I place 1 black bead and 3 white beads.
  – Into H2 I place 2 black beads and 2 white beads.
  – Into H3 I place 4 black beads and no white beads.
• I draw a box at random. I remove a bead at random.
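The slide's example stops before stating what is observed. Assuming, purely for illustration, that the removed bead turns out to be black (an assumption, not part of the slide text), Bayes' rule selects among the hypotheses H1, H2, H3 as follows:

```python
from fractions import Fraction

# Likelihood of drawing a black bead from each box (from the slide's bead counts)
likelihood = {"H1": Fraction(1, 4), "H2": Fraction(2, 4), "H3": Fraction(4, 4)}
prior = {h: Fraction(1, 3) for h in likelihood}     # box chosen uniformly at random

# Hypothetical observation: the removed bead is black.  Bayes' rule:
# P(H | black) = P(black | H) P(H) / sum_H' P(black | H') P(H')
evidence = sum(prior[h] * likelihood[h] for h in likelihood)
posterior = {h: prior[h] * likelihood[h] / evidence for h in likelihood}

for h, p in posterior.items():
    print(h, p)       # posteriors are 1/7, 2/7, 4/7; H3 is the MAP hypothesis
```

Using exact fractions makes it easy to check that the three posteriors sum to 1.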