
CS 5751 Machine Learning, Chapter 6: Bayesian Learning

Bayesian Learning
• Bayes theorem
• MAP, ML hypotheses
• MAP learners
• Minimum description length principle
• Bayes optimal classifier
• Naïve Bayes learner
• Bayesian belief networks

Two Roles for Bayesian Methods
Provides practical learning algorithms:
• Naïve Bayes learning
• Bayesian belief network learning
• Combine prior knowledge (prior probabilities) with observed data
• Requires prior probabilities
Provides useful conceptual framework:
• Provides "gold standard" for evaluating other learning algorithms
• Additional insight into Occam's razor

Bayes Theorem
  P(h|D) = P(D|h) P(h) / P(D)
• P(h) = prior probability of hypothesis h
• P(D) = prior probability of training data D
• P(h|D) = probability of h given D
• P(D|h) = probability of D given h

Choosing Hypotheses
Generally we want the most probable hypothesis given the training data, the maximum a posteriori (MAP) hypothesis hMAP:
  hMAP = argmax_{h ∈ H} P(h|D)
       = argmax_{h ∈ H} P(D|h) P(h) / P(D)
       = argmax_{h ∈ H} P(D|h) P(h)
If we assume P(hi) = P(hj) for all i, j, we can simplify further and choose the maximum likelihood (ML) hypothesis:
  hML = argmax_{hi ∈ H} P(D|hi)

Bayes Theorem
Does the patient have cancer or not?
A patient takes a lab test and the result comes back positive. The test returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present. Furthermore, 0.8% of the entire population have this cancer.
  P(cancer) =             P(¬cancer) =
  P(+|cancer) =           P(−|cancer) =
  P(+|¬cancer) =          P(−|¬cancer) =
  P(cancer|+) =
  P(¬cancer|+) =

Some Formulas for Probabilities
• Product rule: probability P(A ∧ B) of a conjunction of two events A and B:
  P(A ∧ B) = P(A|B) P(B) = P(B|A) P(A)
• Sum rule: probability of a disjunction of two events A and B:
  P(A ∨ B) = P(A) + P(B) − P(A ∧ B)
• Theorem of total probability: if events A1, …, An are mutually exclusive with Σ_{i=1..n} P(Ai) = 1, then
  P(B) = Σ_{i=1..n} P(B|Ai) P(Ai)
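As a worked check of the cancer example above (the slide leaves the blanks as an exercise), here is a minimal Python sketch that plugs the stated figures (0.8% prevalence, 98% correct positive rate, 97% correct negative rate) into Bayes theorem and the total-probability formula; the variable names are illustrative only.

    # Prior and conditional probabilities taken directly from the problem statement
    p_cancer = 0.008
    p_not_cancer = 1.0 - p_cancer
    p_pos_given_cancer = 0.98        # correct positive rate when disease is present
    p_pos_given_not_cancer = 1.0 - 0.97   # false positive rate

    # Numerators of Bayes theorem: P(+|h) P(h) for each hypothesis
    joint_cancer = p_pos_given_cancer * p_cancer              # about 0.0078
    joint_not_cancer = p_pos_given_not_cancer * p_not_cancer  # about 0.0298

    # P(+) by the theorem of total probability
    p_pos = joint_cancer + joint_not_cancer

    print(joint_cancer / p_pos)       # P(cancer|+), about 0.21
    print(joint_not_cancer / p_pos)   # P(not cancer|+), about 0.79

So the positive test raises the probability of cancer from 0.8% to roughly 21%, yet ¬cancer remains the MAP hypothesis.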
Brute Force MAP Hypothesis Learner
1. For each hypothesis h in H, calculate the posterior probability
   P(h|D) = P(D|h) P(h) / P(D)
2. Output the hypothesis hMAP with the highest posterior probability
   hMAP = argmax_{h ∈ H} P(h|D)

Relation to Concept Learning
Consider our usual concept learning task:
• instance space X, hypothesis space H, training examples D
• consider the FindS learning algorithm (outputs the most specific hypothesis from the version space VS_{H,D})
What would Bayes rule produce as the MAP hypothesis?
Does FindS output a MAP hypothesis?

Relation to Concept Learning
Assume a fixed set of instances (x1, …, xm), and assume D is the set of classifications D = (c(x1), …, c(xm)).
Choose P(D|h):
• P(D|h) = 1 if h is consistent with D
• P(D|h) = 0 otherwise
Choose P(h) to be the uniform distribution:
• P(h) = 1/|H| for all h in H
Then
  P(h|D) = 1/|VS_{H,D}| if h is consistent with D, and 0 otherwise

Learning a Real-Valued Function
[Figure: noisy training points (x, y), target function f, and ML hypothesis hML, with errors e]
Consider any real-valued target function f.
Training examples are (xi, di), where di is a noisy training value:
• di = f(xi) + ei
• ei is a random variable (noise) drawn independently for each xi according to some Gaussian distribution with mean 0
Then the maximum likelihood hypothesis hML is the one that minimizes the sum of squared errors:
  hML = argmin_{h ∈ H} Σ_{i=1..m} (di − h(xi))²

Learning a Real-Valued Function
  hML = argmax_{h ∈ H} p(D|h)
      = argmax_{h ∈ H} Π_{i=1..m} p(di|h)
      = argmax_{h ∈ H} Π_{i=1..m} (1/√(2πσ²)) exp(−(di − h(xi))² / (2σ²))
Maximize the natural log of this instead:
  hML = argmax_{h ∈ H} Σ_{i=1..m} [ ln(1/√(2πσ²)) − (di − h(xi))² / (2σ²) ]
      = argmax_{h ∈ H} Σ_{i=1..m} −(di − h(xi))² / (2σ²)
      = argmax_{h ∈ H} Σ_{i=1..m} −(di − h(xi))²
      = argmin_{h ∈ H} Σ_{i=1..m} (di − h(xi))²

Minimum Description Length Principle
Occam's razor: prefer the shortest hypothesis.
MDL: prefer the hypothesis h that minimizes
  hMDL = argmin_{h ∈ H} LC1(h) + LC2(D|h)
where LC(x) is the description length of x under encoding C.
Example:
• H = decision trees, D = training data labels
• LC1(h) is the number of bits to describe tree h
• LC2(D|h) is the number of bits to describe D given h
  – Note LC2(D|h) = 0 if the examples are classified perfectly by h; we need only describe the exceptions
• Hence hMDL trades off tree size for training errors

Minimum Description Length Principle
  hMAP = argmax_{h ∈ H} P(D|h) P(h)
       = argmax_{h ∈ H} log2 P(D|h) + log2 P(h)
       = argmin_{h ∈ H} −log2 P(D|h) − log2 P(h)    (1)
Interesting fact from information theory: the optimal (shortest expected length) code for an event with probability p is −log2 p bits.
So interpret (1):
• −log2 P(h) is the length of h under the optimal code
• −log2 P(D|h) is the length of D given h under the optimal code
→ prefer the hypothesis that minimizes length(h) + length(misclassifications)

Bayes Optimal Classifier
Bayes optimal classification:
  argmax_{vj ∈ V} Σ_{hi ∈ H} P(vj|hi) P(hi|D)
Example:
  P(h1|D) = .4,  P(−|h1) = 0,  P(+|h1) = 1
  P(h2|D) = .3,  P(−|h2) = 1,  P(+|h2) = 0
  P(h3|D) = .3,  P(−|h3) = 1,  P(+|h3) = 0
therefore
  Σ_{hi ∈ H} P(+|hi) P(hi|D) = .4
  Σ_{hi ∈ H} P(−|hi) P(hi|D) = .6
and
  argmax_{vj ∈ V} Σ_{hi ∈ H} P(vj|hi) P(hi|D) = −
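As a concrete check of the numbers on this slide, the following is a minimal Python sketch of Bayes optimal classification; the hypothesis names and dictionary layout are illustrative. It weights each hypothesis's vote P(vj|hi) by its posterior P(hi|D) and returns the label with the largest total.

    # Posteriors P(h|D) and class-conditional votes P(v|h) from the slide's example
    posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
    likelihoods = {
        "h1": {"+": 1.0, "-": 0.0},
        "h2": {"+": 0.0, "-": 1.0},
        "h3": {"+": 0.0, "-": 1.0},
    }

    def bayes_optimal(values, posteriors, likelihoods):
        # argmax over v of sum_h P(v|h) P(h|D)
        scores = {v: sum(likelihoods[h][v] * posteriors[h] for h in posteriors)
                  for v in values}
        return max(scores, key=scores.get), scores

    label, scores = bayes_optimal(["+", "-"], posteriors, likelihoods)
    print(scores)   # {'+': 0.4, '-': 0.6}
    print(label)    # '-'

Note that the Bayes optimal label '−' disagrees with the single MAP hypothesis h1, which predicts '+': the optimal classifier pools the weighted votes of all hypotheses rather than trusting the most probable one alone.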
Gibbs Classifier
The Bayes optimal classifier provides the best result, but it can be expensive if there are many hypotheses.
Gibbs algorithm:
1. Choose one hypothesis at random, according to P(h|D)
2. Use this hypothesis to classify the new instance
Surprising fact: assume the target concepts are drawn at random from H according to the priors on H. Then:
  E[errorGibbs] ≤ 2 E[errorBayesOptimal]
Suppose a correct, uniform prior distribution over H. Then:
• Pick any hypothesis from VS, with uniform probability
• Its expected error is no worse than twice that of the Bayes optimal classifier

Naïve Bayes Classifier
Along with decision trees, neural networks, and nearest neighbor, this is one of the most practical learning methods.
When to use:
• Moderate or large training set available
• Attributes that describe instances are conditionally independent given the classification
Successful …
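The conditional-independence assumption above gives the usual decision rule vNB = argmax_v P(v) Π_i P(ai|v). Below is a minimal Python sketch of that rule, assuming categorical attributes with probabilities estimated from raw counts (no smoothing); the helper names and toy data are illustrative, not the course's own code.

    from collections import Counter, defaultdict

    def train(examples):
        # examples: list of (attribute_tuple, label) pairs
        label_counts = Counter(label for _, label in examples)
        value_counts = defaultdict(Counter)   # (attr_index, label) -> Counter over values
        for attrs, label in examples:
            for i, value in enumerate(attrs):
                value_counts[(i, label)][value] += 1
        return label_counts, value_counts

    def classify(attrs, label_counts, value_counts):
        total = sum(label_counts.values())
        best_label, best_score = None, -1.0
        for label, count in label_counts.items():
            score = count / total                                   # estimate of P(v)
            for i, value in enumerate(attrs):
                score *= value_counts[(i, label)][value] / count    # estimate of P(a_i | v)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

    data = [(("rain", "hot"), "yes"), (("rain", "mild"), "yes"), (("sunny", "hot"), "no")]
    model = train(data)
    print(classify(("rain", "hot"), *model))   # prints 'yes'

In practice one would add Laplace or m-estimate smoothing so that an attribute value unseen for some class does not zero out the whole product.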

