
Chapter 6: Bayesian Learning
CS 536: Machine Learning
Littman (Wu, TA)

Administration
• Class cancelled for Monday.
• Interesting talk on Tuesday in Psych 101 by Josh Tenenbaum.
• I bought a new toy, do you like it?

Open Problem Solved #1
How does kNN break up input space?
(LeRoux) Still polygons, but not necessarily convex. Why? Consider 3NN. For a given set of 3 points, there are 6 possible orderings. Each ordering corresponds to a convex polygonal region formed by intersecting the region closest to the first point with the region closest to the second point after the first point is removed, intersected with the region closest to the third point given that the first two were removed. The union of the 6 convex polygonal regions is the region corresponding to the three points. The resulting region is polygonally bounded, but might not be convex. However, it is closed. (We're pretty sure.)

Open Problem Solved #2
Where do you get the maximum difference between Gibbs and Bayes Optimal?
We were right that this happens at 1/4 and 3/4 and that the difference is 1/8. However, the claim in the notes concerns the ratio between the two. The Bayes error (if p < 1/2) is p and the Gibbs error is 2p(1-p). The ratio of these is 2(1-p), which is no bigger than 2. (This happens asymptotically close to the p = 0 case.)

Bayesian Learning
[Read Ch. 6, except 6.3]
[Suggested exercises: 6.1, 6.2, 6.6]
• Bayes theorem
• MAP, ML hypotheses
• MAP learners
• Minimum description length principle
• Bayes optimal classifier
• Naive Bayes learner (if time)
• Example: learning over text data
• Bayesian belief networks
• Expectation Maximization algorithm

Roles for Bayesian Methods
Provides practical learning algorithms:
• Naive Bayes learning
• Bayesian belief network learning
• Combine prior knowledge (prior probabilities) with observed data
• Requires prior probabilities
Provides useful conceptual framework:
• Provides "gold standard" for evaluating other learning algorithms
• Additional insight into Occam's razor

Bayes Theorem
P(h|D) = P(D|h) P(h) / P(D)
• P(h) = prior prob. of hypothesis h
• P(D) = prior prob. of training data D
• P(h|D) = probability of h given D
• P(D|h) = probability of D given h

Choosing Hypotheses
Generally want the most probable hypothesis given the training data, or maximum a posteriori hypothesis h_MAP:
h_MAP = argmax_{h in H} P(h|D)
      = argmax_{h in H} P(D|h) P(h) / P(D)
      = argmax_{h in H} P(D|h) P(h)
If assume P(h_i) = P(h_j), then can further simplify and choose the maximum likelihood (ML) hypothesis:
h_ML = argmax_{h_i in H} P(D|h_i)

Bayes Theorem
Does patient have cancer or not?
– A patient takes a lab test and the result comes back positive. The test returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present. Furthermore, .008 of the entire population have this cancer.
P(cancer) =               P(not cancer) =
P(+|cancer) =             P(-|cancer) =
P(+|not cancer) =         P(-|not cancer) =

Basic Formulas for Probs
• Product rule: probability P(A ^ B) of a conjunction of two events A and B:
  P(A ^ B) = P(A|B) P(B) = P(B|A) P(A)
• Sum rule: probability of a disjunction of two events A and B:
  P(A v B) = P(A) + P(B) - P(A ^ B)
• Theorem of total probability: if the events A_1, ..., A_n are mutually exclusive with Σ_{i=1}^{n} P(A_i) = 1, then
  P(B) = Σ_{i=1}^{n} P(B|A_i) P(A_i)
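
As a sanity check on the lab-test example, here is a small Python sketch (not from the slides; the variable names are illustrative) that fills in the quantities left blank above and applies Bayes' theorem together with the theorem of total probability.

```python
# Sketch: posterior probability of cancer given a positive test,
# using Bayes' theorem and the theorem of total probability.
# The numbers are the ones stated on the slide above.

p_cancer = 0.008                 # P(cancer)
p_not_cancer = 1 - p_cancer      # P(not cancer) = 0.992
p_pos_given_cancer = 0.98        # P(+ | cancer);      P(- | cancer)     = 0.02
p_pos_given_not = 1 - 0.97       # P(+ | not cancer);  P(- | not cancer) = 0.97

# Total probability: P(+) = P(+|cancer) P(cancer) + P(+|not cancer) P(not cancer)
p_pos = p_pos_given_cancer * p_cancer + p_pos_given_not * p_not_cancer

# Bayes' theorem: P(cancer | +) = P(+ | cancer) P(cancer) / P(+)
p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos

print(f"P(+)          = {p_pos:.4f}")               # ~0.0376
print(f"P(cancer | +) = {p_cancer_given_pos:.3f}")  # ~0.21, so h_MAP is "not cancer"
```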
Brute Force MAP Learner
1. For each hypothesis h in H, calculate the posterior probability
   P(h|D) = P(D|h) P(h) / P(D)
2. Output the hypothesis h_MAP with the highest posterior probability
   h_MAP = argmax_{h in H} P(h|D)

Evolution of Posterior Probs
• As data is added, certainty of hypotheses increases.
• Entropy?

Real-Valued Functions
Consider any real-valued target function f.
Training examples <x_i, d_i>, where d_i is a noisy training value:
• d_i = f(x_i) + e_i
• e_i is a random variable (noise) drawn independently for each x_i according to some Gaussian distribution with mean 0.
Then the maximum likelihood hypothesis h_ML is the one that minimizes the sum of squared errors:
h_ML = argmin_{h in H} Σ_{i=1}^{m} (d_i - h(x_i))^2

MAP and Least Squares

MAP/Least Squares Proof
With equal priors P(h) and Gaussian noise of variance σ^2:
h_MAP = argmax_{h in H} P(h|D)
      = argmax_{h in H} P(D|h)
      = argmax_{h in H} Π_{i=1}^{m} (1/sqrt(2πσ^2)) exp(-(1/2)((d_i - h(x_i))/σ)^2)
      = argmax_{h in H} Σ_{i=1}^{m} [ln(1/sqrt(2πσ^2)) - (1/2)((d_i - h(x_i))/σ)^2]
      = argmax_{h in H} Σ_{i=1}^{m} -(1/2)((d_i - h(x_i))/σ)^2
      = argmax_{h in H} Σ_{i=1}^{m} -(d_i - h(x_i))^2
      = argmin_{h in H} Σ_{i=1}^{m} (d_i - h(x_i))^2

Predicting Probabilities
Consider predicting survival probability from patient data.
Training examples <x_i, d_i>, where d_i is 1 or 0.
Want to train a neural network to output a probability given x_i (not a 0 or 1).

Predicting Probabilities
In this case can show
h_ML = argmax_{h in H} Σ_{i=1}^{m} [d_i ln h(x_i) + (1 - d_i) ln(1 - h(x_i))]
Weight update rule for a sigmoid unit:
w_jk ← w_jk + Δw_jk
where Δw_jk = η Σ_{i=1}^{m} (d_i - h(x_i)) x_ijk

MDL Principle
Minimum Description Length Principle
Occam's razor: prefer the shortest hypothesis.
MDL: prefer the hypothesis h that minimizes
h_MDL = argmin_{h in H} [L_C1(h) + L_C2(D|h)]
where L_C(x) is the description length of x under encoding C.

MDL Example
Example: H = decision trees, D = training data labels
• L_C1(h) is # bits to describe tree h
• L_C2(D|h) is # bits to describe D given h
  – Note L_C2(D|h) = 0 if examples classified perfectly by h. Need only describe exceptions.
• Hence h_MDL trades off tree size for training errors.

MDL Justification
h_MAP = argmax_{h in H} P(D|h) P(h)
      = argmax_{h in H} [log2 P(D|h) + log2 P(h)]
      = argmin_{h in H} [-log2 P(D|h) - log2 P(h)]
From information theory: the optimal (shortest expected coding length) code for an event with probability p uses -log2 p bits.
So, prefer the hypothesis that minimizes
   length(h) + length(misclassifications)

Classifying New Instances
So far we've sought the most probable hypothesis given the data D (i.e., h_MAP).
Given a new instance x, what is its most probable classification?
• h_MAP(x) is not necessarily the most probable classification!

Classification Example
Consider:
• Three possible hypotheses:
  P(h1|D) = .4, P(h2|D) = .3, P(h3|D) = .3
• Given new instance x,
  h1(x) = +, h2(x) = -, h3(x) = -
• What's h_MAP(x)?
• What's the most probable classification of x?

Bayes Optimal Classifier
Bayes optimal classification:
argmax_{v_j in V} Σ_{h_i in H} P(v_j|h_i) P(h_i|D)
Example:
• P(h1|D) = .4, P(-|h1) = 0, P(+|h1) = 1
• P(h2|D) = .3, P(-|h2) = 1, P(+|h2) = 0
• P(h3|D) = .3, P(-|h3) = 1, P(+|h3) = 0
Therefore
  Σ_{h_i in H} P(+|h_i) P(h_i|D) = .4
  Σ_{h_i in H} P(-|h_i) P(h_i|D) = .6
so the Bayes optimal classification is -, even though the MAP class is +.

Gibbs Classifier
Bayes optimal classifier provides the best result, but can be expensive if there are many hypotheses.
Gibbs algorithm: choose one hypothesis at random, according to P(h|D), and use it to classify the new instance.
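
To make the gap between h_MAP(x) and the Bayes optimal classification concrete, here is an illustrative Python sketch of the three-hypothesis example above. The posteriors and per-hypothesis predictions come from the slide; the dictionary names are made up for the example.

```python
# Sketch: Bayes optimal classification for the three-hypothesis example above.
# Posteriors P(h|D) and each hypothesis's prediction for x are from the slide.

posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}   # P(h_i | D)
predictions = {"h1": "+", "h2": "-", "h3": "-"}  # h_i(x)

# MAP hypothesis: the single most probable hypothesis, and its prediction.
h_map = max(posteriors, key=posteriors.get)
print("h_MAP =", h_map, "so h_MAP(x) =", predictions[h_map])   # h1, "+"

# Bayes optimal: weight each class vote by the posterior mass of the
# hypotheses predicting it, i.e. argmax_v Σ_i P(v | h_i) P(h_i | D).
votes = {"+": 0.0, "-": 0.0}
for h, p in posteriors.items():
    votes[predictions[h]] += p
print("votes:", votes)                                    # {'+': 0.4, '-': 0.6}
print("Bayes optimal class:", max(votes, key=votes.get))  # "-", not the MAP class
```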

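Following on from the comparison in Open Problem Solved #2, the Gibbs classifier draws a single hypothesis at random according to its posterior P(h|D) and classifies with that hypothesis alone. A minimal sketch of that procedure (the `posteriors`/`predictions` dictionaries and the `gibbs_classify` helper are illustrative, reusing the example values above):

```python
import random

# Sketch of the Gibbs classifier: instead of summing over all hypotheses
# (Bayes optimal), draw ONE hypothesis at random according to P(h|D)
# and classify the new instance with that hypothesis alone.

posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}   # P(h_i | D), as in the example above
predictions = {"h1": "+", "h2": "-", "h3": "-"}  # h_i(x)

def gibbs_classify(posteriors, predictions, rng=random):
    hypotheses = list(posteriors)
    weights = [posteriors[h] for h in hypotheses]
    h = rng.choices(hypotheses, weights=weights, k=1)[0]  # sample h ~ P(h|D)
    return predictions[h]

# Over many draws the label frequencies track the posterior mass on each class;
# this is the setting behind the Gibbs/Bayes-optimal error comparison above
# (expected Gibbs error is at most twice the Bayes optimal error).
labels = [gibbs_classify(posteriors, predictions) for _ in range(10_000)]
print("fraction '+':", labels.count("+") / len(labels))  # ~0.4
```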
