Bayesian Learning
Rong Jin

Slide index: Maximum Likelihood Learning (ML); Maximum A Posteriori Learning (MAP); Example: Logistic Regression; Minimum Description Length Principle; Example: Decision Tree; MAP vs. MDL; Problems with Maximum Approaches; Bayes Optimal Classifier (Bayesian Average); When Do We Need Bayesian Average?; Computational Issues with Bayes Optimal Classifier; Gibbs Classifier; Bagging Classifiers; Bootstrap Sampling; Bagging Algorithm; Bagging Bayesian Average; Empirical Study of Bagging; Bias-Variance Tradeoff

Outline
- MAP learning vs. ML learning
- Minimum description length principle
- Bayes optimal classifier
- Bagging

Maximum Likelihood Learning (ML)
- Find the model that best explains the data by maximizing the log-likelihood of the training data.
- Logistic regression models the conditional probability
  $p(y \mid \vec{x}; \theta) = \frac{1}{1 + \exp\left(-y\,(\vec{x} \cdot \vec{w} + c)\right)}$, with parameters $\theta = \{w_1, w_2, \ldots, w_m, c\}$.
- Its parameters are found by maximizing the likelihood of the training data:
  $(\vec{w}^*, c^*) = \arg\max_{\vec{w}, c}\, l(D_{\text{train}}) = \arg\max_{\vec{w}, c} \sum_{i=1}^{n} \log \frac{1}{1 + \exp\left(-y_i\,(\vec{x}_i \cdot \vec{w} + c)\right)}$

Maximum A Posteriori Learning (MAP)
- In ML learning, models are determined solely by the training examples.
- Very often we have prior knowledge or preferences about the parameters/models, but ML learning is unable to incorporate them.
- Maximum a posteriori (MAP) learning incorporates knowledge/preferences about the parameters/models through a prior $\Pr(\theta)$:
  $\theta^* = \arg\max_{\theta} \Pr(\theta \mid D)$

Example: Logistic Regression
- ML learning:
  $(\vec{w}^*, c^*) = \arg\max_{\vec{w}, c} \sum_{i=1}^{n} \log \frac{1}{1 + \exp\left(-y_i\,(\vec{x}_i \cdot \vec{w} + c)\right)}$
- Prior knowledge/preference: no feature should dominate over all other features, i.e. prefer small weights.
- Gaussian prior for the parameters/models:
  $\Pr(\vec{w}) \propto \prod_{i=1}^{m} \exp\left(-\frac{w_i^2}{\sigma^2}\right)$

Example (cont'd)
- MAP learning for logistic regression:
  $(\vec{w}^*, c^*) = \arg\max_{\vec{w}, c} \Pr(D \mid \vec{w}, c)\,\Pr(\vec{w}, c)$
  $\quad = \arg\max_{\vec{w}, c} \left\{ \log \Pr(D \mid \vec{w}, c) + \log \Pr(\vec{w}, c) \right\}$
  $\quad = \arg\max_{\vec{w}, c} \sum_{i=1}^{n} \log \frac{1}{1 + \exp\left(-y_i\,(\vec{x}_i \cdot \vec{w} + c)\right)} - \frac{1}{\sigma^2}\sum_{i=1}^{m} w_i^2$
- Compare to regularized logistic regression:
  $l_{\text{reg}}(D_{\text{train}}) = \sum_{i=1}^{N} \log \frac{1}{1 + \exp\left(-y_i\,(c + \vec{x}_i \cdot \vec{w})\right)} - s \sum_{i=1}^{m} w_i^2$
- The two objectives coincide when the regularization constant is $s = 1/\sigma^2$: MAP learning with a Gaussian prior is exactly L2-regularized logistic regression (see the code sketch below).
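To make the MAP/regularization correspondence concrete, here is a minimal sketch, not from the slides, that fits the MAP objective above by gradient ascent. The function name, the synthetic data, and the default values of sigma and the learning rate are illustrative assumptions.

```python
import numpy as np

def map_logistic_regression(X, y, sigma=1.0, lr=0.1, n_iter=2000):
    """MAP estimate of logistic regression with a Gaussian prior on the weights.

    Maximizes  sum_i log 1/(1 + exp(-y_i (x_i.w + c)))  -  (1/sigma^2) sum_j w_j^2,
    i.e. L2-regularized logistic regression with s = 1/sigma^2.
    Labels y are assumed to be in {-1, +1}.
    """
    n, m = X.shape
    w, c = np.zeros(m), 0.0
    for _ in range(n_iter):
        margin = y * (X @ w + c)                     # y_i (x_i . w + c)
        p = 1.0 / (1.0 + np.exp(margin))             # 1 - sigmoid(margin_i)
        grad_w = (p * y) @ X - (2.0 / sigma**2) * w  # gradient of log-likelihood + log-prior
        grad_c = np.sum(p * y)
        w += lr * grad_w / n
        c += lr * grad_c / n
    return w, c

# Illustrative usage on synthetic data (assumed, not part of the slides).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.sign(X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.normal(size=200))
w_map, c_map = map_logistic_regression(X, y, sigma=1.0)
print(w_map, c_map)   # weights are pulled toward zero by the prior
```

A smaller sigma (a tighter prior) plays the same role as a larger regularization constant s: it shrinks the weights more aggressively.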
Minimum Description Length Principle
- Occam's razor: prefer the simplest hypothesis, where "simplest" means the hypothesis with the shortest description length.
- Let $L_C(x)$ be the description length of message $x$ under coding scheme $C$. The minimum description length (MDL) principle prefers
  $h_{MDL} = \arg\min_{h \in H}\, L_{C_1}(h) + L_{C_2}(D \mid h)$
  where $L_{C_1}(h)$ is the number of bits to encode hypothesis $h$ (the complexity of the model) and $L_{C_2}(D \mid h)$ is the number of bits to encode the data $D$ given $h$ (the mistakes).
- Coding view: a sender must transmit the labels $D$ to a receiver. Send only $D$? Send only $h$? Or send $h$ plus the exceptions $D \mid h$?

Example: Decision Tree
- $H$ = decision trees, $D$ = training data labels.
- $L_{C_1}(h)$ is the number of bits to describe tree $h$.
- $L_{C_2}(D \mid h)$ is the number of bits to describe $D$ given tree $h$. Note that $L_{C_2}(D \mid h) = 0$ if the examples are classified perfectly by $h$; otherwise we only need to describe the exceptions.
- $h_{MDL}$ therefore trades off tree size against training errors.

MAP vs. MDL
- MAP learning:
  $h_{MAP} = \arg\max_{h \in H} \Pr(D \mid h)\,\Pr(h) = \arg\max_{h \in H} \left\{ \log_2 \Pr(D \mid h) + \log_2 \Pr(h) \right\} = \arg\min_{h \in H} \left\{ -\log_2 \Pr(h) - \log_2 \Pr(D \mid h) \right\}$
- Fact from information theory: the optimal (shortest expected coding length) code for an event with probability $p$ uses $-\log_2 p$ bits.
- Interpreting MAP with the MDL principle, $h_{MDL} = \arg\min_{h \in H} L_{C_1}(h) + L_{C_2}(D \mid h)$: $-\log_2 \Pr(h)$ is the description length of $h$ under the optimal coding, and $-\log_2 \Pr(D \mid h)$ is the description length of the exceptions under the optimal coding.

Problems with Maximum Approaches
- Consider three possible hypotheses with posterior probabilities
  $\Pr(h_1 \mid D) = 0.4$, $\Pr(h_2 \mid D) = 0.3$, $\Pr(h_3 \mid D) = 0.3$.
- A maximum approach will pick $h_1$.
- Given a new instance $x$ with $h_1(x) = +$, $h_2(x) = -$, $h_3(x) = -$, the maximum approach will output $+$.
- However, is this the most probable result?

Bayes Optimal Classifier (Bayesian Average)
- Bayes optimal classification:
  $c^*(x) = \arg\max_{c} \sum_{h \in H} \Pr(h \mid D)\,\Pr(c \mid h, x)$
- Example (a numeric sketch of this computation appears at the end of the section):
  $\Pr(h_1 \mid D) = 0.4$, $\Pr(+ \mid h_1, x) = 1$, $\Pr(- \mid h_1, x) = 0$
  $\Pr(h_2 \mid D) = 0.3$, $\Pr(+ \mid h_2, x) = 0$, $\Pr(- \mid h_2, x) = 1$
  $\Pr(h_3 \mid D) = 0.3$, $\Pr(+ \mid h_3, x) = 0$, $\Pr(- \mid h_3, x) = 1$
  $\sum_h \Pr(h \mid D)\Pr(+ \mid h, x) = 0.4$, $\quad \sum_h \Pr(h \mid D)\Pr(- \mid h, x) = 0.6$
- The most probable class is therefore $-$, not the $+$ chosen by the maximum approach.

When Do We Need Bayesian Average?
- We need the Bayesian average when the posterior $\Pr(h \mid D)$ has multiple modes, or when the optimal mode is flat.
- We should NOT use the Bayesian average when $\Pr(h \mid D)$ cannot be estimated accurately.

Computational Issues with Bayes Optimal Classifier
- Bayes optimal classification requires summing over all possible models/hypotheses $h$.
- This is expensive or impossible when the model/hypothesis space is large (example: decision trees).
- Solution: sampling!

Gibbs Classifier
- Gibbs algorithm:
  1. Choose one hypothesis at random, according to $\Pr(h \mid D)$.
  2. Use this hypothesis to classify the new instance.
- Surprising fact: assuming the target concept is itself drawn from the prior, the expected error of the Gibbs classifier is at most twice the expected error of the Bayes optimal classifier.
- The result can be improved by sampling multiple hypotheses from $\Pr(h \mid D)$ and averaging their classification results, e.g. with Markov chain Monte Carlo (MCMC) sampling or importance sampling.
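The weighted vote in the Bayes optimal classifier can be reproduced in a few lines. The following is a minimal sketch, not from the slides, that encodes the three-hypothesis example above; the names posterior and prediction are illustrative.

```python
# Bayes optimal classification on the toy three-hypothesis example:
# each hypothesis votes for its predicted class with weight Pr(h|D).
posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}    # Pr(h|D)
prediction = {"h1": "+", "h2": "-", "h3": "-"}   # each hypothesis's label for x

def bayes_optimal_class(posterior, prediction):
    votes = {}
    for h, p_h in posterior.items():
        c = prediction[h]
        votes[c] = votes.get(c, 0.0) + p_h       # sum_h Pr(h|D) * Pr(c|h,x)
    return max(votes, key=votes.get), votes

best, votes = bayes_optimal_class(posterior, prediction)
print(votes)   # {'+': 0.4, '-': 0.6}
print(best)    # '-', even though the single most probable hypothesis predicts '+'
```

Here each $\Pr(c \mid h, x)$ is 0 or 1; with probabilistic hypotheses each vote would additionally be weighted by $\Pr(c \mid h, x)$.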
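When the hypothesis space is too large to sum over, the slides propose sampling. Below is a minimal sketch under the assumption that we can draw hypotheses from Pr(h|D); the draws are simulated here with the toy posterior above, whereas in practice they would come from MCMC or importance sampling, and the helper names are hypothetical.

```python
import random

# Toy posterior and predictions, reused from the example above (assumed).
posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
prediction = {"h1": "+", "h2": "-", "h3": "-"}

def sample_h():
    """Draw one hypothesis h ~ Pr(h|D)."""
    return random.choices(list(posterior), weights=list(posterior.values()))[0]

def gibbs_classify(x):
    """Gibbs classifier: classify x using a single sampled hypothesis."""
    # x is unused in this toy example because each hypothesis's prediction is fixed.
    return prediction[sample_h()]

def sampled_bayes_average(x, n_samples=1000):
    """Approximate the Bayes optimal classifier by averaging many sampled hypotheses."""
    votes = {}
    for _ in range(n_samples):
        c = prediction[sample_h()]
        votes[c] = votes.get(c, 0) + 1
    return max(votes, key=votes.get)

print(gibbs_classify(x=None))          # '+' with prob. 0.4, '-' with prob. 0.6
print(sampled_bayes_average(x=None))   # '-' with high probability
```

Averaging many draws recovers the Bayes optimal decision, while a single draw (the Gibbs classifier) is cheaper but noisier; this is the improvement from averaging that the slides point to.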