MSU CSE 847 - bayes
Course: CSE 847

Bayesian Learning
Rong Jin

Outline
- MAP learning vs. ML learning
- Minimum description length principle
- Bayes optimal classifier
- Bagging

Maximum Likelihood Learning (ML)
- Find the model that best fits the training data by maximizing the log-likelihood of the training data.
- Logistic regression: parameters are found by maximizing the likelihood of the training data.

    p(y | \vec{x}; \theta) = \frac{1}{1 + \exp\left( -y (\vec{x} \cdot \vec{w} + c) \right)}, \qquad \theta = \{ w_1, w_2, \ldots, w_m, c \}

    (\vec{w}^*, c^*) = \arg\max_{\vec{w}, c} \; l(D_{train}) = \arg\max_{\vec{w}, c} \sum_{i=1}^{n} \log \frac{1}{1 + \exp\left( -y_i (\vec{x}_i \cdot \vec{w} + c) \right)}

Maximum A Posteriori Learning (MAP)
- In ML learning, models are determined solely by the training examples.
- Very often we have prior knowledge or preferences about the parameters/models, and ML learning is unable to incorporate them.
- In maximum a posteriori (MAP) learning, knowledge/preferences about the parameters/models are incorporated through a prior \Pr(\theta):

    \theta^* = \arg\max_{\theta} \Pr(\theta | D)

Example: Logistic Regression
- ML learning:

    (\vec{w}^*, c^*) = \arg\max_{\vec{w}, c} \sum_{i=1}^{n} \log \frac{1}{1 + \exp\left( -y_i (\vec{x}_i \cdot \vec{w} + c) \right)}

- Prior knowledge/preference: no feature should dominate over all other features, i.e. prefer small weights.
- Gaussian prior for the parameters:

    \Pr(\vec{w}) \propto \prod_{i=1}^{m} \exp\left( -\frac{w_i^2}{\sigma^2} \right)

Example (cont'd)
- MAP learning for logistic regression:

    (\vec{w}^*, c^*) = \arg\max_{\vec{w}, c} \Pr(D | \vec{w}, c) \Pr(\vec{w}, c)
                     = \arg\max_{\vec{w}, c} \left[ \log \Pr(D | \vec{w}, c) + \log \Pr(\vec{w}, c) \right]
                     = \arg\max_{\vec{w}, c} \left[ \sum_{i=1}^{n} \log \frac{1}{1 + \exp\left( -y_i (\vec{x}_i \cdot \vec{w} + c) \right)} - \frac{1}{\sigma^2} \sum_{i=1}^{m} w_i^2 \right]

- Compare to regularized logistic regression:

    l_{reg}(D_{train}) = \sum_{i=1}^{N} \log \frac{1}{1 + \exp\left( -y_i (c + \vec{x}_i \cdot \vec{w}) \right)} - s \sum_{i=1}^{m} w_i^2

- The Gaussian prior plays the role of the L2 penalty, with regularization constant s = 1/\sigma^2 (a minimal sketch follows below).
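To make the MAP-regularization correspondence concrete, here is a minimal Python sketch (not part of the original slides). It minimizes the negative MAP objective for logistic regression under the Gaussian prior above; the function name neg_map_objective, the synthetic data, and the use of scipy.optimize.minimize are illustrative assumptions, requiring only NumPy and SciPy.

```python
import numpy as np
from scipy.optimize import minimize

def neg_map_objective(params, X, y, sigma2):
    """Negative MAP objective for logistic regression with a Gaussian prior on w.

    Minimizing this is the same as maximizing
        sum_i log 1/(1 + exp(-y_i (x_i . w + c)))  -  (1/sigma2) * sum_j w_j^2,
    i.e. L2-regularized logistic regression with s = 1/sigma2.
    """
    w, c = params[:-1], params[-1]
    margins = y * (X @ w + c)                        # y_i (x_i . w + c)
    log_lik = -np.sum(np.logaddexp(0.0, -margins))   # sum_i log sigmoid(margin_i)
    log_prior = -np.sum(w ** 2) / sigma2             # log Gaussian prior, up to a constant
    return -(log_lik + log_prior)

# Tiny synthetic data set with labels in {-1, +1} (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = np.where(X[:, 0] + 0.1 * rng.normal(size=50) > 0, 1.0, -1.0)

# Small sigma2 = strong prior = heavy shrinkage; large sigma2 approaches the ML fit.
for sigma2 in (0.01, 100.0):
    res = minimize(neg_map_objective, x0=np.zeros(4), args=(X, y, sigma2))
    print(f"sigma2={sigma2:>6}: w={np.round(res.x[:-1], 3)}, c={res.x[-1]:.3f}")
```

A small sigma^2 (strong prior) shrinks the weights toward zero, while a large sigma^2 recovers essentially the ML solution, matching the role of the regularization constant s = 1/sigma^2 in the formulas above.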
Minimum Description Length Principle
- Occam's razor: prefer the simplest hypothesis.
- Simplest hypothesis = the hypothesis with the shortest description length.
- Minimum description length: prefer the shortest hypothesis, where L_C(x) is the description length of message x under coding scheme C:

    h_{MDL} = \arg\min_{h \in H} \; L_{C_1}(h) + L_{C_2}(D | h)

  Here L_{C_1}(h) is the number of bits to encode hypothesis h (the complexity of the model) and L_{C_2}(D | h) is the number of bits to encode the data D given h (the number of mistakes).
- Coding view: a sender must transmit D to a receiver. Send only D? Send only h? Send h plus the exceptions D given h?

Example: Decision Tree
- H = decision trees, D = training data labels.
- L_{C_1}(h) is the number of bits to describe tree h.
- L_{C_2}(D | h) is the number of bits to describe D given tree h. Note that L_{C_2}(D | h) = 0 if the examples are classified perfectly by h; we only need to describe the exceptions.
- h_{MDL} trades off tree size against training errors.

MAP vs. MDL
- Fact from information theory: the optimal (shortest expected coding length) code for an event with probability p uses -\log_2 p bits.
- Interpreting MAP learning with the MDL principle:

    h_{MAP} = \arg\max_{h \in H} \Pr(D | h) \Pr(h)
            = \arg\max_{h \in H} \left[ \log_2 \Pr(D | h) + \log_2 \Pr(h) \right]
            = \arg\min_{h \in H} \left[ -\log_2 \Pr(h) - \log_2 \Pr(D | h) \right]

    h_{MDL} = \arg\min_{h \in H} \; L_{C_1}(h) + L_{C_2}(D | h)

  In this reading, -\log_2 \Pr(h) is the description length of h under optimal coding and -\log_2 \Pr(D | h) is the description length of the exceptions under optimal coding.

Problems with Maximum Approaches
- Consider three possible hypotheses:

    \Pr(h_1 | D) = 0.4, \quad \Pr(h_2 | D) = 0.3, \quad \Pr(h_3 | D) = 0.3

- Maximum approaches will pick h_1.
- Given a new instance x with h_1(x) = +, h_2(x) = -, h_3(x) = -, maximum approaches will output +.
- However, is + the most probable result?

Bayes Optimal Classifier (Bayesian Average)
- Bayes optimal classification:

    c^*(x) = \arg\max_{c} \sum_{h \in H} \Pr(h | D) \Pr(c | h, x)

- Example:

    \Pr(h_1 | D) = 0.4, \quad \Pr(+ | h_1, x) = 1, \quad \Pr(- | h_1, x) = 0
    \Pr(h_2 | D) = 0.3, \quad \Pr(+ | h_2, x) = 0, \quad \Pr(- | h_2, x) = 1
    \Pr(h_3 | D) = 0.3, \quad \Pr(+ | h_3, x) = 0, \quad \Pr(- | h_3, x) = 1

    \sum_{h} \Pr(h | D) \Pr(+ | h, x) = 0.4, \qquad \sum_{h} \Pr(h | D) \Pr(- | h, x) = 0.6

- The most probable class is -.

When Do We Need the Bayesian Average?
- Use the Bayesian average when:
  - the posterior has multiple modes;
  - the optimal mode is flat.
- Do NOT use the Bayesian average when Pr(h | D) cannot be estimated accurately.

Computational Issues with the Bayes Optimal Classifier
- Bayes optimal classification needs to sum over all possible models/hypotheses h.
- This is expensive or impossible when the model/hypothesis space is large (example: decision trees).
- Solution: sampling!

Gibbs Classifier
- Gibbs algorithm:
  1. Choose one hypothesis at random, according to P(h | D).
  2. Use it to classify the new instance.
- Surprising fact: if the target hypothesis is drawn according to P(h | D), the expected error of the Gibbs classifier is at most twice that of the Bayes optimal classifier.
- Improve by sampling multiple hypotheses from P(h | D) and averaging their classification results (see the sketch below), e.g. via Markov chain Monte Carlo (MCMC) sampling or importance sampling.
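The following Python sketch (not from the slides) reproduces the three-hypothesis example above: it computes the Bayes optimal weighted vote, draws a single hypothesis as a Gibbs classifier would, and shows that averaging many sampled hypotheses approaches the Bayes optimal vote. The variable names and the number of samples are illustrative choices.

```python
import numpy as np

# Posterior over the three hypotheses from the slides: Pr(h | D).
posterior = np.array([0.4, 0.3, 0.3])

# Pr(c | h, x) for the new instance x: rows are hypotheses, columns are classes (+, -).
class_probs = np.array([
    [1.0, 0.0],   # h1 predicts +
    [0.0, 1.0],   # h2 predicts -
    [0.0, 1.0],   # h3 predicts -
])
classes = ["+", "-"]

# Bayes optimal classifier: weight each hypothesis's vote by its posterior.
vote = posterior @ class_probs                                      # [0.4, 0.6]
print("Bayes optimal prediction:", classes[int(np.argmax(vote))])   # "-"

# Gibbs classifier: draw one hypothesis according to Pr(h | D) and use its prediction.
rng = np.random.default_rng(0)
h = rng.choice(len(posterior), p=posterior)
print("Gibbs prediction:", classes[int(np.argmax(class_probs[h]))])

# Averaging the predictions of many sampled hypotheses approaches the Bayes optimal vote.
samples = rng.choice(len(posterior), size=10_000, p=posterior)
avg_vote = class_probs[samples].mean(axis=0)
print("Averaged vote over samples:", avg_vote)                      # close to [0.4, 0.6]
```

A single Gibbs draw predicts + with probability 0.4, whereas the averaged vote converges to the 0.4/0.6 split from the slide, which is why averaging over multiple sampled hypotheses improves on a single draw.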

