Bayesian Learning
Rong Jin

Slide index: Maximum Likelihood Learning (ML); Maximum A Posteriori Learning (MAP); Example: Logistic Regression; Minimum Description Length Principle; Example: Decision Tree; MAP vs. MDL; Problems with Maximum Approaches; Bayes Optimal Classifier (Bayesian Average); When Do We Need Bayesian Average?; Computational Issues with Bayes Optimal Classifier; Gibbs Classifier; Bagging Classifiers; Bootstrap Sampling; Bagging Algorithm; Bagging Bayesian Average; Empirical Study of Bagging; Bias-Variance Tradeoff

Outline
- MAP learning vs. ML learning
- Minimum description length principle
- Bayes optimal classifier
- Bagging

Maximum Likelihood Learning (ML)
- Find the model that best explains the data by maximizing the log-likelihood of the training data.
- Logistic regression models the conditional probability
  $p(y \mid \vec{x}; \theta) = \frac{1}{1 + \exp\left(-y\,(\vec{x} \cdot \vec{w} + c)\right)}$, with parameters $\theta = \{w_1, w_2, \ldots, w_m, c\}$.
- Its parameters are found by maximizing the likelihood of the training data:
  $(\vec{w}^*, c^*) = \arg\max_{\vec{w}, c}\, l(D_{\text{train}}) = \arg\max_{\vec{w}, c} \sum_{i=1}^{n} \log \frac{1}{1 + \exp\left(-y_i\,(\vec{x}_i \cdot \vec{w} + c)\right)}$

Maximum A Posteriori Learning (MAP)
- In ML learning, models are determined solely by the training examples.
- Very often we have prior knowledge or preferences about the parameters/models, but ML learning is unable to incorporate them.
- Maximum a posteriori (MAP) learning incorporates knowledge/preferences about the parameters/models through a prior $\Pr(\theta)$:
  $\theta^* = \arg\max_{\theta} \Pr(\theta \mid D)$

Example: Logistic Regression
- ML learning:
  $(\vec{w}^*, c^*) = \arg\max_{\vec{w}, c} \sum_{i=1}^{n} \log \frac{1}{1 + \exp\left(-y_i\,(\vec{x}_i \cdot \vec{w} + c)\right)}$
- Prior knowledge/preference: no feature should dominate over all other features, i.e. prefer small weights.
- Gaussian prior for the parameters/models:
  $\Pr(\vec{w}) \propto \prod_{i=1}^{m} \exp\left(-\frac{w_i^2}{\sigma^2}\right)$

Example (cont'd)
- MAP learning for logistic regression:
  $(\vec{w}^*, c^*) = \arg\max_{\vec{w}, c} \Pr(D \mid \vec{w}, c)\,\Pr(\vec{w}, c)$
  $\quad = \arg\max_{\vec{w}, c} \left\{ \log \Pr(D \mid \vec{w}, c) + \log \Pr(\vec{w}, c) \right\}$
  $\quad = \arg\max_{\vec{w}, c} \sum_{i=1}^{n} \log \frac{1}{1 + \exp\left(-y_i\,(\vec{x}_i \cdot \vec{w} + c)\right)} - \frac{1}{\sigma^2}\sum_{i=1}^{m} w_i^2$
- Compare to regularized logistic regression:
  $l_{\text{reg}}(D_{\text{train}}) = \sum_{i=1}^{N} \log \frac{1}{1 + \exp\left(-y_i\,(c + \vec{x}_i \cdot \vec{w})\right)} - s \sum_{i=1}^{m} w_i^2$
- The two objectives coincide when the regularization constant is $s = 1/\sigma^2$: MAP learning with a Gaussian prior is exactly L2-regularized logistic regression (see the code sketch below).
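To make the MAP/regularization correspondence concrete, here is a minimal sketch, not from the slides, that fits the MAP objective above by gradient ascent. The function name, the synthetic data, and the default values of sigma and the learning rate are illustrative assumptions.

```python
import numpy as np

def map_logistic_regression(X, y, sigma=1.0, lr=0.1, n_iter=2000):
    """MAP estimate of logistic regression with a Gaussian prior on the weights.

    Maximizes  sum_i log 1/(1 + exp(-y_i (x_i.w + c)))  -  (1/sigma^2) sum_j w_j^2,
    i.e. L2-regularized logistic regression with s = 1/sigma^2.
    Labels y are assumed to be in {-1, +1}.
    """
    n, m = X.shape
    w, c = np.zeros(m), 0.0
    for _ in range(n_iter):
        margin = y * (X @ w + c)                     # y_i (x_i . w + c)
        p = 1.0 / (1.0 + np.exp(margin))             # 1 - sigmoid(margin_i)
        grad_w = (p * y) @ X - (2.0 / sigma**2) * w  # gradient of log-likelihood + log-prior
        grad_c = np.sum(p * y)
        w += lr * grad_w / n
        c += lr * grad_c / n
    return w, c

# Illustrative usage on synthetic data (assumed, not part of the slides).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.sign(X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.normal(size=200))
w_map, c_map = map_logistic_regression(X, y, sigma=1.0)
print(w_map, c_map)   # weights are pulled toward zero by the prior
```

A smaller sigma (a tighter prior) plays the same role as a larger regularization constant s: it shrinks the weights more aggressively.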
Minimum Description Length Principle
- Occam's razor: prefer the simplest hypothesis, where "simplest" means the hypothesis with the shortest description length.
- Let $L_C(x)$ be the description length of message $x$ under coding scheme $C$. The minimum description length (MDL) principle prefers
  $h_{MDL} = \arg\min_{h \in H}\, L_{C_1}(h) + L_{C_2}(D \mid h)$
  where $L_{C_1}(h)$ is the number of bits to encode hypothesis $h$ (the complexity of the model) and $L_{C_2}(D \mid h)$ is the number of bits to encode the data $D$ given $h$ (the mistakes).
- Coding view: a sender must transmit the labels $D$ to a receiver. Send only $D$? Send only $h$? Or send $h$ plus the exceptions $D \mid h$?

Example: Decision Tree
- $H$ = decision trees, $D$ = training data labels.
- $L_{C_1}(h)$ is the number of bits to describe tree $h$.
- $L_{C_2}(D \mid h)$ is the number of bits to describe $D$ given tree $h$. Note that $L_{C_2}(D \mid h) = 0$ if the examples are classified perfectly by $h$; otherwise we only need to describe the exceptions.
- $h_{MDL}$ therefore trades off tree size against training errors.

MAP vs. MDL
- MAP learning:
  $h_{MAP} = \arg\max_{h \in H} \Pr(D \mid h)\,\Pr(h) = \arg\max_{h \in H} \left\{ \log_2 \Pr(D \mid h) + \log_2 \Pr(h) \right\} = \arg\min_{h \in H} \left\{ -\log_2 \Pr(h) - \log_2 \Pr(D \mid h) \right\}$
- Fact from information theory: the optimal (shortest expected coding length) code for an event with probability $p$ uses $-\log_2 p$ bits.
- Interpreting MAP with the MDL principle, $h_{MDL} = \arg\min_{h \in H} L_{C_1}(h) + L_{C_2}(D \mid h)$: $-\log_2 \Pr(h)$ is the description length of $h$ under the optimal coding, and $-\log_2 \Pr(D \mid h)$ is the description length of the exceptions under the optimal coding.

Problems with Maximum Approaches
- Consider three possible hypotheses with posterior probabilities
  $\Pr(h_1 \mid D) = 0.4$, $\Pr(h_2 \mid D) = 0.3$, $\Pr(h_3 \mid D) = 0.3$.
- A maximum approach will pick $h_1$.
- Given a new instance $x$ with $h_1(x) = +$, $h_2(x) = -$, $h_3(x) = -$, the maximum approach will output $+$.
- However, is this the most probable result?

Bayes Optimal Classifier (Bayesian Average)
- Bayes optimal classification:
  $c^*(x) = \arg\max_{c} \sum_{h \in H} \Pr(h \mid D)\,\Pr(c \mid h, x)$
- Example (a numeric sketch of this computation appears at the end of the section):
  $\Pr(h_1 \mid D) = 0.4$, $\Pr(+ \mid h_1, x) = 1$, $\Pr(- \mid h_1, x) = 0$
  $\Pr(h_2 \mid D) = 0.3$, $\Pr(+ \mid h_2, x) = 0$, $\Pr(- \mid h_2, x) = 1$
  $\Pr(h_3 \mid D) = 0.3$, $\Pr(+ \mid h_3, x) = 0$, $\Pr(- \mid h_3, x) = 1$
  $\sum_h \Pr(h \mid D)\Pr(+ \mid h, x) = 0.4$, $\quad \sum_h \Pr(h \mid D)\Pr(- \mid h, x) = 0.6$
- The most probable class is therefore $-$, not the $+$ chosen by the maximum approach.

When Do We Need Bayesian Average?
- We need the Bayesian average when the posterior $\Pr(h \mid D)$ has multiple modes, or when the optimal mode is flat.
- We should NOT use the Bayesian average when $\Pr(h \mid D)$ cannot be estimated accurately.

Computational Issues with Bayes Optimal Classifier
- Bayes optimal classification requires summing over all possible models/hypotheses $h$.
- This is expensive or impossible when the model/hypothesis space is large (example: decision trees).
- Solution: sampling!

Gibbs Classifier
- Gibbs algorithm:
  1. Choose one hypothesis at random, according to $\Pr(h \mid D)$.
  2. Use this hypothesis to classify the new instance.
- Surprising fact: assuming the target concept is itself drawn from the prior, the expected error of the Gibbs classifier is at most twice the expected error of the Bayes optimal classifier.
- The result can be improved by sampling multiple hypotheses from $\Pr(h \mid D)$ and averaging their classification results, e.g. with Markov chain Monte Carlo (MCMC) sampling or importance sampling.
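The weighted vote in the Bayes optimal classifier can be reproduced in a few lines. The following is a minimal sketch, not from the slides, that encodes the three-hypothesis example above; the names posterior and prediction are illustrative.

```python
# Bayes optimal classification on the toy three-hypothesis example:
# each hypothesis votes for its predicted class with weight Pr(h|D).
posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}    # Pr(h|D)
prediction = {"h1": "+", "h2": "-", "h3": "-"}   # each hypothesis's label for x

def bayes_optimal_class(posterior, prediction):
    votes = {}
    for h, p_h in posterior.items():
        c = prediction[h]
        votes[c] = votes.get(c, 0.0) + p_h       # sum_h Pr(h|D) * Pr(c|h,x)
    return max(votes, key=votes.get), votes

best, votes = bayes_optimal_class(posterior, prediction)
print(votes)   # {'+': 0.4, '-': 0.6}
print(best)    # '-', even though the single most probable hypothesis predicts '+'
```

Here each $\Pr(c \mid h, x)$ is 0 or 1; with probabilistic hypotheses each vote would additionally be weighted by $\Pr(c \mid h, x)$.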
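When the hypothesis space is too large to sum over, the slides propose sampling. Below is a minimal sketch under the assumption that we can draw hypotheses from Pr(h|D); the draws are simulated here with the toy posterior above, whereas in practice they would come from MCMC or importance sampling, and the helper names are hypothetical.

```python
import random

# Toy posterior and predictions, reused from the example above (assumed).
posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
prediction = {"h1": "+", "h2": "-", "h3": "-"}

def sample_h():
    """Draw one hypothesis h ~ Pr(h|D)."""
    return random.choices(list(posterior), weights=list(posterior.values()))[0]

def gibbs_classify(x):
    """Gibbs classifier: classify x using a single sampled hypothesis."""
    # x is unused in this toy example because each hypothesis's prediction is fixed.
    return prediction[sample_h()]

def sampled_bayes_average(x, n_samples=1000):
    """Approximate the Bayes optimal classifier by averaging many sampled hypotheses."""
    votes = {}
    for _ in range(n_samples):
        c = prediction[sample_h()]
        votes[c] = votes.get(c, 0) + 1
    return max(votes, key=votes.get)

print(gibbs_classify(x=None))          # '+' with prob. 0.4, '-' with prob. 0.6
print(sampled_bayes_average(x=None))   # '-' with high probability
```

Averaging many draws recovers the Bayes optimal decision, while a single draw (the Gibbs classifier) is cheaper but noisier; this is the improvement from averaging that the slides point to.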