MSU CSE 847 - Bayesian Learning
Bayesian Learning
Rong Jin

Outline
• MAP learning vs. ML learning
• Minimum description length principle
• Bayes optimal classifier
• Bagging

Maximum Likelihood Learning (ML)
• Find the best model by maximizing the log-likelihood of the training data

Maximum A Posteriori Learning (MAP)
• ML learning
  • Models are determined by the training data alone
  • Unable to incorporate prior knowledge/preference about models
• Maximum a posteriori learning (MAP)
  • Knowledge/preference is incorporated through a prior
  • The prior encodes the knowledge/preference

MAP
• Uninformative prior: regularized logistic regression

MAP
• Consider text categorization
  • wi: importance of the i-th word in classification
  • Prior knowledge: the more common the word, the less important it is
  • How to construct a prior according to this prior knowledge?

MAP
• An informative prior for text categorization
  • For the i-th word, use its number of occurrences in the training data

MAP
• Two correlated classification tasks: C1 and C2
  • How to introduce an appropriate prior to capture this prior knowledge?

MAP
• Construct priors that capture the dependence between w1 and w2

Minimum Description Length (MDL) Principle
• Occam's razor: prefer a simple hypothesis
• Simple hypothesis → short description length
• Minimum description length:
  h_MDL = argmin_h [ L_C1(h) + L_C2(D|h) ]
  • L_C(x) is the description length of message x under coding scheme C
  • L_C1(h): bits for encoding hypothesis h
  • L_C2(D|h): bits for encoding the data given h

MDL
• A sender wants to transmit the training data D to a receiver
  • Send only D?
  • Send only h?
  • Send h plus D given h (i.e., the exceptions)?

Example: Decision Tree
• H = decision trees, D = training data labels
• L_C1(h) is the number of bits to describe tree h
• L_C2(D|h) is the number of bits to describe D given tree h
  • L_C2(D|h) = 0 if the examples are classified perfectly by h
  • Only the exceptions need to be described
• h_MDL trades off tree size against training errors

MAP vs. MDL
• MAP learning: h_MAP = argmax_h P(D|h) P(h) = argmin_h [ -log2 P(D|h) - log2 P(h) ]
• MDL learning: h_MDL = argmin_h [ L_C2(D|h) + L_C1(h) ]
• With optimal codes, -log2 P(D|h) and -log2 P(h) are exactly the code lengths, so the two criteria coincide

Problems with Maximum Approaches
• Consider three possible hypotheses:
  Pr(h1|D) = 0.4, Pr(h2|D) = 0.3, Pr(h3|D) = 0.3
• Maximum approaches will pick h1
• Given a new instance x:
  h1(x) = +, h2(x) = -, h3(x) = -
• Maximum approaches will output +
• However, is this the most probable result?

Bayes Optimal Classifier (Bayesian Average)
• Bayes optimal classification:
  c* = argmax_c Σ_{h ∈ H} Pr(c|h, x) Pr(h|D)
• Example:
  Pr(h1|D) = 0.4, Pr(+|h1, x) = 1, Pr(-|h1, x) = 0
  Pr(h2|D) = 0.3, Pr(+|h2, x) = 0, Pr(-|h2, x) = 1
  Pr(h3|D) = 0.3, Pr(+|h3, x) = 0, Pr(-|h3, x) = 1
  Σ_h Pr(h|D) Pr(+|h, x) = 0.4,  Σ_h Pr(h|D) Pr(-|h, x) = 0.6
• The most probable class is -

Computational Issues
• Need to sum over all possible hypotheses
• This is expensive or impossible when the hypothesis space is large
  • E.g., decision trees
• Solution: sampling!
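The averaging rule above is easy to make concrete. Below is a minimal Python sketch of the Bayes optimal classifier (Bayesian average) applied to the three-hypothesis example; the dictionary layout and the function name bayes_optimal are illustrative choices, not something prescribed by the slides.

# Posterior over hypotheses, Pr(h | D), taken from the slide example.
posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}

# Class probabilities each hypothesis assigns to the new instance x, Pr(c | h, x):
# h1 predicts +, while h2 and h3 predict -.
class_probs = {
    "h1": {"+": 1.0, "-": 0.0},
    "h2": {"+": 0.0, "-": 1.0},
    "h3": {"+": 0.0, "-": 1.0},
}

def bayes_optimal(posterior, class_probs, classes=("+", "-")):
    # Score each class by sum_h Pr(c | h, x) * Pr(h | D), then take the argmax.
    scores = {c: sum(posterior[h] * class_probs[h][c] for h in posterior)
              for c in classes}
    return max(scores, key=scores.get), scores

label, scores = bayes_optimal(posterior, class_probs)
print(scores)  # {'+': 0.4, '-': 0.6}
print(label)   # '-' : differs from the MAP pick h1, which outputs '+'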
Gibbs Classifier
• Gibbs algorithm:
  1. Choose one hypothesis at random, according to P(h|D)
  2. Use this hypothesis to classify the new instance
• Surprising fact:
  E[err_Gibbs] ≤ 2 E[err_BayesOptimal]
• Improve on this by sampling multiple hypotheses from P(h|D) and averaging their classification results

Bagging Classifiers
• In general, sampling from P(h|D) is difficult
  • P(h|D) is difficult to compute
  • P(h|D) is impossible to compute for a non-probabilistic classifier such as an SVM
• Bagging classifiers:
  • Realize sampling from P(h|D) by sampling training examples

Bootstrap Sampling
• Bagging = Bootstrap aggregating
• Bootstrap sampling: given a set D containing m training examples
  • Create Di by drawing m examples at random with replacement from D
  • Di is expected to leave out about 37% of the examples in D, since each example is missed with probability (1 - 1/m)^m ≈ e⁻¹ ≈ 0.37

Bagging Algorithm
• Create k bootstrap samples D1, D2, ..., Dk
• Train a distinct classifier hi on each Di
• Classify each new instance by a vote of the classifiers with equal weights

Bagging ≈ Bayesian Average
• Bayesian average: sample hypotheses h1, h2, ..., hk from P(h|D) and average Σi Pr(c|hi, x)
• Bagging: draw bootstrap samples D1, D2, ..., Dk, train h1, h2, ..., hk on them, and average Σi Pr(c|hi, x)
• Bootstrap sampling is almost equivalent to sampling from the posterior P(h|D)

Empirical Study of Bagging
• Bagging decision trees
  • Bootstrap 50 different samples from the original training data
  • Learn a decision tree over each bootstrap sample
  • Predict the class labels of test instances by the majority vote of the 50 decision trees
• Bagged decision trees outperform a single decision tree
• Why does bagging work better than a single classifier?
  • Real-valued case: y = f(x) + ε, ε ~ N(0, σ²)
  • ĥ(x|D) is a predictor learned from the training data D

Bias-Variance Tradeoff
• E[(y - ĥ(x|D))²] = σ² + (f(x) - E_D[ĥ(x|D)])² + E_D[(ĥ(x|D) - E_D[ĥ(x|D)])²]
  • Irreducible variance: σ²
  • Model bias: the simpler the ĥ(x|D), the larger the bias
  • Model variance: the simpler the ĥ(x|D), the smaller the variance

Bagging
• Bagging performs better than a single classifier because it effectively reduces the model variance
• [Figure: test error of bagged decision trees vs. a single decision tree]
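As a concrete illustration of the bagging algorithm described above, here is a minimal Python sketch that creates k bootstrap samples, trains a decision tree on each, and classifies by an equal-weight majority vote, comparing the result against a single decision tree. scikit-learn, the synthetic dataset, and the helper names bagging_fit / bagging_predict are assumptions made for this sketch; the slides do not prescribe a particular base learner or dataset.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def bagging_fit(X, y, k=50):
    # Train k trees, each on a bootstrap sample: m examples drawn
    # with replacement from the m training examples.
    m = len(y)
    models = []
    for _ in range(k):
        idx = rng.integers(0, m, size=m)  # bootstrap sample D_i
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    # Equal-weight vote: predict the class chosen by the majority of the k trees.
    votes = np.stack([m.predict(X) for m in models])  # shape (k, n_test)
    return (votes.mean(axis=0) >= 0.5).astype(int)

single_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
bagged_trees = bagging_fit(X_train, y_train, k=50)

print("single tree accuracy :", accuracy_score(y_test, single_tree.predict(X_test)))
print("bagged trees accuracy:", accuracy_score(y_test, bagging_predict(bagged_trees, X_test)))
# On most runs the bagged ensemble matches or beats the single tree,
# consistent with the empirical study summarized in the slides.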

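The variance-reduction argument can also be checked by simulation. The sketch below is an illustration under assumed choices (the target function f, the noise level sigma, the test point x0, and the use of scikit-learn regression trees are all mine, not from the slides): it draws many independent training sets from y = f(x) + ε and compares the variance of the prediction at a fixed test point for a single unpruned tree against a small bagged ensemble. The bagged predictor typically shows a markedly smaller variance.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
sigma = 0.3                      # noise standard deviation, eps ~ N(0, sigma^2)
x0 = np.array([[0.5]])           # fixed test point at which we measure variance

def f(x):
    # Assumed true function f(x) for the simulation.
    return np.sin(3 * x)

def draw_training_set(m=50):
    # One training set D of m examples with y = f(x) + eps.
    x = rng.uniform(-1, 1, size=(m, 1))
    y = f(x).ravel() + rng.normal(0.0, sigma, size=m)
    return x, y

def single_tree_prediction(x, y):
    return DecisionTreeRegressor().fit(x, y).predict(x0)[0]

def bagged_prediction(x, y, k=25):
    # Average the predictions of k trees, each fit on a bootstrap sample of D.
    preds = []
    for _ in range(k):
        idx = rng.integers(0, len(y), size=len(y))
        preds.append(DecisionTreeRegressor().fit(x[idx], y[idx]).predict(x0)[0])
    return float(np.mean(preds))

single, bagged = [], []
for _ in range(200):             # 200 independent training sets D
    x, y = draw_training_set()
    single.append(single_tree_prediction(x, y))
    bagged.append(bagged_prediction(x, y))

print("prediction variance, single tree :", np.var(single))
print("prediction variance, bagged trees:", np.var(bagged))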
