Bayesian Learning
Rong Jin

Outline
• MAP learning vs. ML learning
• Minimum description length principle
• Bayes optimal classifier
• Bagging

Maximum Likelihood Learning (ML)
• Find the best model by maximizing the log-likelihood of the training data:
  $h_{ML} = \arg\max_h \log \Pr(D \mid h)$

Maximum A Posteriori Learning (MAP)
• ML learning: models are determined entirely by the training data
• Unable to incorporate prior knowledge/preference about models
• Maximum a posteriori (MAP) learning: knowledge/preference is incorporated through a prior,
  $h_{MAP} = \arg\max_h \Pr(h \mid D) = \arg\max_h \Pr(D \mid h)\,\Pr(h)$
• The prior $\Pr(h)$ encodes the knowledge/preference

MAP
• Uninformative prior: e.g., a generic zero-mean Gaussian prior over the weights, which yields regularized logistic regression

MAP
Consider text categorization
• $w_i$: importance of the i-th word in classification
• Prior knowledge: the more common the word, the less important it is
• How do we construct a prior that encodes this knowledge?

MAP
• An informative prior for text categorization
• $n_i$: the number of occurrences of the i-th word in the training data, so that more frequent words can be penalized more strongly

MAP
Two correlated classification tasks: $C_1$ and $C_2$
• How do we introduce an appropriate prior to capture this correlation?

MAP
• Construct priors that capture the dependence between $w_1$ and $w_2$, the weights of the two tasks

Minimum Description Length (MDL) Principle
• Occam's razor: prefer a simple hypothesis
• Simple hypothesis ⇒ short description length
• Minimum description length:
  $h_{MDL} = \arg\min_h \left[ L_{C_1}(h) + L_{C_2}(D \mid h) \right]$
  where $L_C(x)$ is the description length of message $x$ under coding scheme $C$
• $L_{C_1}(h)$: bits for encoding hypothesis $h$; $L_{C_2}(D \mid h)$: bits for encoding the data given $h$

MDL
• A sender wants to transmit the labels $D$ to a receiver
• Send only $D$? Send only $h$? Send $h$ plus the exceptions $D \mid h$?

Example: Decision Tree
$H$ = decision trees, $D$ = training data labels
• $L_{C_1}(h)$ is the number of bits needed to describe tree $h$
• $L_{C_2}(D \mid h)$ is the number of bits needed to describe $D$ given tree $h$
• $L_{C_2}(D \mid h) = 0$ if the examples are classified perfectly by $h$; only the exceptions need to be described
• $h_{MDL}$ trades off tree size against training errors

MAP vs. MDL
• MAP learning: $h_{MAP} = \arg\max_h \Pr(D \mid h)\Pr(h) = \arg\min_h \left[ -\log_2 \Pr(D \mid h) - \log_2 \Pr(h) \right]$
• MDL learning: $h_{MDL} = \arg\min_h \left[ L_{C_2}(D \mid h) + L_{C_1}(h) \right]$
• MAP coincides with MDL when the coding schemes are optimal codes with lengths $-\log_2 \Pr(D \mid h)$ and $-\log_2 \Pr(h)$

Problems with Maximum Approaches
Consider three possible hypotheses:
  $\Pr(h_1 \mid D) = 0.4,\quad \Pr(h_2 \mid D) = 0.3,\quad \Pr(h_3 \mid D) = 0.3$
• Maximum approaches will pick $h_1$
• Given a new instance $x$: $h_1(x) = +,\; h_2(x) = -,\; h_3(x) = -$
• Maximum approaches will output $+$
• However, is this the most probable result?

Bayes Optimal Classifier (Bayesian Average)
Bayes optimal classification:
  $c^* = \arg\max_{c} \sum_{h \in H} \Pr(c \mid x, h)\, \Pr(h \mid D)$
Example:
  $\Pr(h_1 \mid D) = 0.4,\quad \Pr(+ \mid x, h_1) = 1,\quad \Pr(- \mid x, h_1) = 0$
  $\Pr(h_2 \mid D) = 0.3,\quad \Pr(+ \mid x, h_2) = 0,\quad \Pr(- \mid x, h_2) = 1$
  $\Pr(h_3 \mid D) = 0.3,\quad \Pr(+ \mid x, h_3) = 0,\quad \Pr(- \mid x, h_3) = 1$
  $\sum_h \Pr(h \mid D)\Pr(+ \mid x, h) = 0.4,\quad \sum_h \Pr(h \mid D)\Pr(- \mid x, h) = 0.6$
• The most probable class is $-$
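The worked example can be reproduced directly. Below is a minimal Python sketch of Bayes optimal classification; the posterior values and per-hypothesis predictions are the ones from the example above, while the function name and data layout are just for illustration:

```python
# Bayes optimal classification: weight each hypothesis's class
# probabilities by its posterior Pr(h | D) and pick the argmax class.
posteriors = [0.4, 0.3, 0.3]              # Pr(h_i | D) from the example
class_probs = [                           # Pr(c | x, h_i) for c in {+, -}
    {"+": 1.0, "-": 0.0},                 # h_1 predicts +
    {"+": 0.0, "-": 1.0},                 # h_2 predicts -
    {"+": 0.0, "-": 1.0},                 # h_3 predicts -
]

def bayes_optimal(posteriors, class_probs):
    """Return the class maximizing sum_h Pr(c | x, h) * Pr(h | D)."""
    votes = {}
    for p_h, probs in zip(posteriors, class_probs):
        for c, p_c in probs.items():
            votes[c] = votes.get(c, 0.0) + p_h * p_c
    return max(votes, key=votes.get), votes

label, votes = bayes_optimal(posteriors, class_probs)
print(votes)   # {'+': 0.4, '-': 0.6}
print(label)   # '-' : the Bayes optimal prediction, unlike the MAP hypothesis h_1
```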
Computational Issues
• Need to sum over all possible hypotheses
• This is expensive or impossible when the hypothesis space is large (e.g., decision trees)
• Solution: sampling!

Gibbs Classifier
Gibbs algorithm:
1. Choose one hypothesis at random, according to $p(h \mid D)$
2. Use this hypothesis to classify the new instance
• Surprising fact: its expected error is at most twice that of the Bayes optimal classifier,
  $E[\mathrm{err}_{Gibbs}] \le 2\, E[\mathrm{err}_{BayesOptimal}]$
• Improve further by sampling multiple hypotheses from $p(h \mid D)$ and averaging their classification results

Bagging Classifiers
• In general, sampling from $p(h \mid D)$ is difficult
• $p(h \mid D)$ is difficult to compute, and impossible to compute for a non-probabilistic classifier such as an SVM
• Bagging classifiers: realize sampling from $p(h \mid D)$ by sampling training examples

Bootstrap Sampling
Bagging = Bootstrap aggregating
• Bootstrap sampling: given a set $D$ containing $m$ training examples, create $D_i$ by drawing $m$ examples at random with replacement from $D$
• Each $D_i$ is expected to leave out about 37% of the examples in $D$ (since $(1 - 1/m)^m \approx e^{-1} \approx 0.37$)

Bagging Algorithm
• Create $k$ bootstrap samples $D_1, D_2, \dots, D_k$
• Train a distinct classifier $h_i$ on each $D_i$
• Classify new instances by a vote of the classifiers, with equal weights
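A minimal sketch of this procedure, assuming scikit-learn decision trees as the base classifier and a synthetic dataset; the dataset, tree settings, and $k = 50$ mirror the empirical study described below but are otherwise illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, k=50, rng=None):
    """Train k decision trees, each on a bootstrap sample of (X, y)."""
    rng = np.random.default_rng(rng)
    m = len(X)
    trees = []
    for _ in range(k):
        idx = rng.integers(0, m, size=m)   # draw m examples with replacement
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def bagging_predict(trees, X):
    """Classify by an equal-weight majority vote over the k trees."""
    votes = np.stack([t.predict(X) for t in trees])        # shape (k, n_samples)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

single = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
bagged = bagging_fit(X_tr, y_tr, k=50, rng=0)

print("single tree accuracy :", (single.predict(X_te) == y_te).mean())
print("bagged trees accuracy:", (bagging_predict(bagged, X_te) == y_te).mean())
```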
Bagging ≈ Bayesian Average
• Bayesian average: sample hypotheses $h_1, h_2, \dots, h_k$ from the posterior $P(h \mid D)$ and average $\sum_i \Pr(c \mid h_i, x)$
• Bagging: draw bootstrap samples $D_1, D_2, \dots, D_k$ from $D$, train $h_1, h_2, \dots, h_k$ on them, and average $\sum_i \Pr(c \mid h_i, x)$
• Bootstrap sampling is almost equivalent to sampling from the posterior $P(h \mid D)$

Empirical Study of Bagging
Bagging decision trees
• Bootstrap 50 different samples from the original training data
• Learn a decision tree over each bootstrap sample
• Predict the class labels of test instances by the majority vote of the 50 decision trees
• Bagged decision trees outperform a single decision tree

Why does bagging work better than a single classifier?
• Real-valued case: $y \sim f(x) + \varepsilon$, with $\varepsilon \sim N(0, \sigma^2)$
• $\hat{h}(x \mid D)$ is a predictor learned from the training data $D$

Bias-Variance Tradeoff
  $E_{D,\varepsilon}\big[(y - \hat{h}(x \mid D))^2\big] = \sigma^2 + \big(E_D[\hat{h}(x \mid D)] - f(x)\big)^2 + E_D\big[(\hat{h}(x \mid D) - E_D[\hat{h}(x \mid D)])^2\big]$
• Irreducible variance: the noise term $\sigma^2$, independent of the model
• Model bias (squared): $\big(E_D[\hat{h}(x \mid D)] - f(x)\big)^2$; the simpler the $\hat{h}(x \mid D)$, the larger the bias
• Model variance: $E_D\big[(\hat{h}(x \mid D) - E_D[\hat{h}(x \mid D)])^2\big]$; the simpler the $\hat{h}(x \mid D)$, the smaller the variance

Bagging
• Bagging performs better than a single classifier because it effectively reduces the model variance
(Figure: test error of a single decision tree vs. bagged decision trees.)
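To make the variance-reduction claim concrete, here is an illustrative Monte Carlo sketch in a regression setting; the target function, noise level, and sample sizes are all assumptions, not part of the slides. It repeatedly redraws the training set $D$ and compares the prediction variance of a single tree with that of a bagged ensemble:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
f = lambda x: np.sin(3 * x)                 # assumed target function
x_test = np.linspace(0, 2, 50)[:, None]     # fixed test points

def draw_training_set(m=100, sigma=0.3):
    """y ~ f(x) + eps, eps ~ N(0, sigma^2)."""
    x = rng.uniform(0, 2, size=(m, 1))
    return x, f(x.ravel()) + rng.normal(0, sigma, size=m)

def bagged_predict(x, y, x_test, k=50):
    """Average the predictions of k trees, each fit on a bootstrap sample."""
    preds = []
    for _ in range(k):
        idx = rng.integers(0, len(x), size=len(x))
        preds.append(DecisionTreeRegressor().fit(x[idx], y[idx]).predict(x_test))
    return np.mean(preds, axis=0)

# Re-learn both predictors on many independent training sets D and measure
# how much their predictions at the test points vary with D (model variance).
single_preds, bagged_preds = [], []
for _ in range(30):
    x, y = draw_training_set()
    single_preds.append(DecisionTreeRegressor().fit(x, y).predict(x_test))
    bagged_preds.append(bagged_predict(x, y, x_test))

print("model variance, single tree :", np.var(np.stack(single_preds), axis=0).mean())
print("model variance, bagged trees:", np.var(np.stack(bagged_preds), axis=0).mean())
# The bagged ensemble typically shows a much smaller variance across datasets.
```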