Bayesian Learning
Rong Jin

Outline
• MAP learning vs. ML learning
• Minimum description length principle
• Bayes optimal classifier
• Bagging

Maximum Likelihood Learning (ML)
• Find the best model by maximizing the log-likelihood of the training data:
  $h_{ML} = \arg\max_h \log \Pr(D \mid h)$

Maximum A Posteriori Learning (MAP)
• ML learning: models are determined entirely by the training data
• Unable to incorporate prior knowledge/preference about models
• Maximum a posteriori (MAP) learning: knowledge/preference is incorporated through a prior,
  $h_{MAP} = \arg\max_h \Pr(h \mid D) = \arg\max_h \Pr(D \mid h)\,\Pr(h)$
• The prior $\Pr(h)$ encodes the knowledge/preference

MAP
• Uninformative prior: e.g., a generic zero-mean Gaussian prior over the weights, which yields regularized logistic regression

MAP
Consider text categorization
• $w_i$: importance of the i-th word in classification
• Prior knowledge: the more common the word, the less important it is
• How do we construct a prior that encodes this knowledge?

MAP
• An informative prior for text categorization
• $n_i$: the number of occurrences of the i-th word in the training data, so that more frequent words can be penalized more strongly

MAP
Two correlated classification tasks: $C_1$ and $C_2$
• How do we introduce an appropriate prior to capture this correlation?

MAP
• Construct priors that capture the dependence between $w_1$ and $w_2$, the weights of the two tasks

Minimum Description Length (MDL) Principle
• Occam's razor: prefer a simple hypothesis
• Simple hypothesis ⇒ short description length
• Minimum description length:
  $h_{MDL} = \arg\min_h \left[ L_{C_1}(h) + L_{C_2}(D \mid h) \right]$
  where $L_C(x)$ is the description length of message $x$ under coding scheme $C$
• $L_{C_1}(h)$: bits for encoding hypothesis $h$; $L_{C_2}(D \mid h)$: bits for encoding the data given $h$

MDL
• A sender wants to transmit the labels $D$ to a receiver
• Send only $D$? Send only $h$? Send $h$ plus the exceptions $D \mid h$?

Example: Decision Tree
$H$ = decision trees, $D$ = training data labels
• $L_{C_1}(h)$ is the number of bits needed to describe tree $h$
• $L_{C_2}(D \mid h)$ is the number of bits needed to describe $D$ given tree $h$
• $L_{C_2}(D \mid h) = 0$ if the examples are classified perfectly by $h$; only the exceptions need to be described
• $h_{MDL}$ trades off tree size against training errors

MAP vs. MDL
• MAP learning: $h_{MAP} = \arg\max_h \Pr(D \mid h)\Pr(h) = \arg\min_h \left[ -\log_2 \Pr(D \mid h) - \log_2 \Pr(h) \right]$
• MDL learning: $h_{MDL} = \arg\min_h \left[ L_{C_2}(D \mid h) + L_{C_1}(h) \right]$
• MAP coincides with MDL when the coding schemes are optimal codes with lengths $-\log_2 \Pr(D \mid h)$ and $-\log_2 \Pr(h)$

Problems with Maximum Approaches
Consider three possible hypotheses:
  $\Pr(h_1 \mid D) = 0.4,\quad \Pr(h_2 \mid D) = 0.3,\quad \Pr(h_3 \mid D) = 0.3$
• Maximum approaches will pick $h_1$
• Given a new instance $x$: $h_1(x) = +,\; h_2(x) = -,\; h_3(x) = -$
• Maximum approaches will output $+$
• However, is this the most probable result?

Bayes Optimal Classifier (Bayesian Average)
Bayes optimal classification:
  $c^* = \arg\max_{c} \sum_{h \in H} \Pr(c \mid x, h)\, \Pr(h \mid D)$
Example:
  $\Pr(h_1 \mid D) = 0.4,\quad \Pr(+ \mid x, h_1) = 1,\quad \Pr(- \mid x, h_1) = 0$
  $\Pr(h_2 \mid D) = 0.3,\quad \Pr(+ \mid x, h_2) = 0,\quad \Pr(- \mid x, h_2) = 1$
  $\Pr(h_3 \mid D) = 0.3,\quad \Pr(+ \mid x, h_3) = 0,\quad \Pr(- \mid x, h_3) = 1$
  $\sum_h \Pr(h \mid D)\Pr(+ \mid x, h) = 0.4,\quad \sum_h \Pr(h \mid D)\Pr(- \mid x, h) = 0.6$
• The most probable class is $-$
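The worked example can be reproduced directly. Below is a minimal Python sketch of Bayes optimal classification; the posterior values and per-hypothesis predictions are the ones from the example above, while the function name and data layout are just for illustration:

```python
# Bayes optimal classification: weight each hypothesis's class
# probabilities by its posterior Pr(h | D) and pick the argmax class.
posteriors = [0.4, 0.3, 0.3]              # Pr(h_i | D) from the example
class_probs = [                           # Pr(c | x, h_i) for c in {+, -}
    {"+": 1.0, "-": 0.0},                 # h_1 predicts +
    {"+": 0.0, "-": 1.0},                 # h_2 predicts -
    {"+": 0.0, "-": 1.0},                 # h_3 predicts -
]

def bayes_optimal(posteriors, class_probs):
    """Return the class maximizing sum_h Pr(c | x, h) * Pr(h | D)."""
    votes = {}
    for p_h, probs in zip(posteriors, class_probs):
        for c, p_c in probs.items():
            votes[c] = votes.get(c, 0.0) + p_h * p_c
    return max(votes, key=votes.get), votes

label, votes = bayes_optimal(posteriors, class_probs)
print(votes)   # {'+': 0.4, '-': 0.6}
print(label)   # '-' : the Bayes optimal prediction, unlike the MAP hypothesis h_1
```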
Computational Issues
• Need to sum over all possible hypotheses
• This is expensive or impossible when the hypothesis space is large (e.g., decision trees)
• Solution: sampling!

Gibbs Classifier
Gibbs algorithm:
1. Choose one hypothesis at random, according to $p(h \mid D)$
2. Use this hypothesis to classify the new instance
• Surprising fact: its expected error is at most twice that of the Bayes optimal classifier,
  $E[\mathrm{err}_{Gibbs}] \le 2\, E[\mathrm{err}_{BayesOptimal}]$
• Improve further by sampling multiple hypotheses from $p(h \mid D)$ and averaging their classification results

Bagging Classifiers
• In general, sampling from $p(h \mid D)$ is difficult
• $p(h \mid D)$ is difficult to compute, and impossible to compute for a non-probabilistic classifier such as an SVM
• Bagging classifiers: realize sampling from $p(h \mid D)$ by sampling training examples

Bootstrap Sampling
Bagging = Bootstrap aggregating
• Bootstrap sampling: given a set $D$ containing $m$ training examples, create $D_i$ by drawing $m$ examples at random with replacement from $D$
• Each $D_i$ is expected to leave out about 37% of the examples in $D$ (since $(1 - 1/m)^m \approx e^{-1} \approx 0.37$)

Bagging Algorithm
• Create $k$ bootstrap samples $D_1, D_2, \dots, D_k$
• Train a distinct classifier $h_i$ on each $D_i$
• Classify new instances by a vote of the classifiers, with equal weights
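A minimal sketch of this procedure, assuming scikit-learn decision trees as the base classifier and a synthetic dataset; the dataset, tree settings, and $k = 50$ mirror the empirical study described below but are otherwise illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, k=50, rng=None):
    """Train k decision trees, each on a bootstrap sample of (X, y)."""
    rng = np.random.default_rng(rng)
    m = len(X)
    trees = []
    for _ in range(k):
        idx = rng.integers(0, m, size=m)   # draw m examples with replacement
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def bagging_predict(trees, X):
    """Classify by an equal-weight majority vote over the k trees."""
    votes = np.stack([t.predict(X) for t in trees])        # shape (k, n_samples)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

single = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
bagged = bagging_fit(X_tr, y_tr, k=50, rng=0)

print("single tree accuracy :", (single.predict(X_te) == y_te).mean())
print("bagged trees accuracy:", (bagging_predict(bagged, X_te) == y_te).mean())
```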
Bagging ≈ Bayesian Average
• Bayesian average: sample hypotheses $h_1, h_2, \dots, h_k$ from the posterior $P(h \mid D)$ and average $\sum_i \Pr(c \mid h_i, x)$
• Bagging: draw bootstrap samples $D_1, D_2, \dots, D_k$ from $D$, train $h_1, h_2, \dots, h_k$ on them, and average $\sum_i \Pr(c \mid h_i, x)$
• Bootstrap sampling is almost equivalent to sampling from the posterior $P(h \mid D)$

Empirical Study of Bagging
Bagging decision trees
• Bootstrap 50 different samples from the original training data
• Learn a decision tree over each bootstrap sample
• Predict the class labels of test instances by the majority vote of the 50 decision trees
• Bagged decision trees outperform a single decision tree

Why does bagging work better than a single classifier?
• Real-valued case: $y \sim f(x) + \varepsilon$, with $\varepsilon \sim N(0, \sigma^2)$
• $\hat{h}(x \mid D)$ is a predictor learned from the training data $D$

Bias-Variance Tradeoff
  $E_{D,\varepsilon}\big[(y - \hat{h}(x \mid D))^2\big] = \sigma^2 + \big(E_D[\hat{h}(x \mid D)] - f(x)\big)^2 + E_D\big[(\hat{h}(x \mid D) - E_D[\hat{h}(x \mid D)])^2\big]$
• Irreducible variance: the noise term $\sigma^2$, independent of the model
• Model bias (squared): $\big(E_D[\hat{h}(x \mid D)] - f(x)\big)^2$; the simpler the $\hat{h}(x \mid D)$, the larger the bias
• Model variance: $E_D\big[(\hat{h}(x \mid D) - E_D[\hat{h}(x \mid D)])^2\big]$; the simpler the $\hat{h}(x \mid D)$, the smaller the variance

Bagging
• Bagging performs better than a single classifier because it effectively reduces the model variance
(Figure: test error of a single decision tree vs. bagged decision trees.)
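To make the variance-reduction claim concrete, here is an illustrative Monte Carlo sketch in a regression setting; the target function, noise level, and sample sizes are all assumptions, not part of the slides. It repeatedly redraws the training set $D$ and compares the prediction variance of a single tree with that of a bagged ensemble:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
f = lambda x: np.sin(3 * x)                 # assumed target function
x_test = np.linspace(0, 2, 50)[:, None]     # fixed test points

def draw_training_set(m=100, sigma=0.3):
    """y ~ f(x) + eps, eps ~ N(0, sigma^2)."""
    x = rng.uniform(0, 2, size=(m, 1))
    return x, f(x.ravel()) + rng.normal(0, sigma, size=m)

def bagged_predict(x, y, x_test, k=50):
    """Average the predictions of k trees, each fit on a bootstrap sample."""
    preds = []
    for _ in range(k):
        idx = rng.integers(0, len(x), size=len(x))
        preds.append(DecisionTreeRegressor().fit(x[idx], y[idx]).predict(x_test))
    return np.mean(preds, axis=0)

# Re-learn both predictors on many independent training sets D and measure
# how much their predictions at the test points vary with D (model variance).
single_preds, bagged_preds = [], []
for _ in range(30):
    x, y = draw_training_set()
    single_preds.append(DecisionTreeRegressor().fit(x, y).predict(x_test))
    bagged_preds.append(bagged_predict(x, y, x_test))

print("model variance, single tree :", np.var(np.stack(single_preds), axis=0).mean())
print("model variance, bagged trees:", np.var(np.stack(bagged_preds), axis=0).mean())
# The bagged ensemble typically shows a much smaller variance across datasets.
```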