Ensemble Methods: Boosting
Nicholas Ruozzi
University of Texas at Dallas
Based on the slides of Vibhav Gogate and Rob Schapire

Last Time
- Variance reduction via bagging:
  - Generate new training data sets by sampling with replacement from the empirical distribution
  - Learn a classifier for each of the newly sampled sets
  - Combine the classifiers for prediction
- Today: how to reduce bias for binary classification problems

Boosting
- How do we translate rules of thumb (i.e., good heuristics) into good learning algorithms?
- For example, if we are trying to classify email as spam or not spam, a good rule of thumb may be that emails containing "Nigerian prince" or "business opportunity" are likely to be spam most of the time

Boosting
- Freund & Schapire: theory for weak learners in the late 80's
- Weak learner: performance on any training set is slightly better than chance prediction
- Intended to answer a theoretical question, not as a practical way to improve learning
- Tested in the mid 90's using not-so-weak learners
- Works anyway!

PAC Learning
- Given i.i.d. samples from an unknown, arbitrary distribution
- Strong PAC learning algorithm: for any distribution, with high probability, given polynomially many samples (and polynomial time), can find a classifier with arbitrarily small error
- Weak PAC learning algorithm: same, but the error only needs to be slightly better than random guessing (e.g., accuracy only needs to exceed 50% for binary classification)
- Does weak learnability imply strong learnability?

Boosting
1. Weight all training samples equally
2. Train a model on the training set
3. Compute the error of the model on the training set
4. Increase the weights on the training cases the model gets wrong
5. Train a new model on the re-weighted training set
6. Re-compute the errors on the weighted training set
7. Increase the weights again on the cases the model gets wrong
- Repeat until tired
- Final model: weighted prediction of each model

Boosting: Graphical Illustration
- The final classifier is a weighted vote of the individual classifiers:
  $h(x) = \mathrm{sign}\big(\alpha_1 h_1(x) + \alpha_2 h_2(x) + \cdots + \alpha_T h_T(x)\big)$

AdaBoost
1. Initialize the data weights $w_1^{(1)}, \ldots, w_N^{(1)}$ for the first round as $w_i^{(1)} = \frac{1}{N}$
2. For $t = 1, \ldots, T$:
   (a) Select a classifier $h_t$ for the $t$-th round by minimizing the weighted error
       $\epsilon_t = \sum_{i=1}^{N} w_i^{(t)} \, \mathbb{1}[h_t(x_i) \neq y_i]$
   (b) Compute $\alpha_t = \frac{1}{2} \ln\!\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$
   (c) Update the weights $w_i^{(t+1)} = \frac{w_i^{(t)} \exp(-\alpha_t \, y_i \, h_t(x_i))}{Z_t}$
- $\epsilon_t$ is the weighted number of incorrect classifications of the $t$-th classifier
- As $\epsilon_t \to 0$, $\alpha_t \to +\infty$; when $\epsilon_t = 0.5$, $\alpha_t = 0$; as $\epsilon_t \to 1$, $\alpha_t \to -\infty$
- $Z_t$ is a normalization constant chosen so that the updated weights sum to one
- (A short code sketch of this loop appears after the worked example below)

Example
- Consider a classification problem where vertical and horizontal lines (and their corresponding half-spaces) are the weak learners
- Round 1: $\epsilon_1 = 0.30$, $\alpha_1 = 0.42$
- Round 2: $\epsilon_2 = 0.21$, $\alpha_2 = 0.65$
- Round 3: $\epsilon_3 = 0.14$, $\alpha_3 = 0.92$

Final Hypothesis
- $H_{\text{final}}(x) = \mathrm{sign}\big(0.42\, h_1(x) + 0.65\, h_2(x) + 0.92\, h_3(x)\big)$
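To make the AdaBoost loop above concrete, here is a minimal sketch using axis-aligned decision stumps as the weak learners, assuming labels in {-1, +1}. The stump representation, the exhaustive threshold search, the parameter T, and the function names (best_stump, adaboost, predict) are illustrative choices for this sketch, not part of the original slides.

```python
import numpy as np

def best_stump(X, y, w):
    """Exhaustively search axis-aligned stumps and return the one with the
    smallest weighted error on (X, y) under the current weights w."""
    n, d = X.shape
    best = (np.inf, None)  # (weighted error, (feature, threshold, sign))
    for j in range(d):
        for thresh in np.unique(X[:, j]):
            for s in (+1, -1):
                pred = s * np.where(X[:, j] <= thresh, 1, -1)
                err = np.sum(w[pred != y])
                if err < best[0]:
                    best = (err, (j, thresh, s))
    return best  # (epsilon_t, stump parameters)

def stump_predict(stump, X):
    j, thresh, s = stump
    return s * np.where(X[:, j] <= thresh, 1, -1)

def adaboost(X, y, T=20):
    n = X.shape[0]
    w = np.full(n, 1.0 / n)                   # step 1: uniform initial weights
    ensemble = []                             # list of (alpha_t, stump_t)
    for t in range(T):
        eps, stump = best_stump(X, y, w)      # step 2(a): minimize weighted error
        if eps == 0 or eps >= 0.5:            # no usable weak learner this round
            break
        alpha = 0.5 * np.log((1 - eps) / eps) # step 2(b)
        pred = stump_predict(stump, X)
        w = w * np.exp(-alpha * y * pred)     # step 2(c): reweight the examples...
        w = w / w.sum()                       # ...and normalize (the Z_t constant)
        ensemble.append((alpha, stump))
    return ensemble

def predict(ensemble, X):
    """Weighted vote: sign of the alpha-weighted sum of stump predictions."""
    score = sum(a * stump_predict(s, X) for a, s in ensemble)
    return np.sign(score)
```

A typical usage would be `ensemble = adaboost(X_train, y_train, T=10)` followed by `predict(ensemble, X_test)`, with `X` arrays of shape (n, d) and `y` entries in {-1, +1}.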
Boosting Theorem
- Theorem: let $\gamma_t = \frac{1}{2} - \epsilon_t$. The training error of the final classifier after $T$ rounds is at most
  $\prod_{t=1}^{T} 2\sqrt{\epsilon_t (1 - \epsilon_t)} = \prod_{t=1}^{T} \sqrt{1 - 4\gamma_t^2} \leq \exp\!\left(-2 \sum_{t=1}^{T} \gamma_t^2\right)$
- So even if all of the $\gamma_t$'s are small positive numbers (i.e., we can always find a weak learner), the training error goes to zero as $T$ increases

Margins & Boosting
- We can see that the training error goes down, but what about the test error? That is, does boosting help us generalize better?
- To answer this question, we need to look at how confident we are in our predictions
- How can we measure this? Margins!

Margins & Boosting
- Intuition: larger margins lead to better generalization (same as SVMs)
- Theorem: with high probability, boosting increases the size of the margins
- Note: boosting does NOT maximize the margin, so it could still result in poor generalization performance

Boosting Performance
- [figure]

Boosting as Optimization
- AdaBoost can be interpreted as a coordinate descent method for a specific loss function
- Let $h_1, \ldots, h_K$ be the set of all weak learners
- Exponential loss:
  $J(\alpha) = \sum_{i=1}^{N} \exp\!\left(-y_i \sum_{k=1}^{K} \alpha_k h_k(x_i)\right)$
- Convex in $\alpha$
- AdaBoost minimizes this exponential loss

Coordinate Descent
- Minimize the loss with respect to a single component of $\alpha$; let's pick $\alpha_k$ and set the partial derivative to zero:
  $\frac{\partial J}{\partial \alpha_k} = -\sum_{i=1}^{N} y_i h_k(x_i) \exp\!\left(-y_i \sum_{k'} \alpha_{k'} h_{k'}(x_i)\right) = 0$
- Splitting the sum into examples that $h_k$ classifies correctly and incorrectly gives
  $e^{-\alpha_k} \sum_{i:\, h_k(x_i) = y_i} \tilde{w}_i = e^{\alpha_k} \sum_{i:\, h_k(x_i) \neq y_i} \tilde{w}_i$, where $\tilde{w}_i = \exp\!\left(-y_i \sum_{k' \neq k} \alpha_{k'} h_{k'}(x_i)\right)$

Coordinate Descent
- Solving for $\alpha_k$:
  $\alpha_k = \frac{1}{2} \ln\!\left(\frac{\sum_{i:\, h_k(x_i) = y_i} \tilde{w}_i}{\sum_{i:\, h_k(x_i) \neq y_i} \tilde{w}_i}\right)$
- This is similar to the AdaBoost update; the only difference is that AdaBoost tells us in which order we should update the variables

AdaBoost as Optimization
- Could derive an AdaBoost-style algorithm for other types of loss functions
- Important to note: the exponential loss is convex but not strictly convex, so it may have multiple global optima
- In practice, AdaBoost can perform quite differently than other methods for minimizing this loss (e.g., gradient descent)

Boosting in Practice
- Our description of the algorithm assumed that a set of possible hypotheses was given
- In practice, the set of hypotheses can be built as the algorithm progresses
- Example: build a new decision tree at each iteration for the data set in which the $i$-th example has weight $w_i^{(t)}$
- When computing the information gain, compute the empirical probabilities using the weights (a small sketch of this weighted computation appears at the end of these notes)

Boosting vs. Bagging
- Bagging doesn't work well with stable models; boosting might still help
- Boosting might hurt performance on noisy datasets; bagging doesn't have this problem
- On average, boosting improves classification accuracy more than bagging, but it is also more common for boosting to hurt performance
- Both ensemble methods have added overhead required to train multiple classifiers
- Bagging is easier to parallelize

Boosting Beyond Binary Classification
- Slightly more complicated
- Want to select weak learners that are better than random guessing, but there are many different ways to do better than random
- A hypothesis space is boostable if there exists a baseline measure that is slightly better than random such that you can always find a hypothesis that outperforms the baseline
- Can be boostable with respect to some baselines but not others
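To illustrate the "Boosting in Practice" point about estimating empirical probabilities from the boosting weights, here is a small sketch of a weighted entropy and information-gain computation for a binary split. The function names and the boolean-mask interface are assumptions made for this example, not from the slides.

```python
import numpy as np

def weighted_entropy(y, w):
    """Entropy of the labels y under sample weights w (weights need not sum to 1)."""
    total = w.sum()
    if total == 0:
        return 0.0
    ent = 0.0
    for c in np.unique(y):
        p = w[y == c].sum() / total   # weighted empirical probability of class c
        if p > 0:
            ent -= p * np.log2(p)
    return ent

def weighted_information_gain(y, w, left_mask):
    """Information gain of a binary split (left_mask vs. its complement),
    with all probabilities estimated from the boosting weights w."""
    total = w.sum()
    h_parent = weighted_entropy(y, w)
    h_left = weighted_entropy(y[left_mask], w[left_mask])
    h_right = weighted_entropy(y[~left_mask], w[~left_mask])
    p_left = w[left_mask].sum() / total
    return h_parent - (p_left * h_left + (1 - p_left) * h_right)
```

A tree grown inside the boosting loop would score candidate splits with `weighted_information_gain(y, w, left_mask)`, so examples with large weights dominate the split choice exactly as the reweighted training set intends.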