Boosting: Can we make dumb learners smart?
Aarti Singh
Machine Learning 10-701/15-781
Oct 11, 2010
Slides courtesy: Carlos Guestrin, Freund & Schapire

Project proposal due today.

Why boost weak learners?
Goal: automatically categorize the type of call requested (Collect, Calling card, Person-to-person, etc.).
- It is easy to find rules of thumb that are often correct, e.g. "If 'card' occurs in the utterance, then predict 'calling card'."
- It is hard to find a single highly accurate prediction rule.

Fighting the bias-variance tradeoff
Simple (a.k.a. weak) learners, e.g. naive Bayes, logistic regression, decision stumps (or shallow decision trees):
- are good: low variance, they don't usually overfit;
- are bad: high bias, they can't solve hard learning problems.
Can we make weak learners always good? No, but often yes.

Voting (Ensemble Methods)
Instead of learning a single (weak) classifier, learn many weak classifiers that are good at different parts of the input space. The output class is a weighted vote of each classifier.
- Classifiers that are most "sure" will vote with more conviction.
- Classifiers will be most "sure" about a particular part of the space.
- On average, the ensemble does better than a single classifier.
Formally, H: X → Y with Y = {-1, +1}. Given weak classifiers h_1(X), h_2(X), ..., a simple unweighted combination is H(X) = sign(h_1(X) + h_2(X)); more generally, H(X) = sign(Σ_t α_t h_t(X)), where the α_t are the weights.
But how do you
- force the classifiers h_t to learn about different parts of the input space, and
- weigh the votes of the different classifiers (choose the α_t)?

Boosting [Schapire, '89]
Idea: given a weak learner, run it multiple times on (reweighted) training data, then let the learned classifiers vote. On each iteration t:
- weight each training example by how incorrectly it was classified,
- learn a weak hypothesis h_t, and
- a strength for this hypothesis, α_t.
Final classifier: H(X) = sign(Σ_t α_t h_t(X)).
Boosting is practically useful and theoretically interesting.

Learning from weighted data
Consider a weighted dataset, where D(i) is the weight of the i-th training example (x_i, y_i). Interpretations:
- the i-th training example counts as D(i) examples;
- if I were to "resample" the data, I would get more samples of "heavier" data points.
Now, in all calculations, whenever it is used, the i-th training example counts as D(i) examples; e.g. in MLE, redefine Count(Y = y) to be a weighted count:
- unweighted data: Count(Y = y) = Σ_{i=1..m} 1(Y_i = y)
- with weights D(i): Count(Y = y) = Σ_{i=1..m} D(i) 1(Y_i = y)

AdaBoost [Freund & Schapire, '95]
- Start with equal weights on all points: D_1(i) = 1/m.
- For t = 1, ..., T:
  - train a weak learner (e.g. naive Bayes or a decision stump) on the weighted data D_t to get h_t;
  - choose a "magic" (positive) strength α_t for h_t;
  - update the weights, increasing the weight of point i if h_t gets it wrong (y_i h_t(x_i) < 0) and decreasing it if h_t gets it right (y_i h_t(x_i) > 0):
      D_{t+1}(i) = D_t(i) exp(-α_t y_i h_t(x_i)) / Z_t,
    where Z_t normalizes so that the weights of all points sum to 1, Σ_i D_{t+1}(i) = 1.
- Output the final classifier H(X) = sign(Σ_t α_t h_t(X)).
(A short code sketch of this loop appears below.)

What α_t to choose for hypothesis h_t? Weight update rule [Freund & Schapire '95]
The weighted training error ε_t = Σ_i D_t(i) 1(h_t(x_i) ≠ y_i) is the total weight of the points h_t gets wrong. With α_t = ½ ln((1 - ε_t)/ε_t):
- ε_t = 0 if h_t perfectly classifies all weighted data points, giving α_t = +∞;
- ε_t = 1 if h_t is perfectly wrong (so -h_t is perfectly right), giving α_t = -∞;
- ε_t = 0.5 gives α_t = 0.

Boosting example: decision stumps
[Worked example on the slides: decision-stump boundaries and reweighted points over several rounds.]

Analyzing training error
The analysis reveals what α_t to choose for hypothesis h_t:
  α_t = ½ ln((1 - ε_t)/ε_t), where ε_t is the weighted training error.
If each weak learner h_t is slightly better than random guessing (ε_t < 0.5), then the training error of AdaBoost decays exponentially fast in the number of rounds T.
[Plot: training error vs. number of rounds.]
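
The AdaBoost loop described above translates almost directly into code. Below is a minimal sketch in Python/NumPy with decision stumps as the weak learner; the names (train_stump, stump_predict, adaboost) are illustrative and not from the lecture, and this is a sketch of the algorithm rather than a reference implementation.

```python
# Minimal AdaBoost sketch with decision stumps as the weak learner.
import numpy as np

def train_stump(X, y, D):
    """Pick the (feature, threshold, sign) stump with lowest weighted error."""
    m, n = X.shape
    best, best_err = None, np.inf
    for j in range(n):
        for thresh in np.unique(X[:, j]):
            for sign in (+1, -1):
                pred = np.where(X[:, j] <= thresh, sign, -sign)
                err = np.sum(D[pred != y])          # weighted training error eps_t
                if err < best_err:
                    best_err, best = err, (j, thresh, sign)
    return best, best_err

def stump_predict(stump, X):
    j, thresh, sign = stump
    return np.where(X[:, j] <= thresh, sign, -sign)

def adaboost(X, y, T=50):
    """Labels y must be in {-1, +1}. Returns the weak hypotheses and their weights alpha_t."""
    m = X.shape[0]
    D = np.full(m, 1.0 / m)                         # initially equal weights D_1(i) = 1/m
    stumps, alphas = [], []
    for t in range(T):
        stump, eps = train_stump(X, y, D)           # weak hypothesis h_t
        eps = np.clip(eps, 1e-10, 1 - 1e-10)        # guard against division by zero
        alpha = 0.5 * np.log((1 - eps) / eps)       # the "magic" alpha_t
        pred = stump_predict(stump, X)
        D *= np.exp(-alpha * y * pred)              # increase weight on points h_t gets wrong
        D /= D.sum()                                # renormalize so weights sum to 1 (Z_t)
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def predict(stumps, alphas, X):
    """Final classifier H(X) = sign(sum_t alpha_t h_t(X)); ties map to 0."""
    votes = sum(a * stump_predict(s, X) for s, a in zip(stumps, alphas))
    return np.sign(votes)
```

Calling adaboost(X, y) on data with labels in {-1, +1} and then predict(stumps, alphas, X) gives the weighted-vote classifier H described above.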
Analyzing training error
The training error of the final classifier is bounded by the exponential loss, a convex upper bound on the 0/1 loss:
  (1/m) Σ_i 1(H(x_i) ≠ y_i) ≤ (1/m) Σ_i exp(-y_i f(x_i)),  where f(x) = Σ_t α_t h_t(x) and H(x) = sign(f(x)).
If boosting can drive this upper bound to 0, then the training error goes to 0 as well.
[Plot: the 0/1 loss and the exp loss as functions of y f(x).]

The upper bound, in turn, equals the product of the normalizers:
  (1/m) Σ_i exp(-y_i f(x_i)) = Π_t Z_t.
Proof idea: unroll the weight update rule D_{t+1}(i) = D_t(i) exp(-α_t y_i h_t(x_i)) / Z_t starting from D_1(i) = 1/m, and use the fact that the weights of all points add to 1.
So if Z_t < 1 on every round, the training error decreases exponentially, even though the individual weak learners may not be good (ε_t close to 0.5).
[Plot: training error and the upper bound Π_t Z_t vs. t.]

What α_t to choose for hypothesis h_t?
Since the training error of the final classifier is bounded by Π_t Z_t, where
  Z_t = Σ_i D_t(i) exp(-α_t y_i h_t(x_i)),
minimizing Π_t Z_t minimizes the training error. We can tighten this bound greedily by choosing α_t and h_t on each iteration to minimize Z_t. For a boolean target function, this is accomplished by [Freund & Schapire '97]
  α_t = ½ ln((1 - ε_t)/ε_t),  which gives Z_t = 2 √(ε_t (1 - ε_t)).
Proof: write Z_t = (1 - ε_t) exp(-α_t) + ε_t exp(α_t), set dZ_t/dα_t = 0, and solve for α_t.

Dumb classifiers made smart
The training error of the final classifier is therefore bounded by
  Π_t Z_t = Π_t 2 √(ε_t (1 - ε_t)) ≤ exp(-2 Σ_t (1/2 - ε_t)²),
and the exponent grows as ε_t moves away from 1/2. So if each classifier is at least slightly better than random (ε_t < 0.5), AdaBoost achieves zero training error exponentially fast in the number of rounds T (see the sketch at the end of these notes). But what about test error?

Boosting results: digit recognition [Schapire, 1989]
[Plot: training and test error vs. number of boosting rounds.]
Boosting is often, but not always, robust to overfitting: the test set error keeps decreasing even after the training error reaches zero.

Generalization error bounds [Freund & Schapire '95]
With high probability,
  error_true(H) ≤ error_train(H) + Õ(√(T d / m)),
where T is the number of boosting rounds, d is the VC dimension of the weak learner (a measure of classifier complexity), and m is the number of training examples. This is a bias-variance tradeoff in T: small T means large bias and small variance, large T means small bias and large variance. The bound says boosting can overfit if T is large, which contradicts the experimental results above (boosting is often robust to overfitting, with test error decreasing even after training error is zero). We need better analysis tools: margin-based bounds.

Margin-based bounds [Schapire, Freund, Bartlett & Lee '98]
With high probability, for any margin θ > 0,
  error_true(H) ≤ Pr_train[margin_f(x, y) ≤ θ] + Õ(√(d / (m θ²))),
where the margin of an example is y f(x) normalized by Σ_t |α_t|. Boosting increases the margin very aggressively, since it concentrates on the hardest examples. If the margin is large, more weak learners agree, so more rounds do not necessarily imply that the final classifier is getting more complex. The bound is independent of the number of rounds T. Boosting can still overfit, however, if the margin is too small, i.e. if the weak learners are too complex or perform arbitrarily close to random guessing.

Boosting: experimental results [Freund & Schapire, 1996]
Comparison on 27 benchmark datasets: C4.5 (decision trees) vs. boosting decision stumps (depth-1 trees), and C4.5 vs. boosting C4.5.
[Scatter plots of test errors for each pair of methods, plus train/test error curves on datasets where boosting overfits.]

Boosting and logistic regression
Logistic regression assumes P(Y = 1 | X) = 1/(1 + exp(-f(X))) and tries to maximize the conditional data likelihood.
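
As a companion to the training-error bound and the margin discussion above, here is a small sketch that tracks the bound Π_t Z_t = Π_t 2√(ε_t(1 - ε_t)), its exponential relaxation, and the normalized margins y f(x)/Σ_t |α_t|. It assumes the hypothetical stump_predict helper from the earlier sketch and a list eps_list of the weighted errors ε_t recorded inside the boosting loop (the earlier sketch would need a one-line change to return them); none of these names come from the lecture.

```python
# Sketch: training error, the bound prod_t Z_t, and normalized margins.
# Assumes stump_predict from the earlier AdaBoost sketch is in scope.
import numpy as np

def training_error_curve(X, y, stumps, alphas):
    """0/1 training error of H(x) = sign(sum_t alpha_t h_t(x)) after each round."""
    f = np.zeros(X.shape[0])
    errors = []
    for s, a in zip(stumps, alphas):
        f += a * stump_predict(s, X)
        errors.append(np.mean(np.sign(f) != y))
    return errors

def error_bounds(eps_list):
    """Curves of the bound prod_t 2*sqrt(eps_t*(1-eps_t)) and its relaxation
    exp(-2 * sum_t (1/2 - eps_t)^2) over rounds."""
    eps = np.asarray(eps_list)
    Zt = 2.0 * np.sqrt(eps * (1.0 - eps))
    return np.cumprod(Zt), np.exp(-2.0 * np.cumsum((0.5 - eps) ** 2))

def normalized_margins(X, y, stumps, alphas):
    """Margins y*f(x) / sum_t |alpha_t|, the quantity the margin bound depends on."""
    f = sum(a * stump_predict(s, X) for s, a in zip(stumps, alphas))
    return y * f / np.sum(np.abs(alphas))
```

Plotting training_error_curve against the two curves from error_bounds illustrates the exponential decay, and a histogram of normalized_margins over rounds illustrates how boosting pushes the margin distribution to the right.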