Machine Learning - 10601
Boosting
Geoff Gordon, Miroslav Dudík
(partly based on slides of Rob Schapire and Carlos Guestrin)
http://www.cs.cmu.edu/~ggordon/10601/
November 9, 2009

Ensembles of trees

BAGGING and RANDOM FORESTS
• learn many big trees
• each tree aims to fit the same target concept
  – random training sets
  – randomized tree growth
• voting ≈ averaging: DECREASE in VARIANCE

BOOSTING
• learn many small trees (weak classifiers)
• each tree “specializes” to a different part of the target concept
  – reweight training examples
  – higher weights where there are still errors
• voting increases expressivity: DECREASE in BIAS

Boosting
• boosting = a general method of converting rough rules of thumb (e.g., decision stumps) into a highly accurate prediction rule
• technically:
  – assume we are given a “weak” learning algorithm that can consistently find classifiers (“rules of thumb”) at least slightly better than random, say, accuracy ≥ 55% (in a two-class setting)
  – given sufficient data, a boosting algorithm can provably construct a single classifier with very high accuracy, say, 99%

AdaBoost [Freund-Schapire 1995]
• weak classifiers = decision stumps (vertical or horizontal half-planes)

A typical run of AdaBoost
• training error rapidly drops (combining weak learners increases expressivity)
• test error does not increase with the number of trees T (robustness to overfitting)

Bounding true error [Freund-Schapire 1997]
• T = number of rounds
• d = VC dimension of the weak learner
• m = number of training examples
• the bound grows with T: true error ≤ training error + Õ(√(T·d/m))

Bounding true error (a first guess)
• a typical run contradicts the naïve bound: test error does not grow with T, even though the bound does

Finer analysis: margins [Schapire et al. 1998]
• empirical evidence: the margin distribution on the training data continues to improve even after training error reaches zero
• theoretical evidence: large margins ⇒ simple classifiers

More technically…
• the bound depends on:
  – d = VC dimension of the weak learner
  – m = number of training examples
  – the entire distribution of training margins
• unlike the previous bound, it does not depend on the number of rounds T

Practical advantages of AdaBoost

Application: detecting faces [Viola-Jones 2001]

Caveats
• “hard” predictions can slow down learning!

Confidence-rated predictions [Schapire-Singer 1999]
• confidence-rated predictions help a lot!

Logistic regression vs AdaBoost
• loss in logistic regression: log loss
• loss in AdaBoost: exponential loss
• benefits of the model-fitting view

What you should know about boosting
• weak classifiers → strong classifiers
  – weak: slightly better than random on training data
  – strong: eventually zero error on training data
• AdaBoost prevents overfitting by increasing margins
• regimes where AdaBoost overfits
  – weak learner too strong: use small trees or stop early
  – data noisy: stop early
• AdaBoost vs logistic regression
  – exponential loss vs log loss
  – single-coordinate updates vs full updates
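
To make the AdaBoost procedure on these slides concrete, here is a minimal sketch of the algorithm with decision-stump weak learners. The function names and the brute-force stump search are illustrative choices for this sketch, not taken from the lecture.

```python
# Minimal AdaBoost sketch with decision-stump weak learners (illustrative only).
import numpy as np

def train_stump(X, y, w):
    """Find the axis-aligned threshold stump with the smallest weighted error.

    X: (n, d) features; y: (n,) labels in {-1, +1}; w: (n,) weights summing to 1.
    Returns (feature index, threshold, sign).
    """
    n, d = X.shape
    best = (0, 0.0, 1, np.inf)               # (feature, threshold, sign, error)
    for j in range(d):
        for thr in np.unique(X[:, j]):
            for sign in (+1, -1):
                pred = sign * np.where(X[:, j] > thr, 1, -1)
                err = np.sum(w[pred != y])    # weighted misclassification error
                if err < best[3]:
                    best = (j, thr, sign, err)
    return best[:3]

def stump_predict(stump, X):
    j, thr, sign = stump
    return sign * np.where(X[:, j] > thr, 1, -1)

def adaboost(X, y, T):
    """Run T rounds of AdaBoost; returns a list of (alpha, stump) pairs."""
    n = len(y)
    w = np.full(n, 1.0 / n)                   # start with uniform example weights
    ensemble = []
    for _ in range(T):
        stump = train_stump(X, y, w)
        pred = stump_predict(stump, X)
        eps = np.sum(w[pred != y])            # weighted error of this weak classifier
        if eps >= 0.5:                        # weak learner no better than random: stop
            break
        alpha = 0.5 * np.log((1 - eps) / (eps + 1e-12))
        ensemble.append((alpha, stump))
        w *= np.exp(-alpha * y * pred)        # upweight mistakes, downweight correct points
        w /= w.sum()                          # renormalize to a distribution
    return ensemble

def predict(ensemble, X):
    score = sum(a * stump_predict(s, X) for a, s in ensemble)
    return np.sign(score)
```

As on the slides, each round the weak learner “specializes” to the currently hard examples because their weights grow, and the final vote is a weighted majority of the stumps.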
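
The logistic regression vs AdaBoost comparison comes down to the loss each applies to the margin y·f(x). The snippet below is a small illustrative sketch (function names are my own) that evaluates the two losses side by side.

```python
# Compare the exponential loss (AdaBoost) and the log loss (logistic regression)
# as functions of the margin y * f(x). Illustrative sketch.
import numpy as np

def exp_loss(margin):
    return np.exp(-margin)                    # AdaBoost: exp(-y * f(x))

def log_loss(margin):
    return np.log(1.0 + np.exp(-margin))      # logistic regression: log(1 + exp(-y * f(x)))

margins = np.linspace(-2, 2, 9)
for m, e, l in zip(margins, exp_loss(margins), log_loss(margins)):
    print(f"margin {m:+.1f}: exp loss {e:6.3f}, log loss {l:6.3f}")
# Both are smooth surrogates for the 0/1 error; the exponential loss penalizes
# large negative margins far more aggressively than the log loss.
```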