Machine Learning - 10601: Boosting
Geoff Gordon, Miroslav Dudík
(partly based on slides of Rob Schapire and Carlos Guestrin)
http://www.cs.cmu.edu/~ggordon/10601/
November 9, 2009

Ensembles of trees
BAGGING and RANDOM FORESTS
• learn many big trees
• each tree aims to fit the same target concept
– random training sets
– randomized tree growth
• voting ≈ averaging: DECREASE in VARIANCE
BOOSTING
• learn many small trees (weak classifiers)
• each tree "specializes" to a different part of the target concept
– reweight training examples
– higher weights where there are still errors
• voting increases expressivity: DECREASE in BIAS

Boosting
• boosting = a general method for converting rough rules of thumb (e.g., decision stumps) into a highly accurate prediction rule
• technically:
– assume we are given a "weak" learning algorithm that can consistently find classifiers ("rules of thumb") at least slightly better than random, say, accuracy ≥ 55% (in the two-class setting)
– given sufficient data, a boosting algorithm can provably construct a single classifier with very high accuracy, say, 99%

AdaBoost [Freund-Schapire 1995]
• weak classifiers = decision stumps (vertical or horizontal half-planes)
[algorithm box and round-by-round demo figures not recoverable from this extraction]

A typical run of AdaBoost
• training error rapidly drops (combining weak learners increases expressivity)
• test error does not increase with the number of trees T (robustness to overfitting)

Bounding true error [Freund-Schapire 1997]
• T = number of rounds
• d = VC dimension of weak learner
• m = number of training examples

Bounding true error (a first guess)
• a typical run "contradicts" the naïve bound

Finer analysis: margins [Schapire et al. 1998]
• empirical evidence: margin distributions on training data
• theoretical evidence: large margins → simple classifiers

More technically…
• new generalization bound depends on:
– d = VC dimension of weak learner
– m = number of training examples
– the entire distribution of training margins
• (previous bound depended on the number of rounds T instead)

Practical advantages of AdaBoost
• application: detecting faces [Viola-Jones 2001]

Caveats
• "hard" predictions can slow down learning!
• confidence-rated predictions [Schapire-Singer 1999] help a lot!

Loss functions
• loss in logistic regression (log loss)
• loss in AdaBoost (exponential loss)
• logistic regression vs AdaBoost: benefits of the model-fitting view

What you should know about boosting
• weak classifiers → strong classifiers
– weak: slightly better than random on training data
– strong: eventually zero error on training data
• AdaBoost prevents overfitting by increasing margins
• regimes when AdaBoost overfits
– weak learner too strong: use smaller trees or stop early
– data noisy: stop early or regularize α_t
• AdaBoost vs logistic regression
– exponential loss vs log loss
– single-coordinate updates vs full updates of all weights
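The AdaBoost loop summarized above can be sketched in a few lines. This is an illustrative reconstruction assuming decision-stump weak learners (the "vertical or horizontal half-planes" from the slides); the function names `best_stump`, `adaboost`, and `predict` are made up for this sketch, not taken from the course code.

```python
import numpy as np

def best_stump(X, y, w):
    """Exhaustively search stumps h(x) = sign if x[:, f] > t else -sign,
    returning the one with minimum weighted training error."""
    best_err, best_h = np.inf, None
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f]):
            for sign in (1, -1):
                pred = np.where(X[:, f] > t, sign, -sign)
                err = w[pred != y].sum()
                if err < best_err:
                    best_err, best_h = err, (f, t, sign)
    return best_err, best_h

def adaboost(X, y, T=20):
    """Labels y must be in {-1, +1}. Returns a list of (alpha, stump) pairs."""
    n = len(y)
    w = np.full(n, 1.0 / n)                # D_1: start from the uniform distribution
    ensemble = []
    for _ in range(T):
        err, (f, t, sign) = best_stump(X, y, w)
        err = max(err, 1e-12)              # guard against a stump with zero error
        alpha = 0.5 * np.log((1.0 - err) / err)
        pred = np.where(X[:, f] > t, sign, -sign)
        w = w * np.exp(-alpha * y * pred)  # up-weight the examples still in error
        w /= w.sum()                       # renormalize to get D_{t+1}
        ensemble.append((alpha, (f, t, sign)))
    return ensemble

def predict(ensemble, X):
    """Sign of the alpha-weighted vote of all stumps."""
    score = np.zeros(len(X))
    for alpha, (f, t, sign) in ensemble:
        score += alpha * np.where(X[:, f] > t, sign, -sign)
    return np.sign(score)
```

On an AND-shaped toy set (label +1 only when both coordinates are 1), no single stump is correct, but a few boosting rounds reach zero training error, illustrating how the weighted vote is more expressive than any individual weak classifier.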
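The exponential-loss vs log-loss contrast in the summary can be made concrete with a small numeric comparison. This is an illustrative sketch (not from the lecture): both losses are convex surrogates of the 0/1 loss as functions of the margin z = y·F(x), and they nearly coincide for large positive margins, but the exponential loss punishes large negative margins far more harshly, which is one intuition for AdaBoost's sensitivity to noisy labels.

```python
import math

def exp_loss(z):
    """AdaBoost's surrogate: exp(-margin)."""
    return math.exp(-z)

def log_loss(z):
    """Logistic regression's surrogate: log(1 + exp(-margin)).
    log1p is used for numerical stability near zero."""
    return math.log1p(math.exp(-z))

# At margin 0 the two differ by a constant factor: exp-loss = 1, log-loss = ln 2.
# At margin -4 the exponential loss is already an order of magnitude larger.
for z in (-4, -1, 0, 1, 4):
    print(f"margin {z:+d}: exp-loss {exp_loss(z):8.3f}, log-loss {log_loss(z):6.3f}")
```

The growing gap at negative margins is also why regularizing or stopping early matters more for AdaBoost than for logistic regression on noisy data, as the summary slide suggests.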