Boosting
Aarti Singh
Machine Learning 10-701/15-781
Oct 11, 2010
Slides courtesy: Carlos Guestrin, Freund & Schapire

Can we make dumb learners smart?
Project proposal due today!

Why boost weak learners?
Goal: automatically categorize the type of call requested (collect, calling card, person-to-person, etc.)
• It is easy to find "rules of thumb" that are "often" correct.
  E.g., if 'card' occurs in the utterance, then predict 'calling card'.
• It is hard to find a single highly accurate prediction rule.

Fighting the bias-variance tradeoff
• Simple (a.k.a. weak) learners, e.g., naïve Bayes, logistic regression, decision stumps (or shallow decision trees):
  – are good: low variance, don't usually overfit
  – are bad: high bias, can't solve hard learning problems
• Can we make weak learners always good?
  – No! But often yes…

Voting (Ensemble Methods)
• Instead of learning a single (weak) classifier, learn many weak classifiers that are good at different parts of the input space
• Output class: (weighted) vote of each classifier
  – Classifiers that are most "sure" will vote with more conviction
  – Classifiers will be most "sure" about a particular part of the space
  – On average, do better than a single classifier!
• With H: X → Y, Y ∈ {−1, 1}, combine weak classifiers h_1(X), h_2(X), … as
  H(X) = sign( Σ_t α_t h_t(X) )
  where the α_t are the voting weights (unweighted special case: H(X) = sign(h_1(X) + h_2(X)))
• But how do you
  – force the classifiers h_t to learn about different parts of the input space?
  – weigh the votes of the different classifiers?
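As a concrete illustration of the weighted vote, here is a minimal Python sketch. The two decision stumps and their weights below are made up for illustration; they are not from the lecture.

```python
import numpy as np

def weighted_vote(X, hypotheses, alphas):
    """Combine weak classifiers h_t (each mapping X -> {-1,+1})
    into H(X) = sign(sum_t alpha_t * h_t(X))."""
    scores = sum(a * h(X) for h, a in zip(hypotheses, alphas))
    return np.sign(scores)

# Two toy decision stumps on a 1-D input, with hand-picked weights.
h1 = lambda X: np.where(X[:, 0] > 0.0, 1.0, -1.0)
h2 = lambda X: np.where(X[:, 0] > 2.0, -1.0, 1.0)
X = np.array([[-1.0], [1.0], [3.0]])
print(weighted_vote(X, [h1, h2], [0.7, 0.3]))  # the x = -1 point is decided by h1's larger weight
```

Note that the more heavily weighted stump h1 wins the disagreement on the first and third points; the open questions above (how to pick the h_t and the α_t) are exactly what boosting answers.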
Boosting [Schapire '89]
• Idea: given a weak learner, run it multiple times on (reweighted) training data, then let the learned classifiers vote
• On each iteration t:
  – weight each training example by how incorrectly it was classified
  – learn a weak hypothesis h_t
  – assign a strength α_t to this hypothesis
• Final classifier: H(X) = sign( Σ_t α_t h_t(X) )
• Practically useful
• Theoretically interesting

Learning from weighted data
• Consider a weighted dataset
  – D(i) – weight of the i-th training example (x_i, y_i)
  – Interpretations:
    • the i-th training example counts as D(i) examples
    • if I were to "resample" the data, I would get more samples of "heavier" data points
• Now, in all calculations, whenever used, the i-th training example counts as D(i) "examples"
  – e.g., in MLE, redefine Count(Y = y) to be a weighted count:
    Unweighted data: Count(Y = y) = Σ_{i=1}^m 1(y_i = y)
    Weighted data:  Count(Y = y) = Σ_{i=1}^m D(i) 1(y_i = y)

AdaBoost [Freund & Schapire '95]
• Initially equal weights: D_1(i) = 1/m
• For t = 1, …, T:
  – Train a weak learner h_t on the weighted data (e.g., naïve Bayes, decision stump)
  – Compute the weighted training error ε_t = Σ_i D_t(i) 1(h_t(x_i) ≠ y_i)
  – Choose a strength α_t ("magic", positive; derived below)
  – Update the weights: D_{t+1}(i) = D_t(i) exp(−α_t y_i h_t(x_i)) / Z_t
    • the weight increases if h_t is wrong on point i (y_i h_t(x_i) = −1 < 0)
    • Z_t normalizes so that the weights of all points sum to 1: Σ_i D_{t+1}(i) = 1
• Final classifier: H(X) = sign( Σ_t α_t h_t(X) )

What α_t to choose for hypothesis h_t?
Weight update rule [Freund & Schapire '95]: α_t = ½ ln( (1 − ε_t) / ε_t ), where ε_t is the weighted training error (does h_t get the i-th point wrong?)
• ε_t = 0: h_t perfectly classifies all weighted data points ⇒ α_t = ∞
• ε_t = 1: h_t is perfectly wrong ⇒ −h_t is perfectly right ⇒ α_t = −∞
• ε_t = 0.5 ⇒ α_t = 0

Boosting Example (Decision Stumps)
[figure: successive rounds of boosting with decision stumps on a 2-D toy dataset, omitted]

Analysis reveals:
• what α_t to choose for hypothesis h_t (in terms of ε_t, the weighted training error)
• if each weak learner h_t is slightly better than random guessing (ε_t < 0.5), then the training error of AdaBoost decays exponentially fast in the number of rounds T.
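The AdaBoost loop above can be sketched end-to-end in Python. This is a minimal illustration, not the lecture's reference implementation: the weak learner is a brute-force decision stump, and the toy 1-D interval dataset is made up.

```python
import numpy as np

def train_stump(X, y, D):
    """Weak learner: decision stump minimizing the weighted 0/1 error.
    Returns ((feature, threshold, polarity), weighted error eps_t)."""
    m, n = X.shape
    best, best_err = None, np.inf
    for j in range(n):
        for thr in np.unique(X[:, j]):
            for pol in (1.0, -1.0):
                pred = np.where(X[:, j] > thr, pol, -pol)
                err = np.sum(D * (pred != y))
                if err < best_err:
                    best_err, best = err, (j, thr, pol)
    return best, best_err

def stump_predict(stump, X):
    j, thr, pol = stump
    return np.where(X[:, j] > thr, pol, -pol)

def adaboost(X, y, T):
    """AdaBoost: returns the stumps h_t and their strengths alpha_t."""
    m = len(y)
    D = np.full(m, 1.0 / m)                  # initially equal weights D_1(i) = 1/m
    stumps, alphas = [], []
    for t in range(T):
        stump, eps = train_stump(X, y, D)    # weighted training error eps_t
        eps = np.clip(eps, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - eps) / eps)  # alpha_t = 1/2 ln((1 - eps_t)/eps_t)
        pred = stump_predict(stump, X)
        D *= np.exp(-alpha * y * pred)       # up-weight points h_t gets wrong
        D /= D.sum()                         # normalize by Z_t so weights sum to 1
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def predict(stumps, alphas, X):
    """Final classifier H(X) = sign(sum_t alpha_t h_t(X))."""
    score = sum(a * stump_predict(s, X) for s, a in zip(stumps, alphas))
    return np.sign(score)

# Toy 1-D data a single stump cannot separate (positive interval in the middle):
X = np.array([[-2.0], [-0.5], [0.5], [2.0]])
y = np.array([-1.0, 1.0, 1.0, -1.0])
stumps, alphas = adaboost(X, y, T=3)
print(predict(stumps, alphas, X))  # all four points classified correctly after 3 rounds
```

No single axis-aligned stump separates this data, yet three boosted stumps reach zero training error, which is the exponential-decay behavior analyzed next.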
Analyzing training error
The training error of the final classifier is bounded by:
  (1/m) Σ_i 1(H(x_i) ≠ y_i) ≤ (1/m) Σ_i exp(−y_i f(x_i))
where f(x) = Σ_t α_t h_t(x), so H(x) = sign(f(x)).
• The exp loss is a convex upper bound on the 0/1 loss
• If boosting can make the upper bound → 0, then the training error → 0
Proof: for each point i, 1(H(x_i) ≠ y_i) = 1(y_i f(x_i) ≤ 0) ≤ exp(−y_i f(x_i)).

Analyzing training error…
The training error of the final classifier is bounded by:
  (1/m) Σ_i exp(−y_i f(x_i)) = Π_{t=1}^T Z_t
where Z_t = Σ_i D_t(i) exp(−α_t y_i h_t(x_i)) is the normalizer from round t.
Proof: applying the weight update rule repeatedly gives D_{T+1}(i) = (1/m) exp(−y_i f(x_i)) / Π_t Z_t, and the weights of all points add to 1.

Analyzing training error
The training error of the final classifier is bounded by Π_t Z_t:
• If Z_t < 1, the training error decreases exponentially (even though the weak learners may not be good, ε_t ≈ 0.5)
• If we minimize Π_t Z_t, we minimize our training error
• We can tighten this bound greedily, by choosing α_t and h_t on each iteration to minimize Z_t

What α_t to choose for hypothesis h_t?
We can minimize this bound by choosing α_t on each iteration to minimize Z_t. For a boolean target function, this is accomplished by [Freund & Schapire '97]:
  α_t = ½ ln( (1 − ε_t) / ε_t )
Proof: Z_t = (1 − ε_t) exp(−α_t) + ε_t exp(α_t); setting dZ_t/dα_t = 0 yields the α_t above, and substituting it back in gives
  Z_t = 2 √( ε_t (1 − ε_t) )

What α_t to choose for hypothesis h_t?
The training error of the final classifier is bounded by:
  (1/m) Σ_i 1(H(x_i) ≠ y_i) ≤ Π_t 2 √( ε_t (1 − ε_t) ) = Π_t √(1 − 4 γ_t²) ≤ exp(−2 Σ_t γ_t²)
where γ_t = ½ − ε_t grows as ε_t moves away from 1/2.

Dumb classifiers made Smart
If each classifier is (at least slightly) better than random (ε_t < 0.5), AdaBoost will achieve zero training error exponentially fast (in the number of rounds T)!!
What about test error?

Boosting results – Digit recognition [Schapire, 1989]
• Boosting is often, but not always,
  – robust to overfitting
  – test set error decreases even after the training error is zero
[figure: test error and training error vs. number of boosting rounds, omitted]

Generalization Error Bounds [Freund & Schapire '95]
  err_true(H) ≤ err_train(H) + Õ( √( T·d / m ) )
• T – number of boosting rounds
• d – VC dimension of the weak learner, measures the complexity of the classifier
• m – number of training examples
• Bias-variance tradeoff: T small ⇒ training error large (bias), complexity term small; T large ⇒ training error small, complexity term large (variance)

Generalization Error Bounds
• The bound suggests boosting can overfit if T is large
• This contradicts experimental results: boosting is often
  – robust to overfitting
  – test set error decreases even after the training error is zero
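The choice of α_t and the resulting training-error bound are easy to sanity-check numerically. A small sketch (the value ε = 0.3 is arbitrary, chosen only for illustration):

```python
import math

def alpha_star(eps):
    """alpha_t = 1/2 ln((1 - eps_t)/eps_t), the minimizer of Z_t."""
    return 0.5 * math.log((1 - eps) / eps)

def Z(eps, alpha):
    """Normalizer Z_t = (1 - eps_t) e^{-alpha} + eps_t e^{alpha}."""
    return (1 - eps) * math.exp(-alpha) + eps * math.exp(alpha)

eps = 0.3
a = alpha_star(eps)

# At the optimal alpha, Z_t = 2 sqrt(eps_t (1 - eps_t)):
print(round(Z(eps, a), 6), round(2 * math.sqrt(eps * (1 - eps)), 6))

# Any other alpha gives a larger Z_t (a is the minimizer):
print(Z(eps, a) < Z(eps, a + 0.1) and Z(eps, a) < Z(eps, a - 0.1))

# Exponential decay: with eps_t = 0.3 every round (gamma_t = 0.2),
# the bound after T rounds is (2 sqrt(0.21))^T <= exp(-2 T gamma^2).
gamma = 0.5 - eps
for T in (10, 50):
    print((2 * math.sqrt(eps * (1 - eps))) ** T <= math.exp(-2 * T * gamma ** 2))
```

Since Z_t = 2√(ε_t(1 − ε_t)) < 1 whenever ε_t < 0.5, the product Π_t Z_t, and hence the training error, is driven to zero geometrically.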
• Need better analysis tools – margin based bounds

Margin Based Bounds [Freund & Schapire '95]
With high probability,
  err_true(H) ≤ Pr_train[ margin_f(x, y) ≤ θ ] + Õ( √( d / (m θ²) ) )
where margin_f(x, y) = y Σ_t α_t h_t(x) / Σ_t |α_t| ∈ [−1, 1].
• Boosting increases the margin very aggressively, since it concentrates on the hardest examples.
• If the margin is large, more weak learners agree, and hence more rounds do not necessarily imply that the final classifier is getting more complex.
• The bound is independent of the number of rounds T!
• Boosting can still overfit if the margin is too small.
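The normalized margin in the bound above is simple to compute. A minimal sketch; the weak-learner outputs and weights below are hypothetical values for illustration:

```python
import numpy as np

def normalized_margins(preds, alphas, y):
    """Normalized margin y * f(x) / sum_t |alpha_t|, which lies in [-1, 1].
    preds: (T, m) array of weak-learner outputs h_t(x_i) in {-1, +1}."""
    alphas = np.asarray(alphas, dtype=float)
    f = alphas @ preds / np.abs(alphas).sum()
    return y * f

# Hypothetical outputs of T = 2 weak learners on m = 3 points:
preds = np.array([[ 1,  1, -1],
                  [ 1, -1, -1]])
alphas = [0.8, 0.4]
y = np.array([1, 1, -1])
print(normalized_margins(preds, alphas, y))
```

Points where all weak learners agree get margin 1, while the middle point, where the learners disagree, gets a small margin of 1/3; those small-margin points are exactly the ones the bound penalizes and that boosting works hardest on.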