CMU CS 10701 - Boosting

Boosting
Aarti Singh
Machine Learning 10-701/15-781
Oct 11, 2010
Slides courtesy: Carlos Guestrin, Freund & Schapire

Can we make dumb learners smart?
(Project proposal due today!)

Why boost weak learners?

Goal: Automatically categorize the type of call requested (Collect, Calling card, Person-to-person, etc.)
• Easy to find "rules of thumb" that are "often" correct.
  E.g., if 'card' occurs in the utterance, then predict 'calling card'.
• Hard to find a single highly accurate prediction rule.

Fighting the bias-variance tradeoff

• Simple (a.k.a. weak) learners, e.g., naïve Bayes, logistic regression, decision stumps (or shallow decision trees)
  – are good: low variance, don't usually overfit
  – are bad: high bias, can't solve hard learning problems
• Can we make weak learners always good?
  – No! But often yes…

Voting (Ensemble Methods)

• Instead of learning a single (weak) classifier, learn many weak classifiers that are good at different parts of the input space
• Output class: (weighted) vote of each classifier
  – Classifiers that are most "sure" will vote with more conviction
  – Classifiers will be most "sure" about a particular part of the space
  – On average, do better than a single classifier!
• With H: X → Y, Y ∈ {-1, +1}, two component classifiers h1(X) and h2(X) can be combined as an unweighted vote, H(X) = sign(h1(X) + h2(X)), or more generally as a weighted vote, H(X) = sign(∑t αt ht(X)), where the αt are the voting weights.
• But how do you
  – force the classifiers ht to learn about different parts of the input space?
  – weigh the votes of different classifiers?
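As a concrete illustration of the weighted vote just described, here is a minimal Python sketch (not from the slides); the two stump-like classifiers and the weights are hypothetical stand-ins for whatever weak learners an ensemble would actually use.

```python
import numpy as np

def weighted_vote(X, hypotheses, alphas):
    """Combine weak classifiers by a weighted vote:
    H(X) = sign(sum_t alpha_t * h_t(X)).

    hypotheses: list of functions mapping an array of inputs to {-1, +1}
    alphas:     list of voting weights, one per hypothesis
    """
    scores = sum(alpha * h(X) for h, alpha in zip(hypotheses, alphas))
    return np.sign(scores)

# Toy example: two hand-made threshold classifiers on 1-D inputs,
# each "sure" about a different part of the input space.
h1 = lambda X: np.where(X > 0.3, 1, -1)
h2 = lambda X: np.where(X < 0.7, -1, 1)
X = np.array([0.1, 0.5, 0.9])
print(weighted_vote(X, [h1, h2], [1.0, 0.5]))  # arbitrary weights, just to break ties
```

Boosting, described next, is one answer to the two questions above: it chooses which parts of the space each ht focuses on and what its weight αt should be.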
Boosting [Schapire '89]

• Idea: given a weak learner, run it multiple times on (reweighted) training data, then let the learned classifiers vote
• On each iteration t:
  – weight each training example by how incorrectly it was classified
  – learn a weak hypothesis ht
  – assign a strength αt to this hypothesis
• Final classifier: H(X) = sign(∑t αt ht(X))
• Practically useful
• Theoretically interesting

Learning from weighted data

• Consider a weighted dataset
  – D(i) – weight of the i-th training example (xi, yi)
  – Interpretations:
    • the i-th training example counts as D(i) examples
    • if I were to "resample" the data, I would get more samples of "heavier" data points
• Now, in all calculations, the i-th training example counts as D(i) "examples"
  – e.g., in MLE redefine Count(Y = y) to be a weighted count:
    Unweighted data:  Count(Y = y) = ∑i=1..m 1(Yi = y)
    With weights D(i): Count(Y = y) = ∑i=1..m D(i) 1(Yi = y)

AdaBoost [Freund & Schapire '95]

• Start with equal weights on all points: D1(i) = 1/m
• On each round t, train a weak learner (e.g., naïve Bayes, decision stump) on the weighted data and measure its weighted training error
  εt = ∑i Dt(i) 1(ht(xi) ≠ yi)
• Choose a "magic" (positive) strength αt and update the weights, increasing the weight of point i if ht gets it wrong (i.e., yi ht(xi) = -1 < 0):
  Dt+1(i) = Dt(i) exp(-αt yi ht(xi)) / Zt,
  where Zt is a normalizer so the weights of all points sum to 1: ∑i Dt+1(i) = 1

What αt to choose for hypothesis ht?

• εt is the weighted training error (how often does ht get a point wrong, counted by weight)
• εt = 0: ht perfectly classifies all weighted data points → αt = ∞
• εt = 1: ht is perfectly wrong, so -ht is perfectly right → αt = -∞
• εt = 0.5 → αt = 0

Boosting Example (Decision Stumps)

Analysis reveals:
• what αt to choose for hypothesis ht, in terms of the weighted training error εt
• if each weak learner ht is slightly better than random guessing (εt < 0.5), then the training error of AdaBoost decays exponentially fast in the number of rounds T

Analyzing training error

• The training error of the final classifier is bounded by the exponential loss, a convex upper bound on the 0/1 loss:
  (1/m) ∑i 1(H(xi) ≠ yi) ≤ (1/m) ∑i exp(-yi f(xi)),  where f(X) = ∑t αt ht(X)
• If boosting can drive this upper bound → 0, then the training error → 0.
• Using the weight update rule and the fact that the weights of all points add to 1, the bound telescopes:
  (1/m) ∑i exp(-yi f(xi)) = ∏t Zt
• If Zt < 1 on every round, the training error decreases exponentially, even though the weak learners may not be good (εt ≈ 0.5).
• If we minimize ∏t Zt, we minimize our training error. We can tighten this bound greedily, by choosing αt and ht on each iteration to minimize Zt.

What αt to choose for hypothesis ht?

• We can minimize this bound by choosing αt on each iteration to minimize Zt.
• For a boolean target function, this is accomplished by [Freund & Schapire '97]:
  αt = ½ ln((1 - εt) / εt),  which gives Zt = 2 √(εt (1 - εt))
• The training error of the final classifier is then bounded by:
  (1/m) ∑i 1(H(xi) ≠ yi) ≤ ∏t 2 √(εt (1 - εt)) ≤ exp(-2 ∑t (½ - εt)²)

Dumb classifiers made smart

• If each classifier is (at least slightly) better than random, εt < 0.5, AdaBoost achieves zero training error exponentially fast in the number of rounds T; the exponent grows as εt moves away from 1/2.
• What about test error?

Boosting results – Digit recognition [Schapire, 1989]

• Boosting is often
  – robust to overfitting
  – test set error decreases even after training error is zero
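To make the update rules above concrete, here is a minimal NumPy sketch of AdaBoost with decision stumps, following the quantities named in the slides (Dt, εt, αt, Zt). The helper names and the brute-force stump search are my own illustration, not code from the course.

```python
import numpy as np

def train_stump(X, y, D):
    """Weak learner: pick the (feature, threshold, sign) decision stump
    with the lowest weighted training error under weights D."""
    m, d = X.shape
    best = None
    for j in range(d):
        for thresh in np.unique(X[:, j]):
            for sign in (+1, -1):
                pred = np.where(X[:, j] > thresh, sign, -sign)
                err = np.sum(D * (pred != y))
                if best is None or err < best[0]:
                    best = (err, j, thresh, sign)
    return best  # (weighted error, feature index, threshold, sign)

def stump_predict(stump, X):
    _, j, thresh, sign = stump
    return np.where(X[:, j] > thresh, sign, -sign)

def adaboost(X, y, T=20):
    """AdaBoost: reweight the data each round, keep (alpha_t, h_t) pairs."""
    m = X.shape[0]
    D = np.full(m, 1.0 / m)            # initially equal weights, D_1(i) = 1/m
    ensemble = []
    for t in range(T):
        stump = train_stump(X, y, D)
        eps = max(stump[0], 1e-12)      # weighted training error eps_t (clamped away from 0)
        alpha = 0.5 * np.log((1 - eps) / eps)   # alpha_t = 1/2 ln((1 - eps_t)/eps_t)
        pred = stump_predict(stump, X)
        # Increase weight where wrong (y_i h_t(x_i) = -1), decrease where right,
        # then renormalize so the weights sum to 1 (division plays the role of Z_t).
        D = D * np.exp(-alpha * y * pred)
        D = D / D.sum()
        ensemble.append((alpha, stump))
    return ensemble

def predict(ensemble, X):
    """Final classifier: H(X) = sign(sum_t alpha_t h_t(X))."""
    score = sum(alpha * stump_predict(stump, X) for alpha, stump in ensemble)
    return np.sign(score)

# Tiny toy check: labels in {-1, +1}, one feature.
X = np.array([[0.1], [0.2], [0.4], [0.6], [0.8], [0.9]])
y = np.array([-1, -1, -1, 1, 1, 1])
H = adaboost(X, y, T=5)
print(predict(H, X))   # should reproduce y on this separable toy data
```

The exhaustive threshold search keeps the weak learner simple; any classifier that can be trained on weighted data (naïve Bayes, shallow trees) could be dropped in for `train_stump`.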
…but not always. (Plot: training error and test error vs. number of boosting rounds.)

Generalization Error Bounds [Freund & Schapire '95]

• With high probability, the test error is bounded by the training error plus a complexity term that grows like √(T d / m), where
  – T – number of boosting rounds
  – d – VC dimension of the weak learner, which measures the complexity of the classifier
  – m – number of training examples
• T small: training error (bias) large, complexity term (variance) small. T large: bias small, variance large. A bias-variance tradeoff.
• So this bound says boosting can overfit if T is large. That contradicts the experimental results: boosting is often
  – robust to overfitting
  – test set error decreases even after training error is zero
• Need better analysis tools – margin-based bounds.

Margin Based Bounds

• Boosting increases the margin very aggressively, since it concentrates on the hardest examples.
• If the margin is large, more weak learners agree, and hence more rounds does not necessarily imply that the final classifier is getting more complex.
• The margin-based bound is independent of the number of rounds T!
• Boosting can still overfit if the margin is …
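For reference, the bounds these last slides allude to have the following commonly stated forms. This is my reconstruction from the cited line of work (Freund & Schapire '95 for the T-dependent bound; Schapire, Freund, Bartlett & Lee '98 for the margin bound), not equations taken from the truncated preview, and the Õ notation hides logarithmic factors.

```latex
% VC-style generalization bound behind the bias-variance discussion:
% test error grows with the number of rounds T and weak-learner VC dimension d.
\mathrm{err}_{\mathrm{test}}(H) \;\le\; \mathrm{err}_{\mathrm{train}}(H)
  \;+\; \tilde{O}\!\left(\sqrt{\tfrac{T\,d}{m}}\right)

% Margin-based bound: for any margin level \theta > 0, with the normalized vote
% f(x) = \sum_t \alpha_t h_t(x) / \sum_t \alpha_t, the bound has no dependence on T.
\Pr_{\mathcal{D}}\!\left[\, y f(x) \le 0 \,\right] \;\le\;
  \Pr_{S}\!\left[\, y f(x) \le \theta \,\right]
  \;+\; \tilde{O}\!\left(\sqrt{\tfrac{d}{m\,\theta^{2}}}\right)
```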

