Machine Learning 10-701/15-781, Spring 2008
Boosting from Weak Learners
Eric Xing
Lecture 10, February 18, 2008
Reading: Chap. 14.3, C.B. book

Rationale: Combination of methods
- There is no algorithm that is always the most accurate.
- We can select simple "weak" classification or regression methods and combine them into a single "strong" method.
- Different learners use different
  - algorithms
  - hyperparameters
  - representations (modalities)
  - training sets
  - subproblems
- The problem: how to combine them.

Some early algorithms
- Boosting by filtering (Schapire 1990)
  - Run the weak learner on differently filtered example sets.
  - Combine the weak hypotheses.
  - Requires knowledge of the performance of the weak learner.
- Boosting by majority (Freund 1995)
  - Run the weak learner on a weighted example set.
  - Combine the weak hypotheses linearly.
  - Requires knowledge of the performance of the weak learner.
- Bagging (Breiman 1996)
  - Run the weak learner on bootstrap replicates of the training set.
  - Average the weak hypotheses.
  - Reduces variance.

Combination of classifiers
- Suppose we have a family of component classifiers (generating ±1 labels), such as decision stumps:
    h(x; \theta) = \mathrm{sign}(w x_k + b), \quad \theta = \{k, w, b\}
- Each decision stump pays attention to only a single component of the input vector.

Combination of classifiers cont'd
- We'd like to combine the simple classifiers additively, so that the final classifier is the sign of
    \hat{h}(x) = \alpha_1 h(x; \theta_1) + \cdots + \alpha_m h(x; \theta_m),
  where the "votes" {\alpha_i} emphasize component classifiers that make more reliable predictions than others.
- Important issues:
  - What is the criterion that we are optimizing? (measure of loss)
  - We would like to estimate each new component classifier in the same manner (modularity).

Measurement of error
- Loss function: \lambda(y, h(x)), e.g. the 0/1 loss I(y \neq h(x)).
- Generalization error: L(h) = E[\lambda(y, h(x))].
- Objective: find h with minimum generalization error.
- Main boosting idea: minimize the empirical error
    \hat{L}(h) = \frac{1}{N} \sum_{i=1}^{N} \lambda(y_i, h(x_i))
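To make the decision-stump family and the empirical 0/1 loss concrete, here is a minimal NumPy sketch (not from the slides; the function names and the convention that X is an N-by-d array with ±1 labels in y are assumptions made for illustration):

```python
import numpy as np

def stump_predict(X, theta):
    """Decision stump h(x; theta) = sign(w * x_k + b), with theta = (k, w, b)."""
    k, w, b = theta
    return np.where(w * X[:, k] + b >= 0.0, 1.0, -1.0)   # +/-1 labels

def additive_predict(X, stumps, alphas):
    """Sign of the weighted vote alpha_1 h(x; theta_1) + ... + alpha_m h(x; theta_m)."""
    votes = sum(a * stump_predict(X, th) for a, th in zip(alphas, stumps))
    return np.where(votes >= 0.0, 1.0, -1.0)

def empirical_error(y, y_pred):
    """Empirical 0/1 loss: (1/N) * sum_i I(y_i != h(x_i))."""
    return float(np.mean(y != y_pred))
```

A single stump is a very weak classifier on its own; the rest of the lecture is about how to choose the stumps and their votes \alpha_i.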
Exponential Loss
- Empirical loss:
    \hat{L}(h) = \frac{1}{N} \sum_{i=1}^{N} \lambda(y_i, h(x_i))
- Another possible measure of empirical loss is the exponential loss
    \hat{L}(h) = \sum_{i=1}^{n} \exp\{-y_i \hat{h}_m(x_i)\}

Exponential Loss cont'd
- The combined classifier based on m-1 iterations defines a weighted loss criterion for the next simple classifier to add. Recalling that \hat{h}_m(x) = \alpha_1 h(x; \theta_1) + \cdots + \alpha_m h(x; \theta_m):
    \hat{L}(h) = \sum_{i=1}^{n} \exp\{-y_i \hat{h}_m(x_i)\}
               = \sum_{i=1}^{n} \exp\{-y_i (\hat{h}_{m-1}(x_i) + \alpha_m h(x_i; \theta_m))\}
               = \sum_{i=1}^{n} \exp\{-y_i \hat{h}_{m-1}(x_i)\} \exp\{-y_i \alpha_m h(x_i; \theta_m)\}
               = \sum_{i=1}^{n} W_i^{m-1} \exp\{-y_i \alpha_m h(x_i; \theta_m)\},
  where W_i^{m-1} = \exp\{-y_i \hat{h}_{m-1}(x_i)\}.
- Each training sample is weighted by its "classifiability" (or difficulty) as seen by the classifier we have built so far.

Linearization of the loss function
- We can simplify the estimation criterion for the new component classifier a bit (assuming \alpha_m is small):
    \exp\{-y_i \alpha_m h(x_i; \theta_m)\} \approx 1 - y_i \alpha_m h(x_i; \theta_m)
- Now our empirical loss criterion reduces to
    \sum_{i=1}^{n} W_i^{m-1} \exp\{-y_i \alpha_m h(x_i; \theta_m)\}
      \approx \sum_{i=1}^{n} W_i^{m-1} - \alpha_m \sum_{i=1}^{n} W_i^{m-1} y_i h(x_i; \theta_m)
- We could choose the new component classifier to optimize this weighted agreement.

A possible algorithm
- At stage m we find \theta^* that maximizes (or at least gives a sufficiently high value of) the weighted agreement
    \sum_{i=1}^{n} W_i^{m-1} y_i h(x_i; \theta^*)
  - Each sample is weighted by its "difficulty" under the previously combined m-1 classifiers.
  - More "difficult" samples receive heavier attention, as they dominate the total loss.
- Then we go back and find the "votes" \alpha_m^* associated with the new classifier by minimizing the original weighted (exponential) loss
    \hat{L}(h) = \sum_{i=1}^{n} W_i^{m-1} \exp\{-y_i \alpha_m h(x_i; \theta_m)\}

Boosting
- We have basically derived a boosting algorithm that sequentially adds new component classifiers, each trained on reweighted training examples.
  - Each component classifier is presented with a slightly different problem.
- AdaBoost preliminaries:
  - We work with normalized weights W_i on the training examples, initially uniform (W_i = 1/n).
  - The weights reflect the "degree of difficulty" of each datum for the latest classifier.

The AdaBoost algorithm
- At the k-th iteration we find (any) classifier h(x; \theta_k^*) for which the weighted classification error
    \epsilon_k = \frac{1}{2} \left( 1 - \sum_{i=1}^{n} W_i^{k-1} y_i h(x_i; \theta_k^*) \right)
  is better than chance.
  - This is meant to be "easy": a weak classifier.
- Determine how many "votes" to assign to the new component classifier:
    \alpha_k = \frac{1}{2} \log\big((1 - \epsilon_k) / \epsilon_k\big)
  - A stronger classifier gets more votes.
- Update the weights on the training examples:
    W_i^{k} = W_i^{k-1} \exp\{-y_i \alpha_k h(x_i; \theta_k)\}

The AdaBoost algorithm cont'd
- The final classifier after m boosting iterations is given by the sign of
    \hat{h}(x) = \frac{\alpha_1 h(x; \theta_1) + \cdots + \alpha_m h(x; \theta_m)}{\alpha_1 + \cdots + \alpha_m}
- The votes here are normalized for convenience.

AdaBoost: summary
- Input:
  - N examples S_N = {(x_1, y_1), ..., (x_N, y_N)}
  - a weak base learner h = h(x, \theta)
- Initialize: equal example weights w_i = 1/N for all i = 1..N.
- Iterate for t = 1...T:
  1. Train the base learner on the weighted example set (w_t, x) and obtain hypothesis h_t = h(x, \theta_t).
  2. Compute the hypothesis error \epsilon_t.
  3. Compute the hypothesis weight \alpha_t.
  4. Update the example weights w_{t+1} for the next iteration.
- Output: the final hypothesis, a linear combination of the h_t (a code sketch of this loop follows).
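Here is a minimal end-to-end sketch of the summary loop above, using decision stumps as the weak learner. This is not the official course code: the exhaustive stump search in train_weighted_stump, the numerical clipping of \epsilon_t, and the explicit weight renormalization are implementation assumptions added for the sketch.

```python
import numpy as np

def stump_predict(X, theta):
    """Decision stump h(x; theta) = sign(w * x_k + b), theta = (k, w, b)."""
    k, w, b = theta
    return np.where(w * X[:, k] + b >= 0.0, 1.0, -1.0)

def train_weighted_stump(X, y, weights):
    """Assumed weak learner: exhaustive search for the stump with the
    smallest weighted 0/1 error on the weighted example set."""
    best_theta, best_err = None, np.inf
    for k in range(X.shape[1]):
        for v in np.unique(X[:, k]):
            for w in (1.0, -1.0):          # direction of the split
                theta = (k, w, -w * v)     # threshold at x_k = v
                err = np.sum(weights * (stump_predict(X, theta) != y))
                if err < best_err:
                    best_theta, best_err = theta, err
    return best_theta

def adaboost(X, y, T):
    """AdaBoost as in the summary slide; y must contain +/-1 labels."""
    n = len(y)
    w = np.full(n, 1.0 / n)                       # initialize: uniform weights
    stumps, alphas = [], []
    for _ in range(T):
        theta = train_weighted_stump(X, y, w)     # 1. train base learner
        pred = stump_predict(X, theta)
        eps = np.clip(np.sum(w * (pred != y)), 1e-12, 1 - 1e-12)  # 2. error
        alpha = 0.5 * np.log((1.0 - eps) / eps)   # 3. votes for this stump
        w = w * np.exp(-alpha * y * pred)         # 4. reweight the examples
        w /= w.sum()                              # keep weights normalized
        stumps.append(theta)
        alphas.append(alpha)
    return stumps, alphas

def boosted_predict(X, stumps, alphas):
    """Sign of the weighted vote of all component stumps."""
    votes = sum(a * stump_predict(X, th) for a, th in zip(alphas, stumps))
    return np.where(votes >= 0.0, 1.0, -1.0)
```

Note that for normalized weights the weighted misclassification rate \epsilon_t used here equals the slide's \frac{1}{2}(1 - \sum_i W_i y_i h(x_i; \theta_t)), since \sum_i W_i y_i h(x_i; \theta_t) = 1 - 2\epsilon_t.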
AdaBoost: dataflow diagram
[Figure: example weights w_1, w_2, ..., w_T feeding successive calls A(w, S) to the base learner.]

Boosting: examples
[Figures: boosting example over successive iterations, continued across two slides.]

Base Learners
- Weak learners used in practice:
  - Decision stumps (axis-parallel splits)
  - Decision trees (e.g. C4.5, Quinlan 1996)
  - Multi-layer neural networks
  - Radial basis function networks
- Can base learners operate on weighted examples?
  - In many cases they can be modified to accept weights along with the examples.
  - In general, we can sample the examples (with replacement) according to the distribution defined by the weights (see the resampling sketch at the end of this preview).

Boosting performance
- The error rate of the component classifiers (the decision stumps) does not improve much (if at all) over time.
- But both the training and the testing error of the combined classifier improve over time!
- Even after the training error of the combined classifier goes to zero, further boosting iterations can still improve the generalization error!

Why is it working?
- You will need some learning theory (to be covered in the next two lectures) to understand this fully, but for now let's just go over some high-level ideas.
- Generalization error: with high probability, the generalization error is bounded by the training error plus a complexity term that grows with the number of boosting rounds T.
- As T goes up, our bound becomes worse, so boosting should overfit!
[Figure: training error and test error versus boosting iterations.]
The Boosting Approach to Machine ...
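As a companion to the Base Learners point above, here is a minimal sketch of drawing a training set according to the current weight distribution, so that a base learner that cannot consume weights can still be boosted. The function name, the default sample size, and the use of NumPy's random generator are assumptions for illustration, not course code.

```python
import numpy as np

def resample_by_weights(X, y, weights, n_samples=None, seed=0):
    """Draw examples with replacement according to the weight distribution,
    so an unweighted base learner can be trained on the resampled (X, y)
    instead of the weighted original set."""
    rng = np.random.default_rng(seed)
    n = len(y) if n_samples is None else n_samples
    p = weights / weights.sum()                    # normalize to a distribution
    idx = rng.choice(len(y), size=n, replace=True, p=p)
    return X[idx], y[idx]
```

In the AdaBoost loop sketched earlier, step 1 would then train an ordinary, unweighted learner (a hypothetical train_unweighted_learner) on resample_by_weights(X, y, w) instead of calling a weight-aware stump trainer.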

