Machine Learning 10-701/15-781, Spring 2008
Boosting from Weak Learners
Eric Xing
Lecture 10, February 18, 2008
Reading: Chap. 14.3, C.B. book

Rationale: Combination of methods
- There is no algorithm that is always the most accurate.
- We can select simple "weak" classification or regression methods and combine them into a single "strong" method.
- Different learners use different:
  - algorithms
  - hyperparameters
  - representations (modalities)
  - training sets
  - subproblems
- The problem: how to combine them.

Some early algorithms
- Boosting by filtering (Schapire, 1990): run the weak learner on differently filtered example sets, then combine the weak hypotheses; requires knowledge of the weak learner's performance.
- Boosting by majority (Freund, 1995): run the weak learner on a weighted example set, then combine the weak hypotheses linearly; requires knowledge of the weak learner's performance.
- Bagging (Breiman, 1996): run the weak learner on bootstrap replicates of the training set, then average the weak hypotheses; reduces variance.

Combination of classifiers
- Suppose we have a family of component classifiers (generating ±1 labels), such as decision stumps:
    h(x; θ) = sign(w x_k + b),  where θ = {k, w, b}.
- Each decision stump pays attention to only a single component of the input vector.

Combination of classifiers cont'd
- We would like to combine the simple classifiers additively, so that the final classifier is the sign of
    ĥ(x) = α_1 h(x; θ_1) + ... + α_m h(x; θ_m),
  where the "votes" {α_i} emphasize component classifiers that make more reliable predictions than others.
- Important issues:
  - What is the criterion that we are optimizing? (measure of loss)
  - We would like to estimate each new component classifier in the same manner. (modularity)

Measurement of error
- Loss function: λ(y, h(x)),  e.g. the 0/1 loss I(y ≠ h(x)).
- Generalization error: L(h) = E[λ(y, h(x))].
- Objective: find h with minimum generalization error.
- Main boosting idea: minimize the empirical error
    L̂(h) = (1/N) Σ_{i=1..N} λ(y_i, h(x_i)).

Exponential loss
- One possible measure of empirical loss is the average 0/1 loss above; another is the exponential loss
    L̂(ĥ_m) = Σ_{i=1..n} exp{ −y_i ĥ_m(x_i) }.
- Recalling that ĥ_m(x) = α_1 h(x; θ_1) + ... + α_m h(x; θ_m), the combined classifier based on m−1 iterations defines a weighted loss criterion for the next simple classifier to add:
    L̂(ĥ_m) = Σ_i exp{ −y_i ĥ_{m−1}(x_i) − α_m y_i h(x_i; θ_m) }
            = Σ_i W_i^{m−1} exp{ −α_m y_i h(x_i; θ_m) },
  where W_i^{m−1} = exp{ −y_i ĥ_{m−1}(x_i) }.
- Each training sample is thus weighted by its "classifiability" (or difficulty) as seen by the classifier we have built so far.

Linearization of the loss function
- We can simplify the estimation criterion for the new component classifier a bit (assuming α_m is small):
    exp{ −α_m y_i h(x_i; θ_m) } ≈ 1 − α_m y_i h(x_i; θ_m).
- Our empirical loss criterion then reduces to
    L̂(ĥ_m) ≈ Σ_i W_i^{m−1} − α_m Σ_i W_i^{m−1} y_i h(x_i; θ_m).
- We could choose the new component classifier to optimize this weighted agreement.

A possible algorithm
- At stage m we find θ* that maximizes (or at least gives a sufficiently high) weighted agreement
    Σ_i W_i^{m−1} y_i h(x_i; θ*).
- Each sample is weighted by its "difficulty" under the previously combined m−1 classifiers; more "difficult" samples receive heavier attention, as they dominate the total loss.
- Then we go back and find the "votes" α_m* associated with the new classifier by minimizing the original weighted (exponential) loss
    L̂ = Σ_i W_i^{m−1} exp{ −α_m y_i h(x_i; θ_m) }.

Boosting
- We have basically derived a boosting algorithm that sequentially adds new component classifiers, each trained on reweighted training examples; each component classifier is presented with a slightly different problem.
- AdaBoost preliminaries:
  - we work with normalized weights W_i on the training examples, initially uniform (W_i = 1/n);
  - the weights reflect the "degree of difficulty" of each datum under the classifier built so far.
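The decision-stump family introduced earlier, h(x; θ) = sign(w x_k + b) with θ = {k, w, b}, is simple enough to sketch directly. A minimal illustration in Python; the function and argument names are my own, not from the lecture:

```python
def stump_predict(x, k, w, b):
    """Decision stump: look only at coordinate k of the input vector x
    and return sign(w * x[k] + b) as a +/-1 label."""
    return 1 if w * x[k] + b >= 0 else -1

# A stump with w = 1, b = -2.5 thresholds coordinate 0 at 2.5
# and ignores every other coordinate:
print(stump_predict([3.0, 7.0], k=0, w=1.0, b=-2.5))   # -> 1
print(stump_predict([1.0, 7.0], k=0, w=1.0, b=-2.5))   # -> -1
```

Boosting asks nothing richer of a component classifier than this: a fixed-parameter predictor emitting ±1 labels.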
The AdaBoost algorithm
- At the k-th iteration we find (any) classifier h(x; θ_k*) for which the weighted classification error
    ε_k = ½ ( 1 − Σ_{i=1..n} W_i^{k−1} y_i h(x_i; θ_k*) )
  is better than chance.
- This is meant to be "easy" --- a weak classifier.
- Determine how many "votes" to assign to the new component classifier:
    α_k = ½ log( (1 − ε_k) / ε_k );
  a stronger classifier gets more votes.
- Update the weights on the training examples:
    W_i^k = W_i^{k−1} exp{ −α_k y_i h(x_i; θ_k) }
  (recall that W_i^{m−1} = exp{ −y_i ĥ_{m−1}(x_i) }).

The AdaBoost algorithm cont'd
- The final classifier after m boosting iterations is given by the sign of
    ĥ(x) = ( α_1 h(x; θ_1) + ... + α_m h(x; θ_m) ) / ( α_1 + ... + α_m );
  the votes here are normalized for convenience.

AdaBoost: summary
- Input:
  - N examples S_N = {(x_1, y_1), ..., (x_N, y_N)}
  - a weak base learner h = h(x, θ)
- Initialize: equal example weights w_i = 1/N for all i = 1..N.
- Iterate for t = 1...T:
  1. train the base learner on the weighted example set (w_t, x) and obtain hypothesis h_t = h(x, θ_t)
  2. compute the hypothesis error ε_t
  3. compute the hypothesis weight α_t
  4. update the example weights w_{t+1} for the next iteration
- Output: the final hypothesis, a linear combination of the h_t.

AdaBoost: dataflow diagram
[Figure: weight vectors w_1, w_2, ..., w_T feed successive calls A(w, S) to the base learner.]

Boosting: examples
[Figures: combined decision boundary and example weights over successive boosting iterations.]

Base learners
- Weak learners used in practice:
  - decision stumps (axis-parallel splits)
  - decision trees (e.g. C4.5; Quinlan, 1996)
  - multi-layer neural networks
  - radial basis function networks
- Can base learners operate on weighted examples?
  - In many cases they can be modified to accept weights along with the examples.
  - In general, we can sample the examples (with replacement) according to the distribution defined by the weights.

Boosting performance
- The error rate of the component classifiers (the decision stumps) does not improve much, if at all, over time.
- But both the training and the test error of the combined classifier improve over time!
- Even after the training error of the combined classifier goes to zero, boosting iterations can still improve the generalization error!
[Figure: training error and test error versus number of boosting rounds.]

Why is it working?
- You will need some learning theory (to be covered in the next two lectures) to understand this fully, but for now let's just go over some high-level ideas.
- Generalization error: with high probability, the generalization error is at most the training error plus a complexity term that grows with the number of rounds T (in the standard bound, on the order of sqrt(Td/N), where d is the VC dimension of the weak-learner class).
- As T goes up, this bound becomes worse, so boosting should overfit!
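Pulling the pieces together, the AdaBoost loop from the summary slide (train a stump on weighted examples, compute ε_t, set α_t = ½ log((1−ε_t)/ε_t), reweight, output a weighted vote) can be sketched end to end in pure Python. This is a toy illustration with brute-force stump training, not the lecture's code; all names are my own:

```python
import math

def train_stump(X, y, w):
    """Return (eps, k, thr, s): the axis-parallel stump
    h(x) = s if x[k] >= thr else -s with minimum weighted error eps."""
    best = None
    for k in range(len(X[0])):
        for thr in sorted({x[k] for x in X}):
            for s in (1, -1):
                preds = [s if x[k] >= thr else -s for x in X]
                eps = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or eps < best[0]:
                    best = (eps, k, thr, s)
    return best

def adaboost(X, y, T):
    """Run T boosting rounds; return a list of (alpha, k, thr, s) components."""
    n = len(X)
    w = [1.0 / n] * n                                # uniform initial weights
    H = []
    for _ in range(T):
        eps, k, thr, s = train_stump(X, y, w)
        eps = max(eps, 1e-12)                        # guard against log(0)
        alpha = 0.5 * math.log((1.0 - eps) / eps)    # stronger stump, more votes
        H.append((alpha, k, thr, s))
        # multiplicative update: mistakes get heavier, correct points lighter
        w = [wi * math.exp(-alpha * yi * (s if x[k] >= thr else -s))
             for wi, x, yi in zip(w, X, y)]
        Z = sum(w)
        w = [wi / Z for wi in w]                     # renormalize to a distribution
    return H

def predict(H, x):
    """Sign of the weighted vote of all component stumps."""
    score = sum(a * (s if x[k] >= thr else -s) for a, k, thr, s in H)
    return 1 if score >= 0 else -1

# Toy 1-D set that no single stump classifies perfectly:
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = [1, 1, -1, -1, 1, 1]
H = adaboost(X, y, T=3)
print([predict(H, x) for x in X] == y)   # -> True
```

On this toy set the best single stump still makes 2 of 6 mistakes, yet three rounds drive the training error of the combined vote to zero, matching the observation on the "Boosting performance" slide that the weak learners stay weak while the ensemble keeps improving.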