Machine Learning 10-701/15-781, Spring 2008

Boosting from Weak Learners

Eric Xing
Lecture 10, February 18, 2008
Reading: Chap. 14.3, C.B. book

Rationale: Combination of methods

- There is no single algorithm that is always the most accurate.
- We can select simple "weak" classification or regression methods and combine them into a single "strong" method.
- Different learners use different
  - algorithms
  - hyperparameters
  - representations (modalities)
  - training sets
  - subproblems
- The problem: how to combine them.

Some early algorithms

- Boosting by filtering (Schapire, 1990)
  - Run the weak learner on differently filtered example sets.
  - Combine the weak hypotheses.
  - Requires knowledge of the performance of the weak learner.
- Boosting by majority (Freund, 1995)
  - Run the weak learner on weighted example sets.
  - Combine the weak hypotheses linearly.
  - Requires knowledge of the performance of the weak learner.
- Bagging (Breiman, 1996)
  - Run the weak learner on bootstrap replicates of the training set.
  - Average the weak hypotheses.
  - Reduces variance.

Combination of classifiers

- Suppose we have a family of component classifiers generating ±1 labels, such as decision stumps:
  h(x; θ) = sign(w x_k − b),   where θ = {k, w, b}.
- Each decision stump pays attention to only a single component of the input vector.

Combination of classifiers (cont'd)

- We would like to combine the simple classifiers additively, so that the final classifier is the sign of
  ĥ_m(x) = α_1 h(x; θ_1) + ... + α_m h(x; θ_m),
  where the "votes" α_i emphasize component classifiers that make more reliable predictions than others.
- Important issues:
  - What is the criterion that we are optimizing? (measure of loss)
  - We would like to estimate each new component classifier in the same manner. (modularity)

Measurement of error

- Loss function: e.g., the 0/1 loss I(y ≠ h(x)) or the squared loss (y − h(x))².
- Generalization error: L(h) = E[ I(y ≠ h(x)) ].
- Objective: find h with minimum generalization error.
- Main boosting idea: minimize the empirical error
  L̂(h) = (1/N) Σ_{i=1}^N I(y_i ≠ h(x_i)).

Exponential loss

- Empirical loss: L̂(h) = (1/N) Σ_{i=1}^N I(y_i ≠ h(x_i)).
- Another possible measure of empirical loss is
  L̂(ĥ_m) = Σ_{i=1}^n exp(−y_i ĥ_m(x_i)).

Exponential loss (cont'd)

- Recall that ĥ_m(x) = α_1 h(x; θ_1) + ... + α_m h(x; θ_m) = ĥ_{m−1}(x) + α_m h(x; θ_m), so the exponential loss decomposes as
  L̂(ĥ_m) = Σ_{i=1}^n exp(−y_i ĥ_{m−1}(x_i) − y_i α_m h(x_i; θ_m))
          = Σ_{i=1}^n exp(−y_i ĥ_{m−1}(x_i)) · exp(−y_i α_m h(x_i; θ_m))
          = Σ_{i=1}^n W_i^(m−1) exp(−y_i α_m h(x_i; θ_m)),
  where W_i^(m−1) = exp(−y_i ĥ_{m−1}(x_i)).
- The combined classifier based on the first m−1 iterations thus defines a weighted loss criterion for the next simple classifier to add.
- Each training sample is weighted by its "classifiability" (or difficulty) as seen by the classifier we have built so far.

Linearization of loss function

- We can simplify the estimation criterion for the new component classifier a bit by assuming α_m is small, so that
  exp(−y_i α_m h(x_i; θ_m)) ≈ 1 − y_i α_m h(x_i; θ_m).
- Our empirical loss criterion then reduces to
  Σ_{i=1}^n exp(−y_i ĥ_m(x_i)) ≈ Σ_{i=1}^n W_i^(m−1) (1 − y_i α_m h(x_i; θ_m))
                               = Σ_{i=1}^n W_i^(m−1) − α_m Σ_{i=1}^n W_i^(m−1) y_i h(x_i; θ_m),
  where W_i^(m−1) = exp(−y_i ĥ_{m−1}(x_i)).
- We could therefore choose the new component classifier to optimize this weighted agreement.

A possible algorithm

- At stage m we find a θ_m that maximizes (or at least gives a sufficiently high) weighted agreement
  Σ_{i=1}^n W_i^(m−1) y_i h(x_i; θ_m).
  - Each sample is weighted by its "difficulty" under the previously combined m−1 classifiers.
  - More difficult samples receive heavier attention, as they dominate the total loss.
- Then we go back and find the "votes" α_m associated with the new classifier by minimizing the original weighted exponential loss
  L̂(ĥ_m) = Σ_{i=1}^n W_i^(m−1) exp(−y_i α_m h(x_i; θ_m))
          ≈ Σ_{i=1}^n W_i^(m−1) − α_m Σ_{i=1}^n W_i^(m−1) y_i h(x_i; θ_m).

Boosting

- We have basically derived a boosting algorithm that sequentially adds new component classifiers, each trained on reweighted training examples.
  - Each component classifier is presented with a slightly different problem.
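To make the derivation concrete, here is a minimal NumPy sketch of one such stage (illustrative code, not from the lecture; the names fit_stump, stump_predict, and boosting_round are invented here, and the stump search is deliberately brute-force). It computes the weights W_i^(m−1) from the current combined classifier, fits a decision stump by maximizing the weighted agreement, and then sets the vote α_m by minimizing the weighted exponential loss; for ±1 base classifiers that minimization has the closed form α_m = ½ log((1 − ε_m)/ε_m), with ε_m the normalized weighted error — the same formula AdaBoost uses below.

```python
import numpy as np

def fit_stump(X, y, w):
    """Pick theta = (k, b, s) for h(x; theta) = sign(s * (x[k] - b)) by
    brute-force maximization of the weighted agreement sum_i w_i y_i h(x_i)."""
    best_agreement, best_theta = -np.inf, None
    for k in range(X.shape[1]):
        for b in np.unique(X[:, k]):
            for s in (1.0, -1.0):
                pred = np.where(s * (X[:, k] - b) >= 0, 1.0, -1.0)
                agreement = np.sum(w * y * pred)
                if agreement > best_agreement:
                    best_agreement, best_theta = agreement, (k, b, s)
    return best_theta

def stump_predict(theta, X):
    k, b, s = theta
    return np.where(s * (X[:, k] - b) >= 0, 1.0, -1.0)

def boosting_round(X, y, prev_score):
    """One forward-stagewise step on the exponential loss.
    prev_score[i] holds the current combined score hhat_{m-1}(x_i)."""
    w = np.exp(-y * prev_score)          # W_i^(m-1) = exp(-y_i * hhat_{m-1}(x_i))
    w /= w.sum()                         # normalize (overall scale does not matter)
    theta = fit_stump(X, y, w)           # maximize the weighted agreement
    pred = stump_predict(theta, X)
    eps = np.clip(np.sum(w * (pred != y)), 1e-12, 1 - 1e-12)  # weighted error
    alpha = 0.5 * np.log((1 - eps) / eps)                     # exact exp-loss minimizer
    return theta, alpha, prev_score + alpha * pred
```

Starting from prev_score = np.zeros(len(y)) and calling boosting_round repeatedly gives exactly the sequential procedure described above.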
AdaBoost preliminaries

- We work with normalized weights W̃_i on the training examples, initially uniform (W̃_i = 1/n); these are the weights W_i^(m−1) = exp(−y_i ĥ_{m−1}(x_i)) from above, rescaled to sum to one.
- The weights reflect the degree of difficulty of each datum for the latest classifier.

The AdaBoost algorithm

- At the k-th iteration we find (any) classifier h(x; θ_k) for which the weighted classification error
  ε_k = 0.5 − 0.5 Σ_{i=1}^n W̃_i^(k−1) y_i h(x_i; θ_k)
  is better than chance (ε_k < 0.5).
  - This is meant to be easy: a weak classifier suffices.
- Determine how many "votes" to assign to the new component classifier:
  α_k = 0.5 log((1 − ε_k) / ε_k).
  - A stronger classifier gets more votes.
- Update the weights on the training examples:
  W̃_i^(k) ∝ W̃_i^(k−1) exp(−y_i α_k h(x_i; θ_k)).

The AdaBoost algorithm (cont'd)

- The final classifier after m boosting iterations is given by the sign of
  ĥ(x) = (α_1 h(x; θ_1) + ... + α_m h(x; θ_m)) / (α_1 + ... + α_m).
- The votes here are normalized for convenience.

AdaBoost summary

- Input:
  - N examples S_N = {(x_1, y_1), ..., (x_N, y_N)}
  - a weak base learner h = h(x; θ)
- Initialize equal example weights w_i = 1/N for all i = 1, ..., N.
- Iterate for t = 1, ..., T:
  1. Train the base learner on the weighted example set (w_t, x) and obtain hypothesis h_t = h(x; θ_t).
  2. Compute the hypothesis error ε_t.
  3. Compute the hypothesis weight α_t.
  4. Update the example weights for the next iteration, w_{t+1}.
- Output the final hypothesis as a linear combination of the h_t. (A runnable sketch of these steps appears at the end of this section.)

AdaBoost dataflow diagram

[Figure: dataflow diagram — at each round the weighted sample (w_t, S) is fed to the weak learner A, and the resulting hypotheses are combined into the final classifier.]

Boosting examples

[Figure slides: a boosting example and two continuation slides (illustrations only).]

Base Learners

- Weak learners used in practice:
  - decision stumps (axis-parallel splits)
  - decision trees (e.g., C4.5, Quinlan 1996)
  - multi-layer neural networks
  - radial basis function networks
- Can base learners operate on weighted examples?
  - In many cases they can be modified to accept weights along with the examples.
  - In general, we can sample the examples with replacement according to the distribution defined by the weights (see the note after the sketch at the end of this section).

Boosting performance

- The error rate of the component classifiers (the decision stumps) does not improve much, if at all, over time.
- But both the training and testing error of the combined classifier improve over time.
- Even after the training error of the combined classifier goes to zero, further boosting iterations can still improve the generalization error.

Why it is working

- You will need some learning theory …
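As a companion to the AdaBoost summary above, here is a compact sketch of the full training loop with decision stumps (again illustrative rather than the lecture's own code; it assumes the fit_stump and stump_predict helpers from the earlier sketch, and the names adaboost_fit and adaboost_predict are invented here).

```python
import numpy as np

# Assumes fit_stump(X, y, w) and stump_predict(theta, X) from the earlier
# sketch are in scope; any weight-aware weak learner could be substituted.

def adaboost_fit(X, y, T=50):
    """AdaBoost with decision stumps, following the summary above.
    X: (N, d) array of features; y: (N,) array of +/-1 labels."""
    N = X.shape[0]
    w = np.full(N, 1.0 / N)                       # 1) equal example weights
    stumps, alphas = [], []
    for t in range(T):
        theta = fit_stump(X, y, w)                # train on weighted examples
        pred = stump_predict(theta, X)
        eps = np.clip(np.sum(w * (pred != y)), 1e-12, 1 - 1e-12)  # 2) weighted error
        alpha = 0.5 * np.log((1 - eps) / eps)     # 3) hypothesis weight (votes)
        w *= np.exp(-alpha * y * pred)            # 4) reweight the examples
        w /= w.sum()                              #    and renormalize
        stumps.append(theta)
        alphas.append(alpha)
    return stumps, np.array(alphas)

def adaboost_predict(stumps, alphas, X):
    """Final hypothesis: sign of the weighted vote of the weak hypotheses."""
    votes = sum(a * stump_predict(theta, X) for a, theta in zip(alphas, stumps))
    return np.sign(votes)
```

If a base learner cannot accept weights directly, the resampling alternative mentioned in the Base Learners slide applies: at each round draw an index sample idx = np.random.choice(N, size=N, p=w) and train the learner on X[idx], y[idx] instead of passing the weights.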