Machine Learning 10-701/15-781, Spring 2008
Boosting from Weak Learners
Eric Xing
Lecture 9, February 13, 2008
Reading: Chap. 14.3, C.B. book

Rationale: Combination of methods
- There is no algorithm that is always the most accurate.
- We can select simple "weak" classification or regression methods and combine them into a single "strong" method.
- Different learners use different
  - algorithms
  - hyperparameters
  - representations (modalities)
  - training sets
  - subproblems
- The problem: how to combine them.

Some early algorithms
- Boosting by filtering (Schapire, 1990)
  - Run the weak learner on differently filtered example sets.
  - Combine the weak hypotheses.
  - Requires knowledge of the performance of the weak learner.
- Boosting by majority (Freund, 1995)
  - Run the weak learner on a weighted example set.
  - Combine the weak hypotheses linearly.
  - Requires knowledge of the performance of the weak learner.
- Bagging (Breiman, 1996)
  - Run the weak learner on bootstrap replicates of the training set.
  - Average the weak hypotheses.
  - Reduces variance.

Combination of classifiers
- Suppose we have a family of component classifiers generating $\pm 1$ labels, such as decision stumps
  $h(x;\theta) = \mathrm{sign}(w x_k + b)$, where $\theta = \{k, w, b\}$.
- Each decision stump pays attention to only a single component of the input vector.

Combination of classifiers (cont'd)
- We'd like to combine the simple classifiers additively, so that the final classifier is the sign of
  $\hat h(x) = \alpha_1 h(x;\theta_1) + \dots + \alpha_m h(x;\theta_m)$,
  where the "votes" $\alpha_i$ emphasize component classifiers that make more reliable predictions than others.
- Important issues:
  - What is the criterion that we are optimizing? (measure of loss)
  - We would like to estimate each new component classifier in the same manner (modularity).

Measurement of error
- Loss function, e.g. $\ell(y, h(x)) = I(y \neq h(x))$.
- Generalization error: $L(h) = E[\ell(y, h(x))]$.
- Objective: find $h$ with minimum generalization error.
- Main boosting idea: minimize the empirical error
  $\hat L(h) = \frac{1}{N} \sum_{n=1}^{N} \ell(y_n, h(x_n))$.

Exponential Loss
- One possible measure of empirical loss is
  $\sum_{i=1}^{n} \exp(-y_i \hat h_m(x_i))
   = \sum_{i=1}^{n} \exp\big(-y_i \hat h_{m-1}(x_i) - y_i \alpha_m h(x_i;\theta_m)\big)
   = \sum_{i=1}^{n} \exp(-y_i \hat h_{m-1}(x_i)) \exp(-y_i \alpha_m h(x_i;\theta_m))
   = \sum_{i=1}^{n} W_i^{m-1} \exp(-y_i \alpha_m h(x_i;\theta_m))$.
- The combined classifier based on $m-1$ iterations defines a weighted loss criterion for the next simple classifier to add.
- Each training sample is weighted by its "classifiability" (or difficulty) as seen by the classifier we have built so far.

Linearization of loss function
- We can simplify the estimation criterion for the new component classifier a bit (assuming $\alpha$ is small):
  $\exp(-y_i \alpha_m h(x_i;\theta_m)) \approx 1 - y_i \alpha_m h(x_i;\theta_m)$.
- Now our empirical loss criterion reduces to
  $\sum_{i=1}^{n} \exp(-y_i \hat h_m(x_i))
   \approx \sum_{i=1}^{n} W_i^{m-1} \big(1 - y_i \alpha_m h(x_i;\theta_m)\big)
   = \sum_{i=1}^{n} W_i^{m-1} - \alpha_m \sum_{i=1}^{n} W_i^{m-1} y_i h(x_i;\theta_m)$.
- We could choose a new component classifier to optimize this weighted agreement.

A possible algorithm
- At stage $m$ we find $\theta_m$ that maximizes (or at least gives a sufficiently high) weighted agreement
  $\sum_{i=1}^{n} W_i^{m-1} y_i h(x_i;\theta_m)$
  (a code sketch of this step appears below):
  - each sample is weighted by its "difficulty" under the previously combined $m-1$ classifiers;
  - more difficult samples receive heavier attention, as they dominate the total loss.
- Then we go back and find the "votes" $\alpha_m$ associated with the new classifier by minimizing the original weighted (exponential) loss
  $\sum_{i=1}^{n} W_i^{m-1} \exp(-y_i \alpha_m h(x_i;\theta_m))$.

Boosting
- We have basically derived a Boosting algorithm that sequentially adds new component classifiers, each trained on reweighted training examples:
  - each component classifier is presented with a slightly different problem.
- AdaBoost preliminaries:
  - we work with normalized weights $W_i$ on the training examples, initially uniform ($W_i = 1/n$);
  - the weights reflect the degree of difficulty of each datum on the latest classifier.
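As a concrete illustration of the stump family $h(x) = \mathrm{sign}(w x_k + b)$ and of picking a component classifier by maximizing the weighted agreement $\sum_i W_i y_i h(x_i)$, here is a minimal Python/numpy sketch. It is not from the lecture; the names fit_stump and stump_predict are invented for illustration, and the brute-force threshold search is just one simple way to realize the "A possible algorithm" step.

```python
import numpy as np

def fit_stump(X, y, W):
    """Fit a decision stump h(x) = sign(w * x_k + b) by maximizing the
    weighted agreement sum_i W_i * y_i * h(x_i) over a feature index k,
    an orientation w in {+1, -1}, and a threshold between data values."""
    n, d = X.shape
    best_agreement, best_params = -np.inf, (0, 1.0, 0.0)
    for k in range(d):
        vals = np.unique(X[:, k])
        # candidate thresholds: midpoints between consecutive feature values
        thresholds = (vals[:-1] + vals[1:]) / 2.0 if len(vals) > 1 else vals
        for t in thresholds:
            for w in (1.0, -1.0):
                pred = np.where(w * (X[:, k] - t) >= 0, 1.0, -1.0)
                agreement = np.sum(W * y * pred)
                if agreement > best_agreement:
                    best_agreement = agreement
                    best_params = (k, w, -w * t)   # b = -w * t
    return best_params                              # theta = (k, w, b)

def stump_predict(X, params):
    k, w, b = params
    return np.where(w * X[:, k] + b >= 0, 1.0, -1.0)
```

Each candidate split only looks at a single input component, which is exactly what makes the stump a weak (but cheap) learner.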
The AdaBoost algorithm
- At the k-th iteration we find (any) classifier $h(x;\theta_k)$ for which the weighted classification error
  $\epsilon_k = \frac{1}{2} - \frac{1}{2} \sum_{i=1}^{n} W_i^{k-1} y_i h(x_i;\theta_k)$
  is better than chance.
  - This is meant to be "easy": it is a weak classifier.
- Determine how many "votes" to assign to the new component classifier:
  $\alpha_k = 0.5 \log\big((1 - \epsilon_k)/\epsilon_k\big)$.
  - A stronger classifier gets more votes.
- Update the weights on the training examples:
  $W_i^{k} \propto W_i^{k-1} \exp(-y_i \alpha_k h(x_i;\theta_k))$.

The AdaBoost algorithm (cont'd)
- The final classifier after $m$ boosting iterations is given by the sign of
  $\hat h(x) = \dfrac{\alpha_1 h(x;\theta_1) + \dots + \alpha_m h(x;\theta_m)}{\alpha_1 + \dots + \alpha_m}$.
- The votes here are normalized for convenience.

AdaBoost summary (a code sketch of this loop appears at the end of the section)
- Input:
  - $N$ examples $S_N = \{(x_1, y_1), \dots, (x_N, y_N)\}$
  - a weak base learner $h = h(x;\theta)$
- Initialize: equal example weights $w_i = 1/N$ for all $i = 1, \dots, N$.
- Iterate for $t = 1, \dots, T$:
  1. train the base learner on the weighted example set $(w_t, x)$ and obtain hypothesis $h_t = h(x;\theta_t)$
  2. compute the hypothesis error $\epsilon_t$
  3. compute the hypothesis weight $\alpha_t$
  4. update the example weights $w_{t+1}$ for the next iteration
- Output: the final hypothesis is a linear combination of the $h_t$.

AdaBoost dataflow diagram
[Figure: at each round $t = 1, \dots, T$, the weighted sample $(S, w_t)$ is fed to the base learner A, producing the next hypothesis and the next weights.]

Boosting examples (cont'd over several slides)
[Figures: successive boosting rounds on a toy data set.]

Base Learners
- Weak learners used in practice:
  - decision stumps (axis-parallel splits)
  - decision trees (e.g. C4.5 by Quinlan, 1996)
  - multi-layer neural networks
  - radial basis function networks
- Can base learners operate on weighted examples?
  - In many cases they can be modified to accept weights along with the examples.
  - In general, we can sample the examples (with replacement) according to the distribution defined by the weights.

Boosting performance
- The error rate of the component classifier (the decision stump) does not improve much, if at all, over time.
- But both the training and the testing error of the combined classifier improve over time.
- Even after the training error of the combined classifier goes to zero, boosting iterations can still improve the generalization error.

Why it is working
- You will need some learning theory (to be covered in the next two lectures) to understand this fully, but for now let's just go over some high-level ideas.
- Generalization error: with high probability, the generalization error is less than the training error plus a complexity term that grows with the number of rounds. As $T$ goes up, our bound becomes worse, so boosting should overfit!

Experiments
[Figure: test error and training error curves, from "The Boosting Approach to Machine Learning" by Robert E. Schapire.]

Training Margins
- When a vote is taken, the more predictors agreeing, the more confident you are in your prediction.
- Margin for example $(x_i, y_i)$:
  $\mathrm{margin}(x_i, y_i) = y_i \hat h(x_i) = y_i \dfrac{\alpha_1 h(x_i;\theta_1) + \dots + \alpha_m h(x_i;\theta_m)}{\alpha_1 + \dots + \alpha_m}$ ...
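The AdaBoost update rules above translate almost line for line into code. The following is a minimal Python/numpy sketch, not the course's reference implementation; fit_weak and predict_weak are placeholders for any weak learner that accepts example weights (for instance the stump sketch earlier), and labels are assumed to be in {-1, +1}.

```python
import numpy as np

def adaboost(X, y, fit_weak, predict_weak, T=50):
    """Sequentially add weak classifiers, reweighting the examples each round.
    fit_weak(X, y, W) returns weak-learner parameters; predict_weak(X, params)
    returns labels in {-1, +1}.  Returns a list of (alpha_t, params_t)."""
    n = X.shape[0]
    W = np.full(n, 1.0 / n)                       # initially uniform weights
    ensemble = []
    for _ in range(T):
        params = fit_weak(X, y, W)
        pred = predict_weak(X, params)
        # weighted error: eps = 0.5 - 0.5 * sum_i W_i y_i h(x_i)
        eps = 0.5 - 0.5 * np.sum(W * y * pred)
        if eps >= 0.5:                            # no better than chance: stop
            break
        eps = max(eps, 1e-12)                     # guard the log below
        alpha = 0.5 * np.log((1.0 - eps) / eps)   # votes for this classifier
        ensemble.append((alpha, params))
        # reweight: W_i proportional to W_i * exp(-y_i alpha h(x_i)), renormalized
        W = W * np.exp(-y * alpha * pred)
        W = W / W.sum()
    return ensemble

def combined_score(X, ensemble, predict_weak):
    """Normalized weighted vote: its sign is the final classifier, and
    y * score gives the per-example margins."""
    total = sum(alpha for alpha, _ in ensemble)
    score = sum(alpha * predict_weak(X, params) for alpha, params in ensemble)
    return score / total
```

As a usage illustration (again under the naming assumptions above), one would call ensemble = adaboost(X, y, fit_stump, stump_predict, T=100), classify with np.sign(combined_score(X, ensemble, stump_predict)), and read off the training margins as y * combined_score(X, ensemble, stump_predict), matching the definition on the last slide.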