MSRI Workshop on Nonlinear Estimation and Classification, 2002

The Boosting Approach to Machine Learning: An Overview

Robert E. Schapire
AT&T Labs - Research
Shannon Laboratory
180 Park Avenue, Room A203
Florham Park, NJ 07932 USA
www.research.att.com/~schapire

December 19, 2001

Abstract

Boosting is a general method for improving the accuracy of any given learning algorithm. Focusing primarily on the AdaBoost algorithm, this chapter overviews some of the recent work on boosting, including analyses of AdaBoost's training error and generalization error, boosting's connection to game theory and linear programming, the relationship between boosting and logistic regression, extensions of AdaBoost for multiclass classification problems, methods of incorporating human knowledge into boosting, and experimental and applied work using boosting.

1 Introduction

Machine learning studies automatic techniques for learning to make accurate predictions based on past observations. For example, suppose that we would like to build an email filter that can distinguish spam (junk) email from non-spam. The machine-learning approach to this problem would be the following: Start by gathering as many examples as possible of both spam and non-spam emails. Next, feed these examples, together with labels indicating if they are spam or not, to your favorite machine-learning algorithm, which will automatically produce a classification or prediction rule. Given a new, unlabeled email, such a rule attempts to predict if it is spam or not. The goal, of course, is to generate a rule that makes the most accurate predictions possible on new test examples.

Building a highly accurate prediction rule is certainly a difficult task. On the other hand, it is not hard at all to come up with very rough rules of thumb that are only moderately accurate. An example of such a rule is something like the following: "If the phrase 'buy now' occurs in the email, then predict it is spam." Such a rule will not even come close to covering all spam messages; for instance, it really says nothing about what to predict if 'buy now' does not occur in the message. On the other hand, this rule will make predictions that are significantly better than random guessing.

Boosting, the machine-learning method that is the subject of this chapter, is based on the observation that finding many rough rules of thumb can be a lot easier than finding a single, highly accurate prediction rule. To apply the boosting approach, we start with a method or algorithm for finding the rough rules of thumb. The boosting algorithm calls this "weak" or "base" learning algorithm repeatedly, each time feeding it a different subset of the training examples (or, to be more precise, a different distribution or weighting over the training examples¹). Each time it is called, the base learning algorithm generates a new weak prediction rule, and after many rounds, the boosting algorithm must combine these weak rules into a single prediction rule that, hopefully, will be much more accurate than any one of the weak rules.

To make this approach work, there are two fundamental questions that must be answered: first, how should each distribution be chosen on each round, and second, how should the weak rules be combined into a single rule? Regarding the choice of distribution, the technique that we advocate is to place the most weight on the examples most often misclassified by the preceding weak rules; this has the effect of forcing the base learner to focus its attention on the "hardest" examples. As for combining the weak rules, simply taking a (weighted) majority vote of their predictions is natural and effective. There is also the question of what to use for the base learning algorithm, but this question we purposely leave unanswered so that we will end up with a general boosting procedure that can be combined with any base learning algorithm.

Boosting refers to a general and provably effective method of producing a very accurate prediction rule by combining rough and moderately inaccurate rules of thumb in a manner similar to that suggested above. This chapter presents an overview of some of the recent work on boosting, focusing especially on the AdaBoost algorithm, which has undergone intense theoretical study and empirical testing.

¹A distribution over training examples can be used to generate a subset of the training examples simply by sampling repeatedly from the distribution.
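To make footnote 1 concrete, the following is a minimal sketch (in Python with NumPy, using a made-up five-example distribution) of how a distribution over training examples yields a resampled subset: indices are drawn with replacement, each with probability equal to its weight.

    import numpy as np

    rng = np.random.default_rng(seed=0)

    # Hypothetical distribution over m = 5 training examples (weights sum to 1).
    D = np.array([0.4, 0.3, 0.1, 0.1, 0.1])
    m = len(D)

    # Draw m indices with replacement; index i is drawn with probability D(i),
    # so heavily weighted examples appear more often in the resampled subset.
    subset = rng.choice(m, size=m, replace=True, p=D)
    print(subset)

Passing the distribution to the base learner directly (reweighting) and sampling a subset as above (resampling) are the two standard ways of realizing this step.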
2 AdaBoost

Working in Valiant's PAC (probably approximately correct) learning model [75], Kearns and Valiant [41, 42] were the first to pose the question of whether a "weak" learning algorithm that performs just slightly better than random guessing can be "boosted" into an arbitrarily accurate "strong" learning algorithm. Schapire [66] came up with the first provable polynomial-time boosting algorithm in 1989. A year later, Freund [26] developed a much more efficient boosting algorithm which, although optimal in a certain sense, nevertheless suffered, like Schapire's algorithm, from certain practical drawbacks. The first experiments with these early boosting algorithms were carried out by Drucker, Schapire and Simard [22] on an OCR task.

Given: $(x_1, y_1), \ldots, (x_m, y_m)$ where $x_i \in X$, $y_i \in Y = \{-1, +1\}$.
Initialize $D_1(i) = 1/m$.
For $t = 1, \ldots, T$:
  - Train base learner using distribution $D_t$.
  - Get base classifier $h_t : X \to \mathbb{R}$.
  - Choose $\alpha_t \in \mathbb{R}$.
  - Update:
        $D_{t+1}(i) = \frac{D_t(i)\,\exp(-\alpha_t y_i h_t(x_i))}{Z_t}$
    where $Z_t$ is a normalization factor (chosen so that $D_{t+1}$ will be a distribution).
Output the final classifier:
    $H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right).$

Figure 1: The boosting algorithm AdaBoost.

The AdaBoost algorithm, introduced in 1995 by Freund and Schapire [32], solved many of the practical difficulties of the earlier boosting algorithms, and is the focus of this paper. Pseudocode for AdaBoost is given in Fig. 1 in the slightly generalized form given by Schapire and Singer [70]. The algorithm takes as input a training set $(x_1, y_1), \ldots, (x_m, y_m)$ where each $x_i$ belongs to some domain or instance space $X$, and each label $y_i$ is in some label set $Y$. For most of this paper, we assume $Y = \{-1, +1\}$; in Section 7, we discuss extensions to the multiclass case. AdaBoost calls a given weak or base learning algorithm repeatedly in a series of rounds $t = 1, \ldots, T$.

One of the main ideas of the algorithm is to maintain a distribution or set of weights over the training set. The weight of this distribution on training example $i$ on round $t$ is denoted $D_t(i)$. Initially, all weights are set equally, but on each round, the weights of incorrectly classified examples are increased so that the base learner is forced to focus on the hard examples in the training set. The base learner's job is then to find a base classifier $h_t$ appropriate for the distribution $D_t$.
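As a complement to Fig. 1, below is a minimal Python sketch of the algorithm, not the paper's own code. It assumes a hypothetical base-learner interface base_learn(X, y, D) that trains on the examples weighted by D and returns a classifier with outputs in {-1, +1}, and it fixes the standard choice $\alpha_t = \frac{1}{2}\ln\left(\frac{1-\epsilon_t}{\epsilon_t}\right)$, where $\epsilon_t$ is the weighted training error; the generalized form in Fig. 1 deliberately leaves $\alpha_t$ unspecified.

    import numpy as np

    def adaboost(X, y, base_learn, T):
        """Minimal sketch of AdaBoost as in Fig. 1 (binary labels -1/+1).

        base_learn(X, y, D) is an assumed interface: it trains on the
        examples weighted by the distribution D and returns a function h
        with h(X) giving predictions in {-1, +1}.
        """
        m = len(y)
        D = np.full(m, 1.0 / m)            # D_1(i) = 1/m
        classifiers, alphas = [], []
        for t in range(T):
            h = base_learn(X, y, D)        # train base learner using D_t
            pred = h(X)                    # base classifier's predictions
            eps = D[pred != y].sum()       # weighted training error eps_t
            if eps >= 0.5:                 # no better than random guessing: stop
                break
            # Standard choice of alpha_t; Fig. 1 leaves alpha_t unspecified.
            alpha = 0.5 * np.log((1.0 - eps) / max(eps, 1e-12))
            # Increase the weight of misclassified examples (y * h(x) = -1),
            # decrease it on correct ones, then renormalize (this is Z_t).
            D = D * np.exp(-alpha * y * pred)
            D = D / D.sum()
            classifiers.append(h)
            alphas.append(alpha)

        def H(X_new):
            # Final classifier: sign of the weighted vote of base classifiers.
            votes = sum(a * h(X_new) for a, h in zip(alphas, classifiers))
            return np.sign(votes)

        return H

With, say, decision stumps as the base learner, the returned H is the weighted-majority-vote classifier of Fig. 1.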