MSRI Workshop on Nonlinear Estimation and Classification, 2002.The Boosting Approach to Machine LearningAn OverviewRobert E. SchapireAT&T Labs;ResearchShannon Laboratory180 Park Avenue, Room A203Florham Park, NJ 07932 USAwww.research.att.com/schapireDecember 19, 2001AbstractBoosting is a general method for improving the accuracy of any givenlearning algorithm. Focusing primarily on the AdaBoost algorithm, thischapter overviews some of the recent work on boosting including analysesof AdaBoost’s training error and generalization error; boosting’s connectionto game theory and linear programming; the relationship between boostingand logistic regression; extensions of AdaBoost for multiclass classificationproblems; methods of incorporating human knowledge into boosting; andexperimental and applied work using boosting.1 IntroductionMachine learning studies automatic techniques for learning to make accurate pre-dictions based on past observations. For example, suppose that we would like tobuild an email filter that can distinguish spam (junk) email from non-spam. Themachine-learning approach to this problem would be the following: Start by gath-ering as many examples as posible of both spam and non-spam emails. Next, feedthese examples, together with labels indicating if they are spam or not, to yourfavorite machine-learning algorithm which will automatically produce a classifi-cation or prediction rule. Given a new, unlabeled email, such a rule attempts topredict if it is spam or not. The goal, of course, is to generate a rule that makes themost accurate predictions possible on new test examples.1Building a highly accurate prediction rule is certainly a difficult task. On theother hand, it is not hard at all to come up with very rough rules of thumb thatare only moderately accurate. An example of such a rule is something like thefollowing: “If the phrase ‘buy now’ occurs in the email, then predict it is spam.”Such a rule will not even come close to covering all spam messages; for instance,it really says nothing about what to predict if ‘buy now’ does not occur in themessage. On the other hand, this rule will make predictions that are significantlybetter than random guessing.Boosting, the machine-learning method that is the subject of this chapter, isbased on the observation that finding many rough rules of thumb can be a lot easierthan finding a single, highly accurate prediction rule. To apply the boosting ap-proach, we start with a method or algorithm for finding the rough rules of thumb.The boosting algorithm calls this “weak” or “base” learning algorithm repeatedly,each time feeding it a different subset of the training examples (or, to be more pre-cise, a different distribution or weighting over the training examples1). Each timeit is called, the base learning algorithm generates a new weak prediction rule, andafter many rounds, the boosting algorithm must combine these weak rules into asingle prediction rule that, hopefully, will be much more accurate than any one ofthe weak rules.To make this approach work, there are two fundamental questions that must beanswered: first, how should each distribution be chosen on each round, and second,how should the weak rules be combined into a single rule? Regarding the choiceof distribution, the technique that we advocate is to place the most weight on theexamples most often misclassified by the preceding weak rules; this has the effectof forcing the base learner to focus its attention on the “hardest” examples. Asfor combining the weak rules, simply taking a (weighted) majority vote of theirpredictions is natural and effective.There is also the question of what to use for the base learning algorithm, butthis question we purposely leave unanswered so that we will end up with a generalboosting procedure that can be combined with any base learning algorithm.Boosting refers to a general and provably effective method of producing a veryaccurate prediction rule by combining rough and moderately inaccurate rules ofthumb in a manner similar to that suggested above. This chapter presents anoverview of some of the recent work on boosting, focusing especially on the Ada-Boost algorithm which has undergone intense theoretical study and empirical test-ing.1A distribution over training examples can be used to generate a subset of the training examplessimply by sampling repeatedly from the distribution.2Given:(x1y1):::(xmym)wherexi2X,yi2Y=f;1+1gInitializeD1(i)=1=m.Fort=1:::T: Train base learner using distributionDt. Get base classifierht:X!R. Chooset2R. Update:Dt+1(i)=Dt(i)exp(;tyiht(xi))ZtwhereZtis a normalization factor (chosen so thatDt+1will be a distribu-tion).Output the final classifier:H(x)=signTXt=1tht(x)!:Figure 1: The boosting algorithm AdaBoost.2 AdaBoostWorking in Valiant’s PAC (probably approximately correct) learning model [75],Kearns and Valiant [41, 42] were the first to pose the question of whether a “weak”learning algorithm that performs just slightly better than random guessing can be“boosted” into an arbitrarily accurate “strong” learning algorithm. Schapire [66]came up with the first provable polynomial-time boosting algorithm in 1989. Ayear later, Freund [26] developed a much more efficient boosting algorithm which,although optimal in a certain sense, nevertheless suffered like Schapire’s algorithmfrom certain practical drawbacks. The first experiments with these early boostingalgorithms were carried out by Drucker, Schapire and Simard [22] on an OCR task.The AdaBoost algorithm, introduced in 1995 by Freund and Schapire [32],solved many of the practical difficulties of the earlier boosting algorithms, and isthe focus of this paper. Pseudocode for AdaBoost is given in Fig. 1 in the slightlygeneralized form given by Schapire and Singer [70]. The algorithm takes as inputa training set(x1y1):::(xmym)where eachxibelongs to some domain orinstance spaceX, and each labelyiis in some label setY. For most of this paper,we assumeY=f;1+1g; in Section 7, we discuss extensions to the multiclasscase. AdaBoost calls a given weak or base learning algorithm repeatedly in a series3of roundst= 1:::T. One of the main ideas of the algorithm is to maintain adistribution or set of weights over the training set. The weight of this distribution ontraining exampleion roundtis denotedDt(i). Initially, all weights are set equally,but on each round, the
View Full Document