# A New Boosting Algorithm Using Input-Dependent Regularizer


The Twentieth International Conference on Machine Learning (ICML-03), Washington, DC, August 21-24, 2003.

Rong Jin (rong@cs.cmu.edu), Yan Liu (yanliu@cs.cmu.edu), Luo Si (lsi@cs.cmu.edu), Jaime Carbonell (jgc@cs.cmu.edu), Alexander G. Hauptmann (alex@cs.cmu.edu)
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213-8213, USA

## Abstract

AdaBoost has proved to be an effective method for improving the performance of base classifiers, both theoretically and empirically. However, previous studies have shown that AdaBoost might suffer from overfitting, especially on noisy data. In addition, most current work on boosting assumes that the combination weights are fixed constants and therefore does not take particular input patterns into consideration. In this paper, we present a new boosting algorithm, "WeightBoost", which tries to solve these two problems by introducing an input-dependent regularization factor into the combination weight. Similarly to AdaBoost, we derive a learning procedure for WeightBoost, which is guaranteed to minimize training errors. Empirical studies on eight different UCI data sets and one text categorization data set show that WeightBoost almost always achieves considerably better classification accuracy than AdaBoost. Furthermore, experiments on data with artificially controlled noise indicate that WeightBoost is more robust to noise than AdaBoost.

## 1. Introduction

As a generally effective way to create a "strong" classifier out of a weak classifier, boosting has gained popularity recently. Boosting works by repeatedly running the weak classifier on various training examples sampled from the original training pool, and combining the base classifiers produced by the weak learner into a single composite classifier. AdaBoost has been theoretically proved and empirically shown to be an effective method for improving classification accuracy. Through its particular weight-updating procedure, AdaBoost is able
to focus on those data points that have been misclassified in previous training iterations, and thereby minimizes the training errors. Since AdaBoost is a greedy algorithm that intentionally focuses on the minimization of training errors, there have been many studies on the issue of overfitting for AdaBoost (Quinlan, 1996; Grove & Schuurmans, 1998; Ratsch et al., 1998). The general conclusion from early studies appears to be that, in practice, AdaBoost seldom overfits the training data; namely, even though the AdaBoost algorithm greedily minimizes the training errors via gradient descent, the test error usually goes down accordingly. Recent studies have implied that this phenomenon might be related to the fact that the greedy search procedure used in AdaBoost implicitly maximizes the classification margin (Onoda et al., 1998; Friedman et al., 1998). However, other studies (Opitz & Maclin, 1999; Jiang, 2000; Ratsch et al., 2000; Grove & Schuurmans, 1998; Dietterich, 2000) have shown that AdaBoost can overfit, particularly when the data are noisy. In fact, noise in the data can be introduced by two factors: mislabeled data, or the limitations of the hypothesis space of the base classifier. When the noise level is high, there can be data patterns that are difficult for the classifiers to capture. The boosting algorithm is then forced to focus on those noisy data patterns, which distorts the optimal decision boundary; as a result, the decision boundary is suitable only for those difficult data patterns and is not necessarily general enough for the remaining data.

The overfitting problem can also be analyzed from the viewpoint of the generalization error bound (Ratsch et al., 2000). As discussed in Ratsch et al. (2000), the margin maximized by AdaBoost is actually a "hard margin", namely the smallest margin of the noisy data patterns. As a consequence, the margin of the other data
points may decrease significantly when we maximize the hard margin, thus forcing the generalization error bound (Schapire, 1999) to increase. In order to deal with the overfitting problem in AdaBoost, several strategies have been proposed, such as smoothing (Schapire & Singer, 1998), GentleBoost (Friedman et al., 1998), BrownBoost (Freund, 2001), Weight Decay (Ratsch et al., 1998), and regularized AdaBoost (Ratsch et al., 2000). The main ideas of these methods fall into two groups: one is to change the cost function, for example by introducing regularization factors into it; the other is to introduce a soft margin. The overfitting problem of AdaBoost may be related to its exponential cost function, which makes the weights of the noisy data grow exponentially and leads to the overemphasis of those data patterns. The solution can be to introduce a different cost function, such as the logistic regression function in Friedman et al. (1998); to regularize the exponential cost function with a penalty term, such as the weight decay method used in Ratsch et al. (1998); or to introduce a different weighting function, as in BrownBoost (Freund, 2001). A more general solution is to replace the hard margin in AdaBoost with a soft margin. Similar to the strategy used in the support vector machine (SVM) algorithm (Cortes & Vapnik, 1995), a boosting algorithm with a soft margin is able to allow a larger margin at the expense of some misclassification errors. This idea leads to work such as the regularized boosting algorithms using linear programming and quadratic programming (Ratsch et al., 1999).

However, there is another problem with AdaBoost that has been overlooked in previous studies. The AdaBoost algorithm employs a linear combination strategy, that is, it combines different base classifiers with a set of constants. As illustrated in previous studies, one advantage of the ensemble approach over a single model is that an ensemble allows each sub-model to cover a different aspect of
the data set. By combining them, we are able to explain the whole data set thoroughly. Therefore, in order to take full advantage of each sub-model, a good combination strategy should be able to examine the input pattern and invoke only the sub-models that are appropriate for that pattern. For example, in the Hierarchical Mixtures of Experts model (Jordan & Jacobs, 1994), the sub-models are organized into a tree structure, with leaf nodes acting as experts and internal nodes as gates. Those internal nodes
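As background to the introduction's description of AdaBoost, its weight-updating procedure (misclassified points gain weight, and each base classifier receives a fixed combination constant) can be sketched as follows. This is a minimal illustrative implementation using decision stumps as the base learner; the function names `adaboost_train` and `adaboost_predict` are our own, not from the paper.

```python
import numpy as np

def adaboost_train(X, y, n_rounds=10):
    """Train AdaBoost with decision stumps; labels y must be in {-1, +1}."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)  # start with uniform weights over training points
    stumps, alphas = [], []
    for _ in range(n_rounds):
        best = None
        # exhaustive search for the stump with the lowest weighted error
        for j in range(d):
            for thr in np.unique(X[:, j]):
                for sign in (1, -1):
                    pred = sign * np.where(X[:, j] <= thr, 1, -1)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, thr, sign)
        err, j, thr, sign = best
        err = np.clip(err, 1e-12, 1 - 1e-12)
        # the combination weight is a fixed constant, independent of the input
        alpha = 0.5 * np.log((1 - err) / err)
        pred = sign * np.where(X[:, j] <= thr, 1, -1)
        # weight update: misclassified points gain weight, so later
        # rounds focus on the points misclassified earlier
        w *= np.exp(-alpha * y * pred)
        w /= w.sum()
        stumps.append((j, thr, sign))
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    """Linear combination of the base classifiers with constant weights."""
    f = sum(a * s * np.where(X[:, j] <= t, 1, -1)
            for (j, t, s), a in zip(stumps, alphas))
    return np.sign(f)
```

Note how `alpha` depends only on the stump's weighted training error, not on the test input; WeightBoost's point of departure is precisely to make this combination weight input-dependent.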
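The introduction's claim that the exponential cost makes the weights of noisy data grow exponentially, while a logistic cost does not, can be checked numerically. The sketch below is our own illustration (the variable names are ours): as a point's margin y*f(x) becomes more negative over boosting rounds, its unnormalized weight exp(-margin) grows without bound, whereas the logistic weight 1/(1 + exp(margin)) saturates at 1.

```python
import numpy as np

# Margins y*f(x) of a training point; a noisy, repeatedly
# misclassified point drifts toward a large negative margin.
margins = np.array([1.0, 0.0, -2.0, -5.0, -10.0])

# AdaBoost's exponential cost: the point's unnormalized weight
# grows exponentially as the margin falls.
exp_weight = np.exp(-margins)

# Logistic alternative (cf. Friedman et al., 1998): the weight
# saturates at 1, so noisy points cannot dominate the distribution.
logistic_weight = 1.0 / (1.0 + np.exp(margins))

for m, e, l in zip(margins, exp_weight, logistic_weight):
    print(f"margin {m:6.1f}: exp weight {e:12.3f}, logistic weight {l:.3f}")
```

A point at margin -10 carries an exponential weight four orders of magnitude above a correctly classified point, which is the overemphasis the text describes.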