Boosting for Learning Multiple Classes with Imbalanced Class Distribution

Yanmin Sun
Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, Ontario, Canada

Mohamed S. Kamel
Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, Ontario, Canada

Yang Wang
Pattern Discovery Software Systems Ltd., 554 Parkside Drive, Waterloo, Ontario, Canada

Abstract

Classification of data with an imbalanced class distribution significantly limits the performance attainable by most standard classifier learning algorithms, which assume a relatively balanced class distribution and equal misclassification costs. This learning difficulty has attracted considerable research interest, with most efforts concentrating on bi-class problems. However, the bi-class scenario is not the only one in which the class imbalance problem prevails, and reported solutions for bi-class applications are not applicable to multi-class problems. In this paper, we develop a cost-sensitive boosting algorithm to improve the classification performance on imbalanced data involving multiple classes. One barrier to applying cost-sensitive boosting to imbalanced data is that the cost matrix is often unavailable for a problem domain. To solve this problem, we apply a genetic algorithm to search for the optimum cost setup of each class. Empirical tests show that the proposed cost-sensitive boosting algorithm significantly improves the classification performance on imbalanced data sets.

1 Introduction

Classification is an important task of knowledge discovery in databases (KDD) and data mining. Recently, reports from both academia and industry indicate that an imbalanced class distribution in a data set poses a serious difficulty for most classifier learning algorithms, which assume a relatively balanced distribution [9, 12]. An imbalanced class distribution is characterized by some classes having many more instances than others.
With imbalanced data, classification rules that predict the small classes tend to be fewer and weaker than those that predict the prevalent classes; consequently, test samples belonging to the small classes are misclassified more often than those belonging to the prevalent classes. Standard classifiers usually perform poorly on imbalanced data sets because they are designed to generalize from the training data and output the simplest hypothesis that best fits the data, and the simplest hypothesis pays little attention to rare cases. However, in many applications, identifying rare objects is of crucial importance, and classification performance on the small classes is the main concern when assessing a classification model.

The difficulty that the class imbalance problem raises for both academic research and practical applications in the machine learning and data mining community has attracted considerable research interest. Reported works focus on three aspects of the class imbalance problem: 1) what are the proper evaluation measures of classification performance in the presence of class imbalance? 2) what is the nature of the class imbalance problem, i.e., in what domains do class imbalances most hinder the performance of a standard classifier [9]? and 3) what are the possible solutions for dealing with the class imbalance problem? With regard to the first aspect, accuracy is traditionally the most commonly used measure, both for assessing classification models and for guiding search algorithms. However, for a classification model induced from a data set with an imbalanced class distribution, accuracy is no longer a proper measure, since rare classes have much less impact on accuracy than prevalent classes [11]. Other evaluation measures, such as recall, precision, the F-measure, the G-mean, and Receiver Operating Characteristic (ROC) curve analysis, have therefore been explored and proposed as more suitable [1, 10, 16].
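As a concrete illustration (not from the paper), these measures can be computed directly from the cells of a binary confusion matrix; the function name and argument order below are our own:

```python
import math

def imbalance_metrics(tp, fp, fn, tn):
    """Evaluation measures that remain informative under class imbalance,
    computed from true/false positives and negatives."""
    recall = tp / (tp + fn)            # fraction of positive samples found
    precision = tp / (tp + fp)         # fraction of predicted positives that are correct
    specificity = tn / (tn + fp)       # recall on the negative class
    f_measure = 2 * precision * recall / (precision + recall)
    g_mean = math.sqrt(recall * specificity)  # geometric mean of per-class accuracies
    return {"recall": recall, "precision": precision,
            "f_measure": f_measure, "g_mean": g_mean}
```

The point of these measures is that a classifier which labels everything as the prevalent class can still score high accuracy, while its recall and G-mean on the same data collapse to zero.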
With respect to the second aspect, a thorough study can be found in [9]; other relevant works are reported in [10, 23, 24]. These studies show that an imbalanced class distribution is not the only factor that hinders classification performance: the training sample size, the separability of the classes, and the presence of sub-concepts within a class also deteriorate performance. The third aspect is the focus of most publications addressing the class imbalance problem, and almost all reported solutions are designed for the bi-class scenario.

Proceedings of the Sixth International Conference on Data Mining (ICDM'06), 0-7695-2701-9/06 $20.00 © 2006

In a bi-class application, the imbalance problem is observed when one class is represented by a large number of samples while the other is represented by only a few. The class with very few training samples, usually the one of high identification importance, is referred to as the positive class; the other is the negative class. The learning objective for this kind of data is to obtain satisfactory identification performance on the positive (small) class. Reported solutions for bi-class applications can be categorized as data-level and algorithm-level approaches [2]. At the data level, the objective is to re-balance the class distribution by re-sampling the data space: oversampling instances of the positive class, undersampling instances of the negative class, or sometimes a combination of the two techniques [2]. At the algorithm level, solutions adapt existing classifier learning algorithms to bias them toward the positive class, as in cost-sensitive learning [15] and recognition-based learning [8]. In addition to these solutions, another approach is boosting. Boosting algorithms change the underlying data distribution and apply a standard classifier learning algorithm to the revised data space iteratively.
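The two data-level techniques can be sketched as follows; this is a minimal illustration under our own assumptions (random, seed-controlled re-sampling; the positive class is the minority; names are ours, not from the paper):

```python
import random

def rebalance(data, positive_label, seed=0):
    """Random over-sampling of the positive (minority) class combined with
    random under-sampling of the negative class, so both reach equal size.
    `data` is a list of (sample, label) pairs."""
    rng = random.Random(seed)
    pos = [d for d in data if d[1] == positive_label]
    neg = [d for d in data if d[1] != positive_label]
    target = (len(pos) + len(neg)) // 2          # meet in the middle
    pos = pos + [rng.choice(pos) for _ in range(target - len(pos))]  # duplicate minority samples
    neg = rng.sample(neg, target)                 # drop majority samples
    balanced = pos + neg
    rng.shuffle(balanced)
    return balanced
```

Over-sampling risks overfitting the duplicated minority samples, while under-sampling discards potentially useful majority samples, which is why combinations of the two are often used.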
From this point of view, boosting approaches can be categorized as data-level solutions.

The AdaBoost (Adaptive Boosting) algorithm [5, 19] is reported to be an effective boosting algorithm for improving the classification accuracy of any "weak" learning algorithm. It weights each sample to reflect its importance, placing the most weight on the examples most often misclassified by the preceding classifiers. This forces the subsequent learning to concentrate on the samples that are hard to classify correctly. When the AdaBoost algorithm is adapted to tackle the class imbalance
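The re-weighting scheme just described can be sketched as a single round of binary AdaBoost; this is our own minimal illustration (it assumes the weights sum to 1 and the weak hypothesis has weighted error strictly between 0 and 0.5), not the paper's cost-sensitive variant:

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One binary AdaBoost re-weighting round: compute the weighted error of
    the current weak hypothesis, then increase the weight of misclassified
    samples, decrease the weight of correct ones, and renormalize."""
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    alpha = 0.5 * math.log((1 - err) / err)   # weight of this hypothesis in the ensemble
    new = [w * math.exp(alpha if t != p else -alpha)
           for w, t, p in zip(weights, y_true, y_pred)]
    z = sum(new)                               # normalization factor
    return [w / z for w in new], alpha
```

After the update, the misclassified samples carry half of the total weight, which is exactly what forces the next weak learner to focus on them. A cost-sensitive variant, of the kind the paper develops, additionally scales each sample's update by a class-dependent cost.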

