Intelligent Data Analysis 1 (1997) 131–156                                www.elsevier.com/locate/ida

Feature Selection for Classification

M. Dash 1, H. Liu 2
Department of Information Systems & Computer Science, National University of Singapore, Singapore 119260
1 E-mail: [email protected]
2 E-mail: [email protected]

Received 24 January 1997; revised 3 March 1997; accepted 21 March 1997

Abstract

Feature selection has been the focus of interest for quite some time, and much work has been done. With the creation of huge databases and the consequent requirements for good machine learning techniques, new problems arise and novel approaches to feature selection are in demand. This survey is a comprehensive overview of many existing methods from the 1970s to the present. It identifies four steps of a typical feature selection method, categorizes the different existing methods in terms of generation procedures and evaluation functions, and reveals hitherto unattempted combinations of generation procedures and evaluation functions. Representative methods are chosen from each category for detailed explanation and discussion via example. Benchmark datasets with different characteristics are used for comparative study. The strengths and weaknesses of different methods are explained. Guidelines for applying feature selection methods are given based on data types and domain characteristics. This survey identifies the future research areas in feature selection, introduces newcomers to this field, and paves the way for practitioners who search for suitable methods for solving domain-specific real-world applications. (Intelligent Data Analysis, Vol. 1, no. 3, http://www.elsevier.com/locate/ida) © 1997 Elsevier Science B.V. All rights reserved.

Keywords: Feature selection; Classification; Framework

1. Introduction

The majority of real-world classification problems require supervised learning, where the underlying class probabilities and class-conditional probabilities are unknown and each instance is associated with a class label. In real-world situations, relevant features are often unknown a priori. Therefore, many candidate features are introduced to better represent the domain. Unfortunately, many of these are either partially or completely irrelevant/redundant to the target concept. A relevant feature is neither irrelevant nor redundant to the target concept; an irrelevant feature does not affect the target concept in any way, and a redundant feature does not add anything new to the target concept [21]. In many applications, the size of a dataset is so large that learning might not work as well before these unwanted features are removed. Reducing the number of irrelevant/redundant features drastically reduces the running time of a learning algorithm and yields a more general concept. This helps in getting a better insight into the underlying concept of a real-world classification problem [23,24]. Feature selection methods try to pick a subset of features that are relevant to the target concept.

Feature selection is defined by many authors by looking at it from various angles. But, as expected, many of those definitions are similar in intuition and/or content. The following list covers those that are conceptually different and span a range of definitions.

1. Idealized: find the minimally sized feature subset that is necessary and sufficient to the target concept [22].
2. Classical: select a subset of M features from a set of N features, M < N, such that the value of a criterion function is optimized over all subsets of size M [34] (see the sketch after this list).
3. Improving prediction accuracy: the aim of feature selection is to choose a subset of features for improving prediction accuracy or decreasing the size of the structure without significantly decreasing prediction accuracy of the classifier built using only the selected features [24].
4. Approximating original class distribution: the goal of feature selection is to select a small subset such that the resulting class distribution, given only the values for the selected features, is as close as possible to the original class distribution given all feature values [24].
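The classical definition is the easiest to make concrete. As a minimal sketch (not taken from the paper): given an abstract criterion function that scores a feature subset, the classical formulation amounts to enumerating all C(N, M) subsets of size M and keeping the one with the best score. The criterion in the usage comment is a hypothetical stand-in; the paper leaves it abstract.

```python
from itertools import combinations

def best_subset_of_size_m(features, criterion, m):
    # Classical formulation: enumerate every subset of size m and return
    # the one that maximizes the (abstract) criterion function.
    return max(combinations(features, m), key=criterion)

# Usage with a toy, hypothetical criterion (sum of per-feature scores):
#   scores = {0: 0.1, 1: 0.9, 2: 0.4, 3: 0.7}
#   best = best_subset_of_size_m(scores, lambda s: sum(scores[f] for f in s), 2)
```

Note that this enumerates C(N, M) candidates, which is exactly why the exhaustive approach discussed next becomes impractical as N grows.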
Notice that the third definition emphasizes the prediction accuracy of a classifier built using only the selected features, whereas the last definition emphasizes the class distribution given the training dataset. These two are quite different conceptually. Hence, our definition considers both factors. Feature selection attempts to select the minimally sized subset of features according to the following criteria:

1. the classification accuracy does not significantly decrease; and
2. the resulting class distribution, given only the values for the selected features, is as close as possible to the original class distribution, given all features.

Ideally, feature selection methods search through the subsets of features and try to find the best one among the 2^N competing candidate subsets according to some evaluation function. However, this procedure is exhaustive, as it tries to find only the best subset. It may be too costly and practically prohibitive, even for a medium-sized feature set (N). Other methods, based on heuristic or random search, attempt to reduce computational complexity by compromising performance. These methods need a stopping criterion to prevent an exhaustive search of subsets. In our opinion, there are four basic steps in a typical feature selection method (see Figure 1):

1. a generation procedure to generate the next candidate subset;
2. an evaluation function to evaluate the subset under examination;
3. a stopping criterion to decide when to stop; and
4. a validation procedure to check whether the subset is valid.

The generation procedure is a search procedure [46,26]. Basically, it generates subsets of features for evaluation. The generation procedure can start: (i) with no features, (ii) with all features, or (iii) with a random subset of features. In the first two cases, features are iteratively added or removed, whereas in the last case, features are either iteratively added or removed or produced randomly thereafter [26]. An evaluation function measures the goodness of a subset produced by some generation procedure, and this value is compared with the previous best. If it is found to be better, then it replaces the previous best subset.
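To make the four steps concrete, the following is a minimal sketch (not taken from the paper) of one common instantiation, sequential forward selection: the generation procedure starts with no features and adds one per iteration; an abstract `evaluate` function plays the role of the evaluation function; and a simple improvement threshold serves as the stopping criterion. The names `evaluate` and `min_gain` are illustrative assumptions, and validation is left to the caller.

```python
def sequential_forward_selection(all_features, evaluate, min_gain=1e-4):
    """Illustrative sketch of the four-step loop (not the paper's code).

    Generation: start with no features; add one feature per iteration.
    Evaluation: `evaluate` scores a candidate feature subset (abstract).
    Stopping criterion: stop when the best addition improves the current
    best score by less than `min_gain` (a hypothetical threshold).
    Validation: left to the caller, e.g. testing the chosen subset on
    held-out data.
    """
    selected = []
    best_score = float("-inf")
    remaining = list(all_features)
    while remaining:
        # Generation procedure: candidate subsets formed by adding one
        # remaining feature to the currently selected subset.
        scored = [(evaluate(selected + [f]), f) for f in remaining]
        score, feature = max(scored, key=lambda pair: pair[0])
        # Stopping criterion: quit when no candidate improves the best
        # score found so far by at least min_gain.
        if score <= best_score + min_gain:
            break
        selected.append(feature)
        remaining.remove(feature)
        best_score = score
    return selected, best_score
```

In practice, `evaluate` could be, for example, the cross-validated accuracy of a classifier trained on the candidate subset; starting instead with all features and removing one per iteration would give the backward counterpart of the same loop.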

