
Toward Integrating Feature Selection Algorithms for Classification and Clustering


Huan Liu and Lei Yu
Department of Computer Science and Engineering
Arizona State University
Tempe, AZ 85287-8809
{hliu,leiyu}@asu.edu

Abstract

This paper introduces concepts and algorithms of feature selection, surveys existing feature selection algorithms for classification and clustering, groups and compares different algorithms with a categorizing framework based on search strategies, evaluation criteria, and data mining tasks, reveals unattempted combinations, and provides guidelines for selecting feature selection algorithms. With the categorizing framework, we continue our efforts toward building an integrated system for intelligent feature selection. A unifying platform is proposed as an intermediate step. An illustrative example is presented to show how existing feature selection algorithms can be integrated into a meta algorithm that can take advantage of individual algorithms. An added advantage of doing so is to help a user employ a suitable algorithm without knowing the details of each algorithm. Some real-world applications are included to demonstrate the use of feature selection in data mining. We conclude this work by identifying trends and challenges of feature selection research and development.

Keywords: Feature Selection, Classification, Clustering, Categorizing Framework, Unifying Platform, Real-World Applications

1 Introduction

As computer and database technologies advance rapidly, data accumulates at a speed unmatchable by humans' capacity for data processing. Data mining [1, 29, 35, 36], a multidisciplinary joint effort from databases, machine learning, and statistics, champions the effort to turn mountains of data into nuggets. Researchers and practitioners realize that, in order to use data mining tools effectively, data preprocessing is essential to successful data mining [53, 74].
Feature selection is one of the important and frequently used techniques in data preprocessing for data mining [6, 52]. It reduces the number of features, removes irrelevant, redundant, or noisy data, and brings immediate benefits to applications: it speeds up a data mining algorithm and improves mining performance such as predictive accuracy and result comprehensibility. Feature selection has been a fertile field of research and development since the 1970s in statistical pattern recognition [5, 40, 63, 81, 90], machine learning [6, 41, 43, 44], and data mining [17, 18, 42], and has been widely applied to fields such as text categorization [50, 70, 94], image retrieval [77, 86], customer relationship management [69], intrusion detection [49], and genomic analysis [91].

Feature selection is a process that selects a subset of the original features. The optimality of a feature subset is measured by an evaluation criterion. As the dimensionality of a domain expands, the number of features N increases. Finding an optimal feature subset is usually intractable [44], and many problems related to feature selection have been shown to be NP-hard [7]. A typical feature selection process consists of four basic steps (shown in Figure 1): subset generation, subset evaluation, stopping criterion, and result validation [18]. Subset generation is a search procedure [48, 53] that produces candidate feature subsets for evaluation based on a certain search strategy. Each candidate subset is evaluated and compared with the previous best one according to a certain evaluation criterion. If the new subset turns out to be better, it replaces the previous best subset. Subset generation and evaluation are repeated until a given stopping criterion is satisfied. The selected best subset then usually needs to be validated by prior knowledge or by different tests on synthetic and/or real-world data sets.
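The four-step process just described can be sketched as a generic greedy search loop. The sketch below is illustrative only (not code from this paper); the evaluation criterion is a toy relevance score standing in for a real filter or wrapper criterion, and result validation is left to the caller.

```python
def forward_selection(features, evaluate):
    """Greedy sketch of the four-step feature selection process.

    Subset generation: extend the current subset by one candidate feature.
    Subset evaluation: score each candidate subset with `evaluate`.
    Stopping criterion: stop when no extension improves the best score.
    Result validation is left to the caller (e.g. tests on held-out data).
    """
    selected, best_score = [], float("-inf")
    remaining = list(features)
    while remaining:
        # Generate and evaluate candidate subsets (current subset + one feature).
        score, feat = max((evaluate(selected + [f]), f) for f in remaining)
        if score <= best_score:   # stopping criterion: no improvement
            break
        selected.append(feat)
        remaining.remove(feat)
        best_score = score
    return selected, best_score


# Toy evaluation criterion (hypothetical): reward features from a known-relevant
# set, with a small penalty per feature to favor compact subsets.
relevant = {2, 5, 7}
score = lambda subset: len(set(subset) & relevant) - 0.1 * len(subset)

subset, _ = forward_selection(range(10), score)
print(sorted(subset))   # the three relevant features are recovered: [2, 5, 7]
```

Swapping in a different `evaluate` function (a data-characteristics measure or a classifier's accuracy) changes the model of feature selection without changing the search loop.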
[Figure 1: Four key steps of feature selection — subset generation, subset evaluation (goodness of subset), stopping criterion, result validation]

Feature selection can be found in many areas of data mining such as classification, clustering, association rules, and regression. For example, feature selection is called subset or variable selection in statistics [62]. A number of approaches to variable selection and coefficient shrinkage for regression are summarized in [37]. In this survey, we focus on feature selection algorithms for classification and clustering. Early research efforts mainly focused on feature selection for classification with labeled data [18, 25, 81] (supervised feature selection), where class information is available. Recent developments, however, show that the above general procedure can be well adapted to feature selection for clustering with unlabeled data [19, 22, 27, 87] (unsupervised feature selection).

Feature selection algorithms designed with different evaluation criteria broadly fall into three categories: the filter model [17, 34, 59, 95], the wrapper model [13, 27, 42, 44], and the hybrid model [15, 68, 91]. The filter model relies on general characteristics of the data to evaluate and select feature subsets without involving any mining algorithm. The wrapper model requires one predetermined mining algorithm and uses its performance as the evaluation criterion. It searches for features better suited to the mining algorithm, aiming to improve mining performance, but it also tends to be more computationally expensive than the filter model [44, 48]. The hybrid model attempts to take advantage of the two models by exploiting their different evaluation criteria in different search stages.

This survey attempts to review the field of feature selection based on earlier work by Doak [25], Dash and Liu [18], and Blum and Langley [6]. The fast development of the field has produced many new feature selection methods.
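As an illustrative contrast between the filter and wrapper models (my own sketch under simplifying assumptions, not the authors' code): a filter criterion scores each feature from data characteristics alone — here, absolute Pearson correlation with the class label — while a wrapper would instead score subsets by a specific learner's measured performance.

```python
def pearson(x, y):
    """Sample Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def filter_rank(X, y):
    """Filter model: rank feature indices by |correlation| with the label.

    No mining algorithm is involved. A wrapper model would instead rank
    feature subsets by, e.g., a classifier's cross-validated accuracy,
    which is costlier but tailored to that classifier.
    """
    scores = [abs(pearson(col, y)) for col in zip(*X)]
    return sorted(range(len(scores)), key=lambda j: -scores[j])

# Toy data: column 0 tracks the label perfectly, column 1 is constant
# (uninformative), column 2 is noisily related to the label.
X = [[1, 5, 0], [2, 5, 3], [3, 5, 1], [4, 5, 2]]
y = [1, 2, 3, 4]
print(filter_rank(X, y))   # ranks columns by relevance: [0, 2, 1]
```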
Novel research problems and applications emerge, and new demands for feature selection appear. In order to review the field and work toward the next generation of feature selection methods, we aim to achieve the following objectives in this survey:

• introduce the basic notions, concepts, and procedures of feature selection,
• describe the state-of-the-art feature selection techniques,
• identify existing problems of feature selection and propose ways of solving them,
• demonstrate feature selection in real-world applications, and
• point out current trends and future directions.

This survey presents a collection of existing feature selection algorithms, and proposes a categorizing framework that

