Pitt CS 2750 - Dimensionality reduction, feature selection


CS 2750 Machine Learning, Lecture 20
Dimensionality reduction. Feature selection.
Milos Hauskrecht, [email protected], Sennott Square

Dimensionality reduction. Motivation.
• Classification problem example:
– We have input data {x_1, x_2, .., x_N} such that each point x_i = (x_{i,1}, x_{i,2}, .., x_{i,d}), and a set of corresponding output labels {y_1, y_2, .., y_N}
– Assume the dimension d of the data points x is very large
– We want to classify x
• Problems with high-dimensional input vectors:
– a large number of parameters to learn; if the dataset is small this can result in:
• large variance of the estimates
• overfitting
– irrelevant attributes (near duplicates, poor predictors)

Dimensionality reduction.
• Solutions:
– Selection of a smaller subset of inputs (features) from the large set of inputs; train the classifier on the reduced input set
– Combination of the high-dimensional inputs into a smaller set of features φ_k(x); train the classifier on the new features
[Diagram: the input vector x is reduced either by selection of a subset of its components or by combination of its components into new features φ_k(x).]

Dimensionality reduction.
How to find the right subset of inputs/features?
• We need:
– a criterion for ranking good inputs/features
– a search procedure for finding a good set of features
• The feature selection process can be:
– Dependent on the original learning task
• e.g. classification or regression
• the selection of features is affected by what we want to predict
– Independent of the learning task
• e.g. looks only at the inputs of a classification problem and tries to reduce their description without regard to the output
• PCA, component analysis, clustering of inputs
• may lack accuracy

Task-dependent feature selection
Assume:
• Classification problem: x - input vector, y - output
• Feature mappings φ = {φ_1(x), φ_2(x), .., φ_K(x)}
Objective: find a subset of features that gives/preserves most of the output prediction capability.
Selection approaches:
• Filtering approaches
– filter out features with small potential to predict the outputs well
– use univariate analysis, done before classification
• Wrapper approaches
– select features that directly optimize the accuracy of the classifier

Feature selection through filtering
• Assume:
– Classification problem: x - input vector, y - output
– Inputs x_k in x, or feature mappings φ_k(x)
• How to select the features:
– Use univariate analysis
• pretend that only one variable, x_k or φ_k(x), exists
• see how well it predicts the output y alone
– Differentially expressed features (or inputs)
• good separation in binary settings (2 classes)

Feature selection through filtering
Differentially expressed features
• Criteria for measuring the differential expression:
– T-test score (Baldi & Long)
• based on a test of whether the two groups come from the same population
– Fisher score:
Fisher(i) = (μ_i^+ − μ_i^−) / (σ_i^+ + σ_i^−)
– Area under the Receiver Operating Characteristic curve (AUC) score
Problem:
– with many random (uninformative) features, some features with a good differential-expression score arise purely by chance
– techniques for reducing the FDR (false discovery rate) address this
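Below is a minimal numpy sketch of the Fisher-score filter just described, assuming binary labels coded as +1/−1 and features stored as columns of a matrix X; the function names and the random demo data are illustrative choices of this sketch, not part of the lecture. Ranking uses the absolute value of the score, since the sign only indicates which class has the larger mean.

import numpy as np

def fisher_scores(X, y):
    """Per-feature Fisher score (mu+ - mu-) / (sigma+ + sigma-) for binary labels y in {+1, -1}."""
    pos, neg = X[y == 1], X[y == -1]
    mu_p, mu_n = pos.mean(axis=0), neg.mean(axis=0)
    sd_p, sd_n = pos.std(axis=0), neg.std(axis=0)
    return (mu_p - mu_n) / (sd_p + sd_n + 1e-12)   # small constant guards against division by zero

def select_top_features(X, y, k):
    """Univariate filtering: keep the k columns with the largest |Fisher score|."""
    scores = np.abs(fisher_scores(X, y))
    return np.argsort(scores)[::-1][:k]

# Illustrative use on random data where only the first two columns carry signal.
rng = np.random.default_rng(0)
y = rng.choice([-1, 1], size=200)
X = rng.normal(size=(200, 50))
X[:, 0] += 1.5 * y          # informative feature
X[:, 1] -= 1.0 * y          # informative feature
print(select_top_features(X, y, 5))   # indices 0 and 1 should appear near the top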
Mutual information filtering
• Mutual information
– the output of a feature function φ_k(x) behaves like a random variable
– φ_k: a random variable representing the output of feature function k
• Using ML or MAP parameter estimates we can estimate the probabilities
P~(φ_k | y = i) and P~(y = i)
and subsequently compute
P~(φ_k, y = i) = P~(φ_k | y = i) P~(y = i)
P~(φ_k) = Σ_i P~(φ_k, y = i)

Selection based on mutual information
• Objective:
– we want to pick only features that provide substantial information about y
• Mutual information measures this dependency:
I(φ_k, y) = Σ_j Σ_i P~(φ_k = j, y = i) log_2 [ P~(φ_k = j, y = i) / ( P~(φ_k = j) P~(y = i) ) ]
• Filtering method: pick the features whose mutual information with y exceeds some threshold (a counting-based sketch follows these notes)
– if φ_k and y are independent random variables, then
P~(φ_k = j, y = i) / ( P~(φ_k = j) P~(y = i) ) = 1
and the mutual information is 0

Selection based on mutual information
• Other similar scores: correlation coefficients
ρ(φ_k, y) = Cov(φ_k, y) / sqrt( Var(φ_k) Var(y) )
– measure only linear dependences
• What are the drawbacks?
• Assumptions behind these univariate scores:
– only one feature and its effect on y is incorporated in the score
– the effects of different features on y are treated as independent
• What to do if a combination of features gives the best prediction?

Feature selection through filtering
Filtering with dependent features
• Let φ be the current set of features (starting from the complete set)
• We can remove a feature φ_k(x) from it when
P~(y | φ \ φ_k) ≈ P~(y | φ)   for all values of y
• Repeat the removals until the probabilities differ too much.
Problem: how to compute/estimate P~(y | φ \ φ_k) and P~(y | φ)?
Solution: make a simplifying assumption about the underlying probabilistic model
• Example: use a Naïve Bayes model
• Advantage: speed, modularity, applied before classification
• Disadvantage: may not be as accurate

Feature selection using classification errors
Wrapper approach:
• the feature selection is driven by the prediction accuracy of the classifier (or regressor) actually used
How to find the appropriate feature set?
• Idea: greedy search in the space of classifiers
– gradually add the features that most improve the quality score
– the score should reflect the accuracy of the classifier (its error) and also prevent overfitting
• Two ways to control overfitting:
– regularization: penalize explicitly for each feature parameter
– cross-validation (m-fold cross-validation)

Classifier-dependent feature selection
• Example of a greedy search: a logistic regression model with features φ_i(x)
– Start with p(y = 1 | x, w) = g(w_0)
– Choose the feature φ_i(x) with the best score: p(y = 1 | x, w) = g(w_0 + w_i φ_i(x))
– Choose the feature φ_j(x) with the best score: p(y = 1 | x, w) = g(w_0 + w_i φ_i(x) + w_j φ_j(x))
– Etc.
– When to stop? (a forward-selection sketch follows these notes)

Cross-validation
• Goal: stop the learning at the smallest generalization error (performance on the population from which the data were drawn)
• A test set can be used to estimate the generalization error
– data different from the training set
• Validation set = a test set used to stop the learning process
– e.g. the feature selection process
• Cross-validation (m-fold):
– divide the data into m groups; train on m − 1 of them, estimate the error on the held-out group, rotate the held-out group, and average the m error estimates (a plain-numpy sketch of this splitting follows these notes)
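A counting-based sketch of the mutual-information filter above, assuming the feature values and the labels are discrete (e.g. binarized inputs); the names mutual_information and filter_by_mi and the thresholding interface are illustrative choices of this sketch. Probabilities are the relative-frequency (ML) estimates the slides refer to.

import numpy as np

def mutual_information(phi, y):
    """I(phi, y) = sum_j sum_i P~(phi=j, y=i) log2( P~(phi=j, y=i) / (P~(phi=j) P~(y=i)) ),
    with all probabilities estimated as relative frequencies (ML estimates)."""
    score = 0.0
    for j in np.unique(phi):
        p_phi = np.mean(phi == j)
        for i in np.unique(y):
            p_y = np.mean(y == i)
            p_joint = np.mean((phi == j) & (y == i))
            if p_joint > 0:                      # terms with zero joint probability contribute 0
                score += p_joint * np.log2(p_joint / (p_phi * p_y))
    return score

def filter_by_mi(X_discrete, y, threshold):
    """Univariate filter: keep the columns whose mutual information with y exceeds the threshold."""
    scores = np.array([mutual_information(X_discrete[:, k], y)
                       for k in range(X_discrete.shape[1])])
    return np.where(scores > threshold)[0], scores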

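A sketch of the greedy wrapper search from the logistic-regression example, scored here with m-fold cross-validated accuracy; it assumes scikit-learn is available, and the function name greedy_forward_selection and the stopping tolerance tol are choices of this sketch rather than the lecture's.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def greedy_forward_selection(X, y, n_folds=5, tol=1e-3):
    """Start from the empty feature set; at each step add the feature that most improves
    m-fold cross-validated accuracy; stop when no feature improves the score by more than tol."""
    remaining = list(range(X.shape[1]))
    selected, best_score = [], 0.0
    while remaining:
        trials = []
        for k in remaining:
            cols = selected + [k]
            model = LogisticRegression(max_iter=1000)
            score = cross_val_score(model, X[:, cols], y, cv=n_folds).mean()
            trials.append((score, k))
        score, k = max(trials)                  # best candidate feature at this step
        if score <= best_score + tol:           # "when to stop": no meaningful improvement
            break
        selected.append(k)
        remaining.remove(k)
        best_score = score
    return selected, best_score

The cross-validated score plays the role of the overfit control here; the slides' other option, explicit regularization of each added parameter, could be used instead.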

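And the m-fold cross-validation step itself, written out with plain numpy index splitting so the "divide the data into m groups" procedure on the last slide is concrete; model_factory is assumed to be any callable returning a fresh model with fit/predict methods, an interface chosen for this sketch.

import numpy as np

def cross_validation_error(model_factory, X, y, m=10, seed=0):
    """Divide the data into m roughly equal groups; train on m-1 groups, measure the error
    on the held-out group, rotate the held-out group, and average the m error estimates."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))               # shuffle before splitting into folds
    folds = np.array_split(idx, m)
    errors = []
    for i in range(m):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(m) if j != i])
        model = model_factory()
        model.fit(X[train], y[train])
        errors.append(np.mean(model.predict(X[val]) != y[val]))
    return float(np.mean(errors))

# e.g. cross_validation_error(lambda: LogisticRegression(max_iter=1000), X, y, m=10)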