Pitt CS 2750 - Dimensionality reduction / Feature selection

CS 2750 Machine Learning
Lecture 18
Milos [email protected]
Sennott Square

Dimensionality reduction. Feature selection.

Dimensionality reduction. Motivation.
• Classification problem example:
– We have input data $\{x_1, x_2, \ldots, x_N\}$ with $x_i = (x_{i1}, x_{i2}, \ldots, x_{id})$ and a set of corresponding output labels $\{y_1, y_2, \ldots, y_N\}$
– Assume the dimension $d$ of the data point $x$ is very large
– We want to classify $x$
• Problems with high-dimensional input vectors:
– A large number of parameters to learn; if the dataset is small, this can result in:
• large variance of the estimates
• overfitting
– Irrelevant attributes (near duplicates, poor predictors)

Dimensionality reduction. Solutions.
• Selection of a smaller subset of inputs (features) from a large set of inputs; train the classifier on the reduced input set
• Combination of the high-dimensional inputs into a smaller set of features $\phi_k(x)$; train the classifier on the new features

How to find the right subset of inputs/features? We need:
• A criterion for ranking good inputs/features
• A search procedure for finding a good set of features
The feature selection process can be:
• Dependent on the original learning task
– e.g. classification or regression
– The selection of features is affected by what we want to predict
• Independent of the learning task
– e.g. a method that looks at the inputs of a classification problem and tries to reduce their description without regard to the output
– PCA, component analysis, clustering of inputs
– May lack accuracy

Task-dependent feature selection
Assume:
• Classification problem: $x$ – input vector, $y$ – output
• Feature mappings $\{\phi_1(x), \phi_2(x), \ldots, \phi_K(x)\}$
Objective: find a subset of features that gives/preserves most of the output prediction capability.
Selection approaches:
• Filtering approaches
– Filter out features with small potential to predict the outputs well
– Done before classification
• Wrapper approaches
– Select features that directly optimize the accuracy of the classifier

Feature selection through filtering
• Assume:
– Classification problem: $x$ – input vector, $y$ – output
– Feature mappings $\phi_k(x)$ with discrete values
• Note that the output of a feature function behaves like a random variable: $\tilde{\phi}_k$ denotes the random variable representing the output of feature function $k$
• Using ML or MAP parameter estimates we can estimate the probabilities $\tilde{P}(\phi_k \mid y = i)$ and $\tilde{P}(y = i)$, and subsequently compute
$$\tilde{P}(\phi_k) = \sum_i \tilde{P}(\phi_k, y = i), \qquad \tilde{P}(\phi_k, y = i) = \tilde{P}(\phi_k \mid y = i)\,\tilde{P}(y = i)$$

Selection based on mutual information
• Objective: pick only features that provide substantial information about $y$; mutual information measures this dependency:
$$I(\phi_k, y) = \sum_j \sum_i \tilde{P}(\phi_k = j, y = i) \log_2 \frac{\tilde{P}(\phi_k = j, y = i)}{\tilde{P}(\phi_k = j)\,\tilde{P}(y = i)}$$
• Method: select only the features (inputs) with mutual information exceeding some threshold value
– If $\phi_k$ and $y$ are independent random variables, then $\frac{\tilde{P}(\phi_k = j, y = i)}{\tilde{P}(\phi_k = j)\,\tilde{P}(y = i)} = 1$, so the score is 0
• Other similar scores: the correlation coefficient
$$\rho(\phi_k, y) = \frac{\mathrm{Cov}(\phi_k, y)}{\sqrt{\mathrm{Var}(\phi_k)\,\mathrm{Var}(y)}}$$
– Measures only linear dependencies
• What are the drawbacks?
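To make the filtering score concrete, here is a minimal sketch of mutual-information ranking with count-based (ML) probability estimates. The data, function names, and threshold are all hypothetical, chosen only to illustrate the scheme:

```python
# Minimal sketch of mutual-information feature filtering (hypothetical data).
# Features and labels are assumed discrete; probabilities are plain count-based
# (ML) estimates, matching the P~ quantities in the slides.
from collections import Counter
from math import log2

def mutual_information(feature, labels):
    """I(phi_k, y) from empirical joint and marginal distributions."""
    n = len(labels)
    p_joint = Counter(zip(feature, labels))
    p_f = Counter(feature)
    p_y = Counter(labels)
    mi = 0.0
    for (j, i), c in p_joint.items():
        p_ji = c / n
        mi += p_ji * log2(p_ji / ((p_f[j] / n) * (p_y[i] / n)))
    return mi

def filter_features(features, labels, threshold):
    """Keep the indices of features whose MI with the labels exceeds the threshold."""
    return [k for k, f in enumerate(features)
            if mutual_information(f, labels) > threshold]

# Toy example: feature 0 copies the label, feature 1 is empirically independent of it.
y = [0, 0, 1, 1, 0, 1, 0, 1]
phi = [[0, 0, 1, 1, 0, 1, 0, 1],   # perfectly predictive -> MI = 1 bit
       [0, 1, 0, 1, 0, 0, 1, 1]]   # empirically independent of y -> MI = 0
selected = filter_features(phi, y, threshold=0.5)
```

Only the perfectly predictive feature survives this filter; the independent one contributes nothing to predicting y, exactly the case the independence ratio above identifies.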
• Assumptions behind the score:
– Only one feature and its effect on $y$ is incorporated in the mutual information score
– The effects of different features on $y$ are treated as independent
• What to do if a combination of features gives the best prediction?

Feature selection through filtering: dependent features
• Let $\Phi$ be the current set of features (starting from the complete set)
• We can remove a feature $\phi_k(x)$ from it when
$$\tilde{P}(y \mid \Phi \setminus \phi_k) \approx \tilde{P}(y \mid \Phi) \quad \text{for all values of } y$$
• Repeat the removals until the probabilities differ too much
• Problem: how to compute/estimate $\tilde{P}(y \mid \Phi \setminus \phi_k)$ and $\tilde{P}(y \mid \Phi)$?
• Solution: make a simplifying assumption about the underlying probabilistic model
– Example: use a Naïve Bayes model
• Advantages: speed, modularity, applied before classification
• Disadvantage: may not be as accurate

Feature selection using classification errors
Wrapper approach:
• Feature selection is driven by the prediction accuracy of the classifier (or regressor) actually used
How to find the appropriate feature set?
• Idea: greedy search in the space of classifiers
– Gradually add the features that improve the quality score the most
– The score should reflect the accuracy of the classifier (its error) and also prevent overfitting
• Two ways to guard against overfitting:
– Regularization: penalize explicitly for each feature parameter
– Cross-validation (m-fold cross-validation)

Classifier-dependent feature selection
• Example of a greedy search with a logistic regression model with features:
– Start with $p(y = 1 \mid x, w) = g(w_0)$
– Choose the feature $\phi_i(x)$ with the best score: $p(y = 1 \mid x, w) = g(w_0 + w_i \phi_i(x))$
– Choose the feature $\phi_j(x)$ with the best score: $p(y = 1 \mid x, w) = g(w_0 + w_i \phi_i(x) + w_j \phi_j(x))$
– Etc.
• When to stop?

Cross-validation
• Goal: stop the learning at the smallest generalization error (the performance on the population from which the data were drawn)
• A test set can be used to estimate the generalization error
– Data different from the training set
• Validation set = a test set used to stop the learning process
– e.g. the feature selection process
• Cross-validation (m-fold):
– Divide the data into m equal partitions (of size N/m)
– Hold out one partition for validation and train the classifier on the rest of the data
– Repeat so that every partition is held out once
– The estimate of the generalization error of the learner is the mean of the errors of all the classifiers

Principal component analysis (PCA)
• Objective: replace a high-dimensional input with a small set of features obtained by combining the inputs
– Different from feature subset selection!
• PCA:
– A linear transformation of the d-dimensional input $x$ to an M-dimensional feature vector $z$, with $M < d$, under which the retained variance is maximal
– Equivalently, it is the linear projection for which the sum-of-squares reconstruction cost is minimized
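The greedy wrapper search described above can be sketched generically. Here `score` stands in for training the actual classifier (e.g. the logistic regression model) on a feature subset and returning its estimated accuracy; the function name, the stopping threshold, and the toy score are all illustrative assumptions, not the lecture's code:

```python
# Sketch of greedy (forward) wrapper feature selection. `score` is an assumed
# user-supplied callable: it trains the classifier on the given feature subset
# and returns a quality score (e.g. cross-validated accuracy).
def greedy_forward_selection(n_features, score, min_gain=1e-3):
    """Add one feature at a time, keeping the addition that improves the
    score the most; stop when no addition gains at least min_gain."""
    selected = []
    best_score = score(selected)          # start with the bias-only model g(w0)
    remaining = set(range(n_features))
    while remaining:
        gains = {k: score(selected + [k]) for k in remaining}
        best_k = max(gains, key=gains.get)
        if gains[best_k] - best_score < min_gain:
            break                          # answers "when to stop?": no real gain
        selected.append(best_k)
        best_score = gains[best_k]
        remaining.remove(best_k)
    return selected, best_score

# Toy score: features 1 and 3 are each worth +0.2 accuracy over a 0.5 baseline.
toy_score = lambda subset: 0.5 + 0.2 * len(set(subset) & {1, 3})
chosen, acc = greedy_forward_selection(5, toy_score)
```

If the score already includes a regularization penalty or is cross-validated, the `min_gain` stopping rule directly implements the overfitting guard mentioned above.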

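The m-fold cross-validation procedure above can be sketched as follows; `train` and `error` are assumed user-supplied callables, and for simplicity the data size is assumed divisible by m:

```python
# Sketch of m-fold cross-validation: m equal partitions, each held out once;
# the generalization-error estimate is the mean of the held-out errors.
def cross_validate(data, m, train, error):
    size = len(data) // m                  # partitions of size N/m
    folds = [data[i * size:(i + 1) * size] for i in range(m)]
    errs = []
    for i in range(m):
        held_out = folds[i]                # validation partition
        train_set = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = train(train_set)           # train on the remaining data
        errs.append(error(model, held_out))
    return sum(errs) / m                   # mean error over all m classifiers

# Toy usage: "model" = mean of the training data, error = mean squared deviation.
estimate = cross_validate([1, 2, 3, 4], 2,
                          train=lambda d: sum(d) / len(d),
                          error=lambda mdl, held: sum((x - mdl) ** 2 for x in held) / len(held))
```

In the feature-selection setting, `train` would fit the classifier on a candidate feature subset and `error` would be its misclassification rate on the held-out partition.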

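Finally, the PCA objective (maximal retained variance under a linear projection, M < d) can be sketched with NumPy's symmetric eigendecomposition. This is an illustrative implementation under standard assumptions, not code from the lecture:

```python
# Sketch of PCA as a variance-maximizing linear projection (assumes NumPy).
# Rows of X are d-dimensional inputs; the result has the M-dimensional
# feature vectors z obtained by projecting onto the top-M eigenvectors
# of the sample covariance matrix.
import numpy as np

def pca(X, M):
    Xc = X - X.mean(axis=0)                # center the data
    cov = np.cov(Xc, rowvar=False)         # d x d sample covariance
    eigvals, eigvecs = np.linalg.eigh(cov) # eigh: ascending eigenvalues
    W = eigvecs[:, ::-1][:, :M]            # top-M principal directions
    return Xc @ W                          # N x M feature vectors z

# Toy data lying along the line x2 = x1: one direction carries all the variance,
# so a single component (M = 1) preserves it entirely.
X = np.array([[1., 1.], [2., 2.], [3., 3.], [4., 4.]])
Z = pca(X, 1)
```

Projecting onto fewer than d directions and mapping back gives the reconstruction whose summed squared error this choice of directions minimizes, which is the equivalence stated above.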