CS 2750 Machine Learning
Lecture 20
Milos Hauskrecht
[email protected]
5329 Sennott Square

Dimensionality reduction. Feature selection.

Dimensionality reduction. Motivation.
• Is there a lower dimensional representation of the data that captures its characteristics well?
• Assume:
  – We have data D = {x_1, x_2, .., x_N} such that each data point x_i = (x_{i,1}, x_{i,2}, .., x_{i,d})
  – The dimension d of the data point x is very large
  – We want to analyze x
• Methods of analysis are sensitive to the dimensionality d
• Our goal: find a lower dimensional representation of the data
• Two learning problems:
  – supervised
  – unsupervised

Dimensionality reduction for classification
• Classification problem example:
  – We have input data D = {x_1, x_2, .., x_N} with x_i = (x_{i,1}, x_{i,2}, .., x_{i,d}) and a set of corresponding output labels {y_1, y_2, .., y_N}
  – Assume the dimension d of the data point x is very large
  – We want to classify x
• Problems with high dimensional input vectors:
  – A large number of parameters to learn; if the dataset is small this can result in:
    • Large variance of the estimates and overfitting
  – It becomes hard to explain which features are important in the model (too many choices, some can be substitutable)

Dimensionality reduction
• Solutions:
  – Selection of a smaller subset of inputs (features) from the large set of inputs; train the classifier on the reduced input set
  – Combination of the high dimensional inputs into a smaller set of features φ_k(x); train the classifier on the new features
[Diagram: the input vector x is reduced either by selecting a subset of its components (selection) or by combining them into new features φ_k(x) (combination)]
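The two solutions can be made concrete with a short sketch. The snippet below is not from the lecture: it assumes scikit-learn and a synthetic dataset (all sizes are made up) and contrasts selecting a subset of the original inputs with combining them into new features, here via PCA.

```python
# Illustrative sketch (not from the lecture): reduce a d-dimensional input
# either by selecting a subset of the original inputs (feature selection)
# or by combining inputs into a smaller set of new features (here, PCA).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

# Synthetic data: N = 200 examples, d = 500 inputs, only a few informative.
X, y = make_classification(n_samples=200, n_features=500,
                           n_informative=10, random_state=0)

# Solution 1: selection -- keep the 10 inputs with the best univariate score.
X_selected = SelectKBest(score_func=f_classif, k=10).fit_transform(X, y)

# Solution 2: combination -- build 10 new features phi_k(x) as linear
# combinations of all 500 original inputs.
X_combined = PCA(n_components=10).fit_transform(X)

print(X.shape, X_selected.shape, X_combined.shape)  # (200, 500) (200, 10) (200, 10)
```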
Feature selection
How to find a good subset of inputs/features?
• We need:
  – A criterion for ranking good inputs/features
  – A search procedure for finding a good set of features
• The feature selection process can be:
  – Dependent on the learning task, e.g. classification
    • The selection of features is affected by what we want to predict
  – Independent of the learning task
    • Unsupervised methods
    • May lack the accuracy needed for classification/regression tasks

Task-dependent feature selection
Assume:
• Classification problem: x – input vector, y – output
Objective: find a subset of inputs/features that gives/preserves most of the output prediction capability
Selection approaches:
• Filtering approaches
  – Filter out features with small predictive potential
  – Done before classification; typically uses univariate analysis
• Wrapper approaches
  – Select features that directly optimize the accuracy of the multivariate classifier
• Embedded methods
  – Feature selection and learning are closely tied together in the method

Feature selection through filtering
• Assume:
  – Classification problem: x – input vector, y – output
  – Inputs in x, or some fixed feature mappings φ_k(x)
• How to select the features:
  – Univariate analysis
    • Pretend that only one variable, x_k, exists
    • See how well it predicts the output y alone
  – Example:
    • Differentially expressed features (or inputs)
    • Good separation of the two classes in binary (case/control) settings

Differentially expressed features
• Scores for measuring the differential expression:
  – T-test score (Baldi & Long)
    • Based on a test of whether the two groups come from the same population
  – Fisher score:
    Fisher(i) = (μ_i^(+) − μ_i^(−))² / (σ_i^(+)² + σ_i^(−)²)
  – AUROC score: area under the Receiver Operating Characteristic curve
• Problems:
  – If there are many random features and not many instances to learn from, features with a good differential expression score will arise by chance
  – Techniques to control the FDR (false discovery rate) and FWER (family-wise error rate) address this

Feature filtering
Other univariate scores:
• Correlation coefficient
  – Measures linear dependences
    ρ(x_k, y) = Cov(x_k, y) / sqrt(Var(x_k) Var(y))
• Mutual information
    I(x_k, y) = Σ_j Σ_i P̃(x_k = j, y = i) log_2 [ P̃(x_k = j, y = i) / (P̃(x_k = j) P̃(y = i)) ]
• Univariate assumptions:
  – Only one feature and its effect on y is incorporated in the score
  – Effects of two features on y are assumed independent
• What to do if a combination of features gives the best prediction?

Feature selection: dependent features
Filtering with dependent features
• Let φ be the current set of features φ_k(x) (start from the complete set)
• We can remove feature φ_k from it when
    P̃(y | φ \ φ_k) = P̃(y | φ)   for all values of y and φ_k
• Repeat the removals until the probabilities differ
Problem: how to compute/estimate P̃(y | φ \ φ_k) and P̃(y | φ)?
Solution: make a simplifying assumption about the underlying probabilistic model
• Example: use a Naïve Bayes model
• Advantages: speed, modularity, applied before classification
• Disadvantage: may not be as accurate

Feature selection: wrappers
Wrapper approach:
• The feature selection is driven by the prediction accuracy of the classifier (regressor) we actually want to build
How to find the appropriate feature set?
• For d binary features there are 2^d different feature subsets
• Idea: greedy search in the space of classifiers
  – Gradually add the features that improve the quality score the most
  – Gradually remove the features that affect the accuracy the least
  – The score should reflect the accuracy of the classifier (error) and also prevent overfitting
• Standard way to measure the quality:
  – Internal cross-validation (m-fold cross-validation)
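A minimal sketch of the wrapper idea above, assuming scikit-learn, a logistic regression classifier, and a synthetic dataset (none of which are specified in the lecture): greedy forward selection in which each candidate feature set is scored by internal m-fold cross-validation.

```python
# Illustrative sketch (not the lecture's code): greedy forward wrapper
# selection, where each candidate feature set is scored by the internal
# m-fold cross-validation accuracy of the classifier we want to build.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=30,
                           n_informative=5, random_state=0)

selected, remaining = [], list(range(X.shape[1]))
best_score = 0.0
while remaining:
    # Try adding each remaining feature and keep the one that helps most.
    scores = {j: cross_val_score(LogisticRegression(max_iter=1000),
                                 X[:, selected + [j]], y, cv=5).mean()
              for j in remaining}
    j_best, s_best = max(scores.items(), key=lambda kv: kv[1])
    if s_best <= best_score:       # stop when no feature improves the score
        break
    selected.append(j_best)
    remaining.remove(j_best)
    best_score = s_best

print("selected features:", selected, "cv accuracy: %.3f" % best_score)
```

The stopping rule (quit once no remaining feature improves the cross-validation score) is one simple way to keep the score from rewarding overfitting; the lecture leaves the exact quality score unspecified.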
Internal cross-validation
• Split the training set into internal train and test sets
• Internal train set: train the different models (defined, e.g., on different subsets of features)
• Internal test set(s): estimate the generalization error and select the best model among the possible models
• Internal cross-validation (m-fold):
  – Divide the training data into m equal partitions (of size N/m)
  – Hold out one partition for validation and train the classifiers on the rest of the data
  – Repeat so that every partition is held out once
  – The estimate of the generalization error of the learner is the mean of the errors over all partitions
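For concreteness, a minimal sketch of the m-fold procedure above, with the model and data chosen only for illustration (logistic regression on a synthetic scikit-learn dataset): each partition is held out once, and the held-out errors are averaged.

```python
# Minimal sketch of m-fold internal cross-validation as described above:
# divide the training data into m partitions, hold each one out in turn,
# train on the rest, and average the held-out errors. The model and data
# below are illustrative, not part of the lecture.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

m = 5
errors = []
for train_idx, val_idx in KFold(n_splits=m, shuffle=True, random_state=0).split(X):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    errors.append(1.0 - model.score(X[val_idx], y[val_idx]))  # held-out error

# Estimate of the generalization error = mean error over the m partitions.
print("estimated generalization error: %.3f" % np.mean(errors))
```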