CS 2750 Machine Learning
Lecture 20
Milos Hauskrecht
[email protected]
5329 Sennott Square

Dimensionality reduction. Feature selection.

Dimensionality reduction. Motivation.
• Is there a lower dimensional representation of the data that captures its characteristics well?
• Assume:
  – We have data D = {x_1, x_2, .., x_N} such that each data point x_i = (x_{i,1}, x_{i,2}, .., x_{i,d})
  – The dimension d of the data point x is very large
  – We want to analyze x
• Methods of analysis are sensitive to the dimensionality d
• Our goal: find a lower dimensional representation of the data
• Two learning problems:
  – supervised
  – unsupervised

Dimensionality reduction for classification
• Classification problem example:
  – We have input data D = {x_1, x_2, .., x_N} with x_i = (x_{i,1}, x_{i,2}, .., x_{i,d}) and a set of corresponding output labels {y_1, y_2, .., y_N}
  – Assume the dimension d of the data point x is very large
  – We want to classify x
• Problems with high dimensional input vectors:
  – A large number of parameters to learn; if the dataset is small this can result in:
    • Large variance of the estimates and overfitting
  – It becomes hard to explain which features are important in the model (too many choices, some can be substitutable)

Dimensionality reduction
• Solutions:
  – Selection of a smaller subset of inputs (features) from the large set of inputs; train the classifier on the reduced input set
  – Combination of the high dimensional inputs into a smaller set of features φ_k(x); train the classifier on the new features
[Diagram: the input vector x is reduced either by selecting a subset of its components (selection) or by combining them into new features φ_k(x) (combination)]
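The two solutions can be made concrete with a short sketch. The snippet below is not from the lecture: it assumes scikit-learn and a synthetic dataset (all sizes are made up) and contrasts selecting a subset of the original inputs with combining them into new features, here via PCA.

```python
# Illustrative sketch (not from the lecture): reduce a d-dimensional input
# either by selecting a subset of the original inputs (feature selection)
# or by combining inputs into a smaller set of new features (here, PCA).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

# Synthetic data: N = 200 examples, d = 500 inputs, only a few informative.
X, y = make_classification(n_samples=200, n_features=500,
                           n_informative=10, random_state=0)

# Solution 1: selection -- keep the 10 inputs with the best univariate score.
X_selected = SelectKBest(score_func=f_classif, k=10).fit_transform(X, y)

# Solution 2: combination -- build 10 new features phi_k(x) as linear
# combinations of all 500 original inputs.
X_combined = PCA(n_components=10).fit_transform(X)

print(X.shape, X_selected.shape, X_combined.shape)  # (200, 500) (200, 10) (200, 10)
```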
Feature selection
How to find a good subset of inputs/features?
• We need:
  – A criterion for ranking good inputs/features
  – A search procedure for finding a good set of features
• The feature selection process can be:
  – Dependent on the learning task, e.g. classification
    • The selection of features is affected by what we want to predict
  – Independent of the learning task
    • Unsupervised methods
    • May lack the accuracy needed for classification/regression tasks

Task-dependent feature selection
Assume:
• Classification problem: x – input vector, y – output
Objective: find a subset of inputs/features that gives/preserves most of the output prediction capability
Selection approaches:
• Filtering approaches
  – Filter out features with small predictive potential
  – Done before classification; typically uses univariate analysis
• Wrapper approaches
  – Select features that directly optimize the accuracy of the multivariate classifier
• Embedded methods
  – Feature selection and learning are closely tied together in the method

Feature selection through filtering
• Assume:
  – Classification problem: x – input vector, y – output
  – Inputs in x, or some fixed feature mappings φ_k(x)
• How to select the features:
  – Univariate analysis
    • Pretend that only one variable, x_k, exists
    • See how well it predicts the output y alone
  – Example:
    • Differentially expressed features (or inputs)
    • Good separation of the two classes in binary (case/control) settings

Differentially expressed features
• Scores for measuring the differential expression:
  – T-test score (Baldi & Long)
    • Based on a test of whether the two groups come from the same population
  – Fisher score:
    Fisher(i) = (μ_i^(+) − μ_i^(−))² / (σ_i^(+)² + σ_i^(−)²)
  – AUROC score: area under the Receiver Operating Characteristic curve
• Problems:
  – If there are many random features and not many instances to learn from, features with a good differential expression score will arise by chance
  – Techniques to control the FDR (false discovery rate) and FWER (family-wise error rate) address this

Feature filtering
Other univariate scores:
• Correlation coefficient
  – Measures linear dependences
    ρ(x_k, y) = Cov(x_k, y) / sqrt(Var(x_k) Var(y))
• Mutual information
    I(x_k, y) = Σ_j Σ_i P̃(x_k = j, y = i) log_2 [ P̃(x_k = j, y = i) / (P̃(x_k = j) P̃(y = i)) ]
• Univariate assumptions:
  – Only one feature and its effect on y is incorporated in the score
  – Effects of two features on y are assumed independent
• What to do if a combination of features gives the best prediction?

Feature selection: dependent features
Filtering with dependent features
• Let φ be the current set of features φ_k(x) (start from the complete set)
• We can remove feature φ_k from it when
    P̃(y | φ \ φ_k) = P̃(y | φ)   for all values of y and φ_k
• Repeat the removals until the probabilities differ
Problem: how to compute/estimate P̃(y | φ \ φ_k) and P̃(y | φ)?
Solution: make a simplifying assumption about the underlying probabilistic model
• Example: use a Naïve Bayes model
• Advantages: speed, modularity, applied before classification
• Disadvantage: may not be as accurate

Feature selection: wrappers
Wrapper approach:
• The feature selection is driven by the prediction accuracy of the classifier (regressor) we actually want to build
How to find the appropriate feature set?
• For d binary features there are 2^d different feature subsets
• Idea: greedy search in the space of classifiers
  – Gradually add the features that improve the quality score the most
  – Gradually remove the features that affect the accuracy the least
  – The score should reflect the accuracy of the classifier (error) and also prevent overfitting
• Standard way to measure the quality:
  – Internal cross-validation (m-fold cross-validation)
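A minimal sketch of the wrapper idea above, assuming scikit-learn, a logistic regression classifier, and a synthetic dataset (none of which are specified in the lecture): greedy forward selection in which each candidate feature set is scored by internal m-fold cross-validation.

```python
# Illustrative sketch (not the lecture's code): greedy forward wrapper
# selection, where each candidate feature set is scored by the internal
# m-fold cross-validation accuracy of the classifier we want to build.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=30,
                           n_informative=5, random_state=0)

selected, remaining = [], list(range(X.shape[1]))
best_score = 0.0
while remaining:
    # Try adding each remaining feature and keep the one that helps most.
    scores = {j: cross_val_score(LogisticRegression(max_iter=1000),
                                 X[:, selected + [j]], y, cv=5).mean()
              for j in remaining}
    j_best, s_best = max(scores.items(), key=lambda kv: kv[1])
    if s_best <= best_score:       # stop when no feature improves the score
        break
    selected.append(j_best)
    remaining.remove(j_best)
    best_score = s_best

print("selected features:", selected, "cv accuracy: %.3f" % best_score)
```

The stopping rule (quit once no remaining feature improves the cross-validation score) is one simple way to keep the score from rewarding overfitting; the lecture leaves the exact quality score unspecified.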
Internal cross-validation
• Split the training set into internal train and test sets
• Internal train set: train the different models (defined, e.g., on different subsets of features)
• Internal test set(s): estimate the generalization error and select the best model among the possible models
• Internal cross-validation (m-fold):
  – Divide the training data into m equal partitions (of size N/m)
  – Hold out one partition for validation and train the classifiers on the rest of the data
  – Repeat so that every partition is held out once
  – The estimate of the generalization error of the learner is the mean of the errors over all partitions
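For concreteness, a minimal sketch of the m-fold procedure above, with the model and data chosen only for illustration (logistic regression on a synthetic scikit-learn dataset): each partition is held out once, and the held-out errors are averaged.

```python
# Minimal sketch of m-fold internal cross-validation as described above:
# divide the training data into m partitions, hold each one out in turn,
# train on the rest, and average the held-out errors. The model and data
# below are illustrative, not part of the lecture.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

m = 5
errors = []
for train_idx, val_idx in KFold(n_splits=m, shuffle=True, random_state=0).split(X):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    errors.append(1.0 - model.score(X[val_idx], y[val_idx]))  # held-out error

# Estimate of the generalization error = mean error over the m partitions.
print("estimated generalization error: %.3f" % np.mean(errors))
```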