Reducing Data Dimension
Machine Learning 10-701, November 2005
Tom M. Mitchell
Carnegie Mellon University

Required reading:
• Bishop, chapters 3.6, 8.6
Recommended reading:
• Wall et al., 2003

Outline
• Feature selection
– Single-feature scoring criteria
– Search strategies
• Unsupervised dimension reduction using all features
– Principal Components Analysis
– Singular Value Decomposition
– Independent Components Analysis
• Supervised dimension reduction
– Fisher Linear Discriminant
– Hidden layers of neural networks

Dimensionality Reduction: Why?
• Learning a target function from data where some features are irrelevant - reduce variance, improve accuracy
• Wish to visualize high-dimensional data
• Sometimes have data whose “intrinsic” dimensionality is smaller than the number of features used to describe it - recover the intrinsic dimension

Supervised Feature Selection
Problem: wish to learn f: X → Y, where X = <X1, …, XN>, but suspect not all Xi are relevant.
Approach: preprocess the data to select only a subset of the Xi
• Score each feature, or subsets of features - how?
• Search for a useful subset of features to represent the data - how?

Scoring Individual Features Xi
Common scoring methods:
• Training or cross-validated accuracy of single-feature classifiers fi: Xi → Y
• Estimated mutual information between Xi and Y
• χ2 statistic to measure independence between Xi and Y
• Domain-specific criteria
– Text: score “stop” words (“the”, “of”, …) as zero
– fMRI: score each voxel by a T-test for activation versus rest condition
– …

Choosing a Set of Features to Learn f: X → Y
Common methods:
Forward1: choose the n features with the highest scores
Forward2:
– Choose the single highest-scoring feature Xk
– Rescore all features, conditioned on the set of already-selected features
• E.g., Score(Xi | Xk) = I(Xi, Y | Xk)
• E.g., Score(Xi | Xk) = Accuracy(predicting Y from Xi and Xk)
– Repeat, calculating new scores on each iteration, conditioning on the set of selected features

Choosing a Set of Features
Common methods:
Backward1: Start with all
features, delete the n with the lowest scores
Backward2: start with all features, score each feature conditioned on the assumption that all others are included. Then:
– Remove the feature with the lowest (conditioned) score
– Rescore all features, conditioned on the new, reduced feature set
– Repeat

Feature Selection: Text Classification [Rogati & Yang, 2002]
IG = information gain, chi = χ2, DF = document frequency. There are approximately 10^5 words in English.

Impact of Feature Selection on Classification of fMRI Data [Pereira et al., 2005]
Accuracy classifying the category of word read by the subject; voxels scored by the p-value of a regression predicting voxel value from the task.

Summary: Supervised Feature Selection
Approach: preprocess the data to select only a subset of the Xi
• Score each feature
– Mutual information, prediction accuracy, …
• Find a useful subset of features based on their scores
– Greedy addition of features to the pool
– Greedy deletion of features from the pool
– Features considered independently, or in the context of other selected features
Always do feature selection using the training set only (not the test set!)
– Often use a nested cross-validation loop:
• Outer loop to get an unbiased estimate of final classifier accuracy
• Inner loop to test the impact of selecting features

Unsupervised Dimensionality Reduction

Unsupervised Mapping to a Lower Dimension
Differs from feature selection in two ways:
• Instead of choosing a subset of the features, create new features (dimensions) defined as functions over all features
• Don’t consider class labels, just the data points

Principal Components Analysis
• Idea:
– Given data points in d-dimensional space, project into a lower-dimensional space while preserving as much information as possible
• E.g., find the best planar approximation to 3D data
• E.g., find the best planar approximation to 10^4-dimensional data
– In particular, choose the projection that minimizes the squared error in reconstructing the original data

PCA: Find Projections to Minimize Reconstruction Error
Assume the data is a set of d-dimensional vectors, where the nth vector is x^n. We can represent
these in terms of any d orthogonal basis vectors u1, …, ud:

x^n = Σ_{i=1..d} z_i^n u_i

PCA: given M < d, find the u_i that minimize

E_M = Σ_n || x^n − x̂^n ||^2

where x̂^n = x̄ + Σ_{i=1..M} z_i^n u_i and x̄ = (1/N) Σ_n x^n is the mean.

Note we get zero error if M = d. Therefore

E_M = Σ_{i=M+1..d} u_i^T Σ u_i

where Σ is the covariance matrix: Σ = (1/N) Σ_n (x^n − x̄)(x^n − x̄)^T.
E_M is minimized when each u_i is an eigenvector of Σ, i.e., when

Σ u_i = λ_i u_i   (u_i an eigenvector of Σ, λ_i the corresponding eigenvalue)

PCA algorithm 1:
1. X ← create an N × d data matrix, with one row vector x^n per data point
2. X ← subtract the mean x̄ from each row vector x^n in X
3. Σ ← covariance matrix of X
4. Find the eigenvectors and eigenvalues of Σ
5. PCs ← the M eigenvectors with the largest eigenvalues

PCA Example
[figure: 2D data cloud showing the mean, the first eigenvector, the second eigenvector, and the reconstructed data using only the first eigenvector (M = 1)]

Very Nice When the Initial Dimension Is Not Too Big
What if we have very high-dimensional data?
• e.g., images (d ≥ 10^4)
Problem:
• The covariance matrix Σ is of size d × d
• d = 10^4 → Σ has 10^8 entries
Singular Value Decomposition (SVD) to the rescue!
• Pretty efficient algorithms are available, including the Matlab SVD
• Some implementations find just the top N eigenvectors

SVD [from Wall et al., 2003]
X = U S V^T, where X is the data matrix, one row per data point.
Rows of V^T are unit-length eigenvectors of X^T X.
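PCA algorithm 1 above can be sketched in Python with numpy. This is a minimal sketch, not code from the lecture; the function name and the test data are illustrative.

```python
import numpy as np

def pca_eig(X, M):
    """PCA via eigendecomposition of the covariance matrix (PCA algorithm 1).

    X: (N, d) data matrix, one row vector x^n per data point.
    Returns (components, eigenvalues, mean): the top-M eigenvectors as rows,
    their eigenvalues, and the mean used for centering.
    """
    x_bar = X.mean(axis=0)
    Xc = X - x_bar                          # step 2: subtract the mean
    cov = Xc.T @ Xc / len(X)                # step 3: covariance matrix (d x d)
    eigvals, eigvecs = np.linalg.eigh(cov)  # step 4: eigh, since cov is symmetric
    order = np.argsort(eigvals)[::-1][:M]   # step 5: M largest eigenvalues
    return eigvecs[:, order].T, eigvals[order], x_bar

# Synthetic data with intrinsic dimension ~1 embedded in 3 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1)) @ np.array([[2.0, 1.0, 0.5]]) \
    + 0.01 * rng.normal(size=(100, 3))

U, lam, mu = pca_eig(X, M=1)
Z = (X - mu) @ U.T                          # coordinates z^n in PC space
X_hat = mu + Z @ U                          # reconstruction x̂^n from M=1 PCs
```

Because the data is nearly one-dimensional, reconstructing from a single principal component recovers it almost exactly, illustrating the "zero error if M = d" limit.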
If the columns of X have zero mean, then X^T X = c Σ and the eigenvectors are the principal components.
S is diagonal with s_k > s_{k+1}; s_k^2 is the kth largest eigenvalue.
US gives the coordinates of the rows of X in the space of principal components.

Singular Value Decomposition
To generate the principal components:
• Subtract the mean from each data point, to create zero-centered data
• Create the matrix X with one row vector per (zero-centered) data point
• Solve the SVD: X = U S V^T
• Output the principal components: the columns of V (= the rows of V^T)
– Eigenvectors in V are sorted from largest to smallest eigenvalue
– S is diagonal, with s_k^2 giving the eigenvalue for the kth eigenvector

To project a point (column vector x) into PC coordinates, compute V^T x.
If x_i is the ith row of the data matrix X, then
• (ith row of US)^T = V^T x_i^T
• (US)^T = V^T X^T
To project a column vector x onto the M-dimensional principal-components subspace, take just the first M coordinates of V^T x.

Independent Components Analysis
• PCA seeks directions <Y1, …, YM> in feature space X that minimize reconstruction error
• ICA seeks directions <Y1, …, YM> that are most statistically independent, i.e., that minimize I(Y), the mutual information between the Yj:

I(Y) = Σ_j H(Yj) − H(Y1, …, YM)
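The SVD route to the principal components can be sketched the same way (a numpy sketch; variable names are ours). It also checks numerically the identity (US)^T = V^T X^T used above for projecting points.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))
X = X - X.mean(axis=0)                 # zero-center the data first

# Solve the SVD: X = U S V^T (economy size; singular values come out descending)
U, s, Vt = np.linalg.svd(X, full_matrices=False)

components = Vt                        # principal components: rows of V^T
eigvals = s ** 2 / len(X)              # s_k^2 / N = kth eigenvalue of cov(X)

# US gives the coordinates of the rows of X in PC space; equivalently,
# each point's coordinates are V^T x for that (column-vector) point.
coords = U * s                         # same as U @ np.diag(s)
assert np.allclose(coords, X @ Vt.T)   # (US)^T = V^T X^T, transposed

# Project onto the M-dimensional PC subspace: first M coordinates of V^T x.
M = 2
proj = X @ Vt[:M].T                    # one M-dimensional row per data point
```

The d × d covariance matrix is never formed, which is the point of the SVD route when d is large.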
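Returning to the feature-scoring criteria from the start of the lecture: for discrete Xi and Y, the estimated mutual information I(Xi; Y) can be computed from empirical counts, and the Forward1 strategy just keeps the n best-scoring features. A minimal sketch; the function names and toy data are ours, not from the lecture.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Estimate I(X; Y) in bits from paired samples of two discrete variables."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x,y) * log2( p(x,y) / (p(x) p(y)) ), with probabilities = counts / n
        mi += (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi

def forward1(features, labels, n):
    """Forward1: score each feature by I(Xi; Y), keep the n highest scoring."""
    scores = {name: mutual_information(col, labels)
              for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:n]

# A feature identical to the label scores 1 bit; an independent one scores 0.
y = [0, 1, 0, 1, 0, 1, 0, 1]
feats = {"copy": y[:], "noise": [0, 0, 1, 1, 0, 0, 1, 1]}
print(forward1(feats, y, 1))  # -> ['copy']
```

Forward2 would differ only in rescoring the remaining features conditioned on the already-selected set after each pick, rather than scoring every feature once.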