Machine Learning 10-701, November 2005
Tom M. Mitchell, Carnegie Mellon University

Outline: Reducing Data Dimension
- Feature selection
  - Single-feature scoring criteria
  - Search strategies
- Unsupervised dimension reduction using all features
  - Principal Components Analysis (PCA)
  - Singular Value Decomposition (SVD)
  - Independent Components Analysis (ICA)
- Supervised dimension reduction
  - Fisher Linear Discriminant
  - Hidden layers of neural networks

Required reading: Bishop, sections 3.6 and 8.6
Recommended reading: Wall et al., 2003

Dimensionality Reduction: Why?
- Learning a target function from data where some features are irrelevant: reduce variance, improve accuracy (supervised feature selection)
- We wish to visualize high-dimensional data
- Sometimes we have data whose intrinsic dimensionality is smaller than the number of features used to describe it: recover the intrinsic dimension

Supervised Feature Selection
Problem: we wish to learn f: X -> Y, where X = <X1, ..., XN>, but we suspect that not all of the Xi are relevant.
Approach: preprocess the data to select only a subset of the Xi.
- Score each feature, or subsets of features (how?)
- Search for a useful subset of features to represent the data (how?)

Scoring Individual Features Xi
Common scoring methods:
- Training or cross-validated accuracy of single-feature classifiers fi: Xi -> Y
- Estimated mutual information I(Xi; Y) between Xi and Y
- Chi-squared statistic to measure independence between Xi and Y
- Domain-specific criteria, e.g.:
  - Text: score "stop" words ("the", "of", ...) as zero
  - fMRI: score each voxel by a t-test for activation versus the rest condition

Choosing a Set of Features to Learn F: X -> Y
Common methods:
- Forward1: choose the n features with the highest scores.
- Forward2: choose the single highest-scoring feature Xk; rescore all features, conditioned on the set of already-selected features, e.g. Score(Xi | Xk) = I(Xi; Y | Xk), or Score(Xi | Xk) = accuracy of predicting Y from Xi and Xk; repeat, calculating new scores on each iteration, conditioning on the set of selected features.
- Backward1: start with all features and delete the n with the lowest scores.
- Backward2: start with all features; score each feature conditioned on the assumption that all the others are included; then remove the feature with the lowest conditioned score; rescore all features conditioned on the new, reduced feature set; repeat.
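As a concrete sketch of the scoring-and-search steps above, the code below estimates the mutual information I(Xi; Y) for discrete-valued features and applies the Forward1 strategy (keep the n highest-scoring features). This is an illustrative implementation, not code from the lecture; the function names are our own.

```python
import numpy as np

def mutual_information(x, y):
    """Estimate I(X; Y) in bits from paired samples of two
    discrete-valued variables."""
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            p_xy = np.mean((x == xv) & (y == yv))   # joint probability
            p_x, p_y = np.mean(x == xv), np.mean(y == yv)
            if p_xy > 0:
                mi += p_xy * np.log2(p_xy / (p_x * p_y))
    return mi

def forward1(X, y, n):
    """Forward1: return indices of the n features with the highest
    individual mutual-information scores."""
    scores = [mutual_information(X[:, i], y) for i in range(X.shape[1])]
    return np.argsort(scores)[::-1][:n]
```

For a balanced binary label, a feature that simply copies the label scores I = 1 bit, while irrelevant features score near zero, so forward1 ranks the informative feature first.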
Feature Selection: Text Classification (Rogati & Yang, 2002)
There are approximately 10^5 words in English.
[Figure: classification accuracy versus number of selected words, comparing scoring by IG (information gain), chi-squared, and DF (document frequency)]

Impact of Feature Selection on Classification of fMRI Data (Pereira et al., 2005)
Accuracy classifying the category of the word read by the subject; voxels scored by the p-value of a regression predicting the voxel value from the task.

Summary: Supervised Feature Selection
Approach: preprocess the data to select only a subset of the Xi.
- Score each feature: mutual information, prediction accuracy, ...
- Find a useful subset of features based on their scores:
  - Greedy addition of features to the pool
  - Greedy deletion of features from the pool
  - Features considered independently, or in the context of other selected features
Always do feature selection using the training set only, not the test set!
- Often we use a nested cross-validation loop:
  - Outer loop to get an unbiased estimate of the final classifier's accuracy
  - Inner loop to test the impact of selecting features

Unsupervised Dimensionality Reduction
Unsupervised mapping to a lower dimension. This differs from feature selection in two ways:
- Instead of choosing a subset of the features, we create new features (dimensions) defined as functions over all of the features
- We don't consider class labels, just the data points

Principal Components Analysis
Idea: given data points in d-dimensional space, project them into a lower-dimensional space while preserving as much information as possible.
- E.g., find the best planar approximation to 3D data
- E.g., find the best planar approximation to 10^4-D data
- In particular, choose the projection that minimizes the squared error in reconstructing the original data
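The claim that the squared-error-minimizing projection is the top eigenvector of the data's covariance matrix (derived next) can be checked numerically. A small sketch, with made-up data and our own helper name:

```python
import numpy as np

def reconstruction_error(X, u):
    """Total squared error when each zero-mean point x_n is replaced by
    its projection (x_n . u) u onto the unit direction u."""
    proj = np.outer(X @ u, u)          # rank-1 reconstruction of X
    return float(np.sum((X - proj) ** 2))

# Illustrative 2D data, strongly correlated across the two axes.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 2.5], [2.5, 3.0]])
X = X - X.mean(axis=0)

# Compare projecting onto the top eigenvector of the covariance matrix
# versus the bottom one: the top eigenvector gives the smaller error.
eigvals, eigvecs = np.linalg.eigh(X.T @ X / len(X))   # ascending order
u_best, u_worst = eigvecs[:, -1], eigvecs[:, 0]
```

Here reconstruction_error(X, u_best) comes out far smaller than reconstruction_error(X, u_worst); in fact it equals N times the smallest eigenvalue, i.e. the variance left in the discarded direction.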
PCA: Find Projections to Minimize Reconstruction Error
Assume the data is a set of N d-dimensional vectors, where the nth vector is x_n. We can represent these exactly in terms of any d orthonormal basis vectors u_1, ..., u_d:
  x_n = sum_{i=1}^{d} z_{ni} u_i
[Figure: data in the (x1, x2) plane, with orthogonal directions u1 and u2]

PCA: given M < d, find the directions u_1, ..., u_M (and coefficients z_{ni}) that minimize the reconstruction error
  E_M = sum_{n=1}^{N} || x_n - x_hat_n ||^2,  where  x_hat_n = x_bar + sum_{i=1}^{M} z_{ni} u_i
Note that we get zero error if M = d; the error therefore comes only from the d - M discarded directions. E_M is minimized when the u_i are eigenvectors of the covariance matrix Sigma, i.e. when
  Sigma u_i = lambda_i u_i
where
  x_bar = (1/N) sum_{n} x_n  (the mean)
  Sigma = (1/N) sum_{n} (x_n - x_bar)(x_n - x_bar)^T  (the covariance matrix)
[Figure: PCA example in the (x1, x2) plane; the reconstruction error is minimized by projecting onto u1, the eigenvector of Sigma with the largest eigenvalue]

PCA algorithm 1:
1. X <- create an N x d data matrix, with one row vector x_n per data point
2. X <- subtract the mean x_bar from each row vector x_n in X
3. Sigma <- covariance matrix of X
4. Find the eigenvectors and eigenvalues of Sigma
5. PCs <- the M eigenvectors with the largest eigenvalues

PCA Example
[Figure: 2D data with its mean, first eigenvector, and second eigenvector; reconstructed data using only the first eigenvector (M = 1)]

PCA is very nice when the initial dimension is not too big. But what about very high-dimensional data, e.g. images (d ~ 10^4)?
- Problem: the covariance matrix is of size d x d; for d = 10^4 that is 10^8 entries
- Singular Value Decomposition (SVD) to the rescue!
  - Pretty efficient algorithms are available, including the Matlab SVD
  - Some implementations find just the top N eigenvectors

SVD (from Wall et al., 2003)
Write the data matrix X (one row per data point) as X = U S V^T.
- The rows of V^T are unit-length eigenvectors of X^T X; if the columns of X have zero mean, then X^T X = c Sigma and the eigenvectors are the principal components
- U S gives the coordinates of the rows of X in the space of principal components
- S is diagonal, with s_k > s_{k+1}; s_k^2 is the kth largest eigenvalue

Singular Value Decomposition
To generate the principal components:
- Subtract the mean x_bar from each data point, to create zero-centered data
- Create a matrix X with one row vector per zero-centered data point
- Solve the SVD: X = U S V^T
- Output the principal components: the columns of V (i.e. the rows of V^T). The eigenvectors in V are sorted from largest to smallest eigenvalue, and S is diagonal, with s_k^2 giving the eigenvalue for the kth eigenvector

To project a point (column vector) x into PC coordinates, compute V^T x.
- Note that if x_i is the ith row of the data matrix X, then the ith row of U S is (V^T x_i^T)^T; equivalently, (U S)^T = V^T X^T
- To project a column vector x onto the M-dimensional principal-components subspace, take just the first M coordinates of V^T x
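The SVD recipe above can be sketched in a few lines of NumPy. This is an illustrative implementation, not the lecture's code, and it reports eigenvalues of the covariance matrix (s_k^2 / N) rather than of X^T X:

```python
import numpy as np

def pca_svd(X, M):
    """PCA via SVD: zero-center the rows of X, solve X = U S V^T, and
    keep the first M rows of V^T as the principal components."""
    x_bar = X.mean(axis=0)
    Xc = X - x_bar                      # zero-centered data
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:M]                 # rows are the top-M principal directions
    coords = Xc @ components.T          # equals (U * s)[:, :M]
    eigenvalues = s[:M] ** 2 / len(X)   # s_k^2 / N = kth covariance eigenvalue
    return components, coords, eigenvalues
```

With M = d the data is reconstructed exactly as coords @ components + mean; with M < d, keeping the components with the largest eigenvalues minimizes the reconstruction error, matching the eigenvector derivation above.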
Independent Components Analysis
- PCA seeks directions Y1, ..., YM in the feature space X that minimize reconstruction error
- ICA instead seeks directions Y1, ..., YM that are most statistically independent, i.e. that minimize I(Y), the mutual information between the Yj; this maximizes their departure from Gaussianity
[Figure: ICA with independent spatial components]
Example: blind source separation
- The original features are microphones at a cocktail party
- Each microphone receives sounds from multiple people speaking
- ICA outputs directions that correspond to the individual speakers
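A minimal sketch of blind source separation. The slides do not name a specific algorithm, so this uses a deflation-style FastICA with a tanh nonlinearity (one standard choice); all names and convergence settings below are illustrative.

```python
import numpy as np

def fast_ica(X, n_components, n_iter=200, seed=0):
    """Estimate statistically independent components from mixed signals
    X (n_samples x n_features), FastICA-style: whiten the data, then
    find one direction at a time with a fixed-point update."""
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)
    # Whiten: rotate and rescale so the data has identity covariance.
    eigval, eigvec = np.linalg.eigh(Xc.T @ Xc / len(Xc))
    Xw = Xc @ eigvec / np.sqrt(eigval)
    W = np.zeros((n_components, X.shape[1]))
    for i in range(n_components):
        w = rng.normal(size=X.shape[1])
        w /= np.linalg.norm(w)
        for _ in range(n_iter):
            s = Xw @ w
            g, g_prime = np.tanh(s), 1.0 - np.tanh(s) ** 2
            # Fixed-point update: w <- E[x g(w.x)] - E[g'(w.x)] w
            w_new = (Xw * g[:, None]).mean(axis=0) - g_prime.mean() * w
            w_new -= W[:i].T @ (W[:i] @ w_new)   # deflation: stay orthogonal
            w_new /= np.linalg.norm(w_new)
            converged = abs(w_new @ w) > 1.0 - 1e-9
            w = w_new
            if converged:
                break
        W[i] = w
    return Xw @ W.T   # recovered components, one per column
```

Cocktail-party style demo: mix a sine and a square wave through a 2 x 2 mixing matrix; the recovered columns correlate strongly with the original sources (up to sign and ordering, which ICA cannot determine).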