CMU CS 10701 - Reducing Data Dimension

Reducing Data Dimension
Machine Learning 10-701, November 2005
Tom M. Mitchell, Carnegie Mellon University

Required reading:
• Bishop, chapters 3.6, 8.6
Recommended reading:
• Wall et al., 2003

Outline
• Feature selection
  – Single-feature scoring criteria
  – Search strategies
• Unsupervised dimension reduction using all features
  – Principal Components Analysis
  – Singular Value Decomposition
  – Independent Components Analysis
• Supervised dimension reduction
  – Fisher Linear Discriminant
  – Hidden layers of neural networks

Dimensionality Reduction: Why?
• Learning a target function from data where some features are irrelevant - reduce variance, improve accuracy
• Wish to visualize high-dimensional data
• Sometimes have data whose "intrinsic" dimensionality is smaller than the number of features used to describe it - recover the intrinsic dimension

Supervised Feature Selection
Problem: We wish to learn f: X → Y, where X = <X1, ..., XN>, but we suspect that not all Xi are relevant.
Approach: Preprocess the data to select only a subset of the Xi
• Score each feature, or subsets of features - how?
• Search for a useful subset of features to represent the data - how?

Scoring Individual Features Xi
Common scoring methods:
• Training or cross-validated accuracy of single-feature classifiers fi: Xi → Y
• Estimated mutual information between Xi and Y:
    I(Xi, Y) = Σ_x Σ_y P(Xi=x, Y=y) log [ P(Xi=x, Y=y) / (P(Xi=x) P(Y=y)) ]
• χ² statistic to measure independence between Xi and Y
• Domain-specific criteria
  – Text: score "stop" words ("the", "of", ...) as zero
  – fMRI: score each voxel by a T-test for activation versus rest condition
  – ...

Choosing a Set of Features to Learn f: X → Y
Common methods:
Forward1: Choose the n features with the highest scores.
Forward2:
  – Choose the single highest-scoring feature Xk
  – Rescore all features, conditioned on the set of already-selected features
    • E.g., Score(Xi | Xk) = I(Xi, Y | Xk)
    • E.g., Score(Xi | Xk) = Accuracy(predicting Y from Xi and Xk)
  – Repeat, calculating new scores on each iteration, conditioning on the set of selected features

Choosing a Set of Features
Common methods:
Backward1: Start with all features, delete the n with the lowest scores.
Backward2: Start with all features, score each feature conditioned on the assumption that all others are included. Then:
  – Remove the feature with the lowest (conditioned) score
  – Rescore all features, conditioned on the new, reduced feature set
  – Repeat
(A code sketch of mutual-information scoring and the Forward1 strategy follows below.)
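To make the scoring and search steps concrete, here is a minimal Python/NumPy sketch (not part of the original slides; the names mutual_information and forward1 are illustrative). It estimates I(Xi, Y) from empirical counts for discrete features, then applies the Forward1 strategy of keeping the n highest-scoring features:

    import numpy as np

    def mutual_information(xi, y):
        # Estimate I(Xi, Y) = sum_{x,y} P(x,y) log[ P(x,y) / (P(x) P(y)) ]
        # from empirical counts; xi and y are 1-D arrays of discrete values.
        mi = 0.0
        for vx in np.unique(xi):
            p_x = np.mean(xi == vx)
            for vy in np.unique(y):
                p_y = np.mean(y == vy)
                p_xy = np.mean((xi == vx) & (y == vy))
                if p_xy > 0:
                    mi += p_xy * np.log(p_xy / (p_x * p_y))
        return mi

    def forward1(X, y, n_keep):
        # Forward1: score every feature independently, keep the n_keep highest scoring.
        scores = np.array([mutual_information(X[:, j], y) for j in range(X.shape[1])])
        return np.argsort(scores)[::-1][:n_keep]

    # Toy usage: feature 0 is a copy of the label, feature 1 is pure noise.
    rng = np.random.default_rng(0)
    y = rng.integers(0, 2, size=200)
    X = np.column_stack([y, rng.integers(0, 2, size=200)])
    print(forward1(X, y, n_keep=1))   # -> [0]

Forward2 and the backward variants differ only in that the scores are recomputed at each iteration, conditioned on the features already selected (or still included).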
Feature Selection: Text Classification [Rogati & Yang, 2002]
(Figure: comparison of feature-scoring criteria for text classification; IG = information gain, chi = χ², DF = document frequency. There are approximately 10^5 words in English.)

Impact of Feature Selection on Classification of fMRI Data [Pereira et al., 2005]
(Figure: accuracy classifying the category of the word read by the subject; voxels scored by the p-value of a regression predicting voxel value from the task.)

Summary: Supervised Feature Selection
Approach: Preprocess the data to select only a subset of the Xi
• Score each feature
  – Mutual information, prediction accuracy, ...
• Find a useful subset of features based on their scores
  – Greedy addition of features to the pool
  – Greedy deletion of features from the pool
  – Features considered independently, or in the context of other selected features
Always do feature selection using the training set only (not the test set!)
  – Often use a nested cross-validation loop:
    • Outer loop to get an unbiased estimate of the final classifier's accuracy
    • Inner loop to test the impact of selecting features

Unsupervised Dimensionality Reduction

Unsupervised mapping to lower dimension
Differs from feature selection in two ways:
• Instead of choosing a subset of the features, create new features (dimensions) defined as functions over all features
• Don't consider class labels, just the data points

Principal Components Analysis
• Idea:
  – Given data points in d-dimensional space, project them into a lower-dimensional space while preserving as much information as possible
    • E.g., find the best planar approximation to 3D data
    • E.g., find the best planar approximation to 10^4-dimensional data
  – In particular, choose the projection that minimizes the squared error in reconstructing the original data

PCA: Find Projections to Minimize Reconstruction Error
(Figure: 2D illustration with data axes x1, x2 and principal directions u1, u2.)
Assume the data is a set of N d-dimensional vectors, where the nth vector is x^n. We can represent these in terms of any d orthonormal basis vectors u_1, ..., u_d:
    x^n = x̄ + Σ_{i=1}^{d} z_i^n u_i,   with z_i^n = u_i^T (x^n - x̄)
PCA: given M < d, find the basis vectors u_1, ..., u_M that minimize the reconstruction error
    E_M = Σ_{n=1}^{N} || x^n - x̂^n ||²
where
    x̂^n = x̄ + Σ_{i=1}^{M} z_i^n u_i,   x̄ = (1/N) Σ_{n=1}^{N} x^n (the mean)

PCA (continued)
Note we get zero error if M = d. Therefore, for M < d the error depends only on the discarded directions:
    E_M = N Σ_{i=M+1}^{d} u_i^T Σ u_i
where Σ is the covariance matrix
    Σ = (1/N) Σ_{n=1}^{N} (x^n - x̄)(x^n - x̄)^T
This is minimized when each u_i is an eigenvector of Σ, i.e., when
    Σ u_i = λ_i u_i   (u_i an eigenvector of Σ, λ_i the corresponding eigenvalue)
so that E_M = N Σ_{i=M+1}^{d} λ_i: keep the M directions with the largest eigenvalues and discard those with the smallest.

PCA algorithm 1:
1. X ← create the N x d data matrix, with one row vector x^n per data point
2. X ← subtract the mean x̄ from each row vector x^n in X
3. Σ ← covariance matrix of X
4. Find the eigenvectors and eigenvalues of Σ
5. PCs ← the M eigenvectors with the largest eigenvalues
(A NumPy sketch of this procedure appears after the examples below.)

PCA Example
(Figure: 2D data shown with its mean, first eigenvector, and second eigenvector.)

PCA Example
(Figure: the same data reconstructed using only the first eigenvector, M = 1.)
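The five steps of PCA algorithm 1 map almost line-for-line onto NumPy. This is a minimal sketch (not from the slides; pca_algorithm1 is an illustrative name), workable when d is small enough that the d x d covariance matrix fits in memory:

    import numpy as np

    def pca_algorithm1(X, M):
        # PCA via eigendecomposition of the covariance matrix.
        # X: (N, d) data matrix, one row per data point.
        x_bar = X.mean(axis=0)                     # step 2: the mean vector
        Xc = X - x_bar                             # subtract the mean from each row
        Sigma = (Xc.T @ Xc) / X.shape[0]           # step 3: d x d covariance matrix
        eigvals, eigvecs = np.linalg.eigh(Sigma)   # step 4: eigh, since Sigma is symmetric
        order = np.argsort(eigvals)[::-1]          # sort eigenvalues in decreasing order
        return x_bar, eigvecs[:, order[:M]]        # step 5: the top M eigenvectors (as columns)

    # Usage: project onto the first principal component and reconstruct.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])
    x_bar, U_M = pca_algorithm1(X, M=1)
    Z = (X - x_bar) @ U_M                          # coordinates z_i^n in the PC basis
    X_hat = x_bar + Z @ U_M.T                      # reconstruction using only M=1 component

The slides that follow address exactly the case where forming Σ explicitly in step 3 is too expensive.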
Very Nice When the Initial Dimension Is Not Too Big
What if the data is very high-dimensional? E.g., images (d ≥ 10^4).
Problem:
• The covariance matrix Σ is of size d x d
• d = 10^4 → Σ has 10^8 entries
Singular Value Decomposition (SVD) to the rescue!
• Pretty efficient algorithms are available, including Matlab's SVD
• Some implementations find just the top N eigenvectors

SVD [from Wall et al., 2003]
Data matrix X, one row per data point; decompose X = U S V^T.
• Rows of V^T are unit-length eigenvectors of X^T X.
• If the columns of X have zero mean, then X^T X = c Σ and the eigenvectors are the principal components.
• S is diagonal, with s_k > s_{k+1}; s_k² is the kth largest eigenvalue.
• U S gives the coordinates of the rows of X in the space of principal components.

Singular Value Decomposition
To generate the principal components:
• Subtract the mean from each data point, to create zero-centered data
• Create the matrix X with one row vector per (zero-centered) data point
• Solve the SVD: X = U S V^T
• Output the principal components: the columns of V (= rows of V^T)
  – The eigenvectors in V are sorted from largest to smallest eigenvalue
  – S is diagonal, with s_k² giving the eigenvalue of the kth eigenvector
(See the NumPy sketch at the end of this section.)

Singular Value Decomposition
To project a point (column vector x) into PC coordinates, compute V^T x.
If x_i is the ith row of the data matrix X, then
• (ith row of U S)^T = V^T x_i^T
• (U S)^T = V^T X^T
To project a column vector x onto the M-dimensional principal-components subspace, take just the first M coordinates of V^T x.

Independent Components Analysis
• PCA seeks directions <Y1, ..., YM> in feature space X that minimize reconstruction error
• ICA seeks directions <Y1, ..., YM> that are most statistically independent, i.e., that minimize I(Y), the mutual information between the Yj:
    I(Y) = Σ_{j=1}^{M} H(Yj) - H(Y1, ..., YM)
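As an illustration of the SVD-based procedure and the projection rule above, here is a minimal NumPy sketch (not from the slides; pca_via_svd and project are illustrative names). It relies only on numpy.linalg.svd, which returns U, the singular values s, and V^T:

    import numpy as np

    def pca_via_svd(X, M):
        # PCA via SVD of the zero-centered data matrix (one row per data point).
        # Returns the mean, the top-M principal components (columns of V),
        # and the data's coordinates in PC space (the first M columns of U S).
        x_bar = X.mean(axis=0)
        Xc = X - x_bar                                      # zero-centered data
        U, s, Vt = np.linalg.svd(Xc, full_matrices=False)   # Xc = U @ diag(s) @ Vt
        coords = U[:, :M] * s[:M]                           # first M columns of U S
        return x_bar, Vt[:M].T, coords

    def project(x, x_bar, V_M):
        # Project a single point x onto the M-dim PC subspace:
        # the first M coordinates of V^T (x - mean).
        return V_M.T @ (x - x_bar)

    # Usage
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 50))
    x_bar, V_M, coords = pca_via_svd(X, M=5)
    z = project(X[0], x_bar, V_M)                           # equals coords[0]

Unlike the covariance-based algorithm, this never forms the d x d matrix Σ, which is what makes it practical for high-dimensional data such as images.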

