17. Principal Components Analysis
36-402, Advanced Data Analysis
24 March 2011

Abstract
Exercise: Step through the pca.R file on the class website. Then replicate the analysis of the cars data given below.

Contents
1 Mathematics of Principal Components
  1.1 Minimizing Projection Residuals
  1.2 Maximizing Variance
  1.3 More Geometry; Back to the Residuals
  1.4 Statistical Inference, or Not
2 Example: Cars
3 Latent Semantic Analysis
  3.1 Principal Components of the New York Times
4 PCA for Visualization
5 PCA Cautions

Principal components analysis (PCA) is one of a family of techniques for taking high-dimensional data, and using the dependencies between the variables to represent it in a more tractable, lower-dimensional form, without losing too much information. PCA is one of the simplest and most robust ways of doing such dimensionality reduction. It is also one of the oldest, and has been rediscovered many times in many fields, so it is also known as the Karhunen-Loève transformation, the Hotelling transformation, the method of empirical orthogonal functions, and singular value decomposition.[1] We will call it PCA.

[1] Strictly speaking, singular value decomposition is a matrix algebra trick which is used in the most common algorithm for PCA.

1 Mathematics of Principal Components

We start with $p$-dimensional feature vectors, and want to summarize them by projecting down into a $q$-dimensional subspace. Our summary will be the projection of the original vectors on to $q$ directions, the principal components, which span the subspace.

There are several equivalent ways of deriving the principal components mathematically. The simplest one is by finding the projections which maximize the variance. The first principal component is the direction in feature space along which projections have the largest variance. The second principal component is the direction which maximizes variance among all directions orthogonal to the first. The $k$-th component is the variance-maximizing direction orthogonal to the previous $k-1$ components. There are $p$ principal components in all.

Rather than maximizing variance, it might sound more plausible to look for the projection with the smallest average (mean-squared) distance between the original vectors and their projections on to the principal components; this turns out to be equivalent to maximizing the variance.

Throughout, assume that the data have been "centered", so that every feature has mean 0. If we write the centered data in a matrix $X$, where rows are objects and columns are features, then $X^T X = nV$, where $V$ is the covariance matrix of the data. (You should check that last statement!)
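Before doing the algebra, it is easy to convince yourself of that last statement numerically. The R sketch below uses made-up centered data (all names and the random data are purely illustrative, not from pca.R); note that R's cov() divides by $n-1$, so it is rescaled to the divide-by-$n$ convention used in these notes.

## Numerical check that t(X) %*% X = n * V for centered data
## (illustrative data; cov() divides by n-1, so rescale to the 1/n convention)
set.seed(1)
n <- 200; p <- 4
X <- matrix(rnorm(n * p), nrow = n, ncol = p)
X <- scale(X, center = TRUE, scale = FALSE)   # subtract each column's mean
V <- cov(X) * (n - 1) / n                     # covariance matrix with the 1/n convention
max(abs(t(X) %*% X - n * V))                  # essentially zero, up to rounding error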
1.1 Minimizing Projection Residuals

We'll start by looking for a one-dimensional projection. That is, we have $p$-dimensional feature vectors, and we want to project them on to a line through the origin. We can specify the line by a unit vector along it, $\vec{w}$, and then the projection of a data vector $\vec{x}_i$ on to the line is $\vec{x}_i \cdot \vec{w}$, which is a scalar. (Sanity check: this gives us the right answer when we project on to one of the coordinate axes.) This is the distance of the projection from the origin; the actual coordinate in $p$-dimensional space is $(\vec{x}_i \cdot \vec{w})\vec{w}$. The mean of the projections will be zero, because the mean of the vectors $\vec{x}_i$ is zero:
\[
\frac{1}{n}\sum_{i=1}^{n} (\vec{x}_i \cdot \vec{w})\vec{w} = \left(\left(\frac{1}{n}\sum_{i=1}^{n} \vec{x}_i\right) \cdot \vec{w}\right)\vec{w} \tag{1}
\]
If we try to use our projected or image vectors instead of our original vectors, there will be some error, because (in general) the images do not coincide with the original vectors. (When do they coincide?) The difference is the error or residual of the projection. How big is it? For any one vector, say $\vec{x}_i$, it's
\begin{align*}
\|\vec{x}_i - (\vec{w}\cdot\vec{x}_i)\vec{w}\|^2 &= \|\vec{x}_i\|^2 - 2(\vec{w}\cdot\vec{x}_i)(\vec{w}\cdot\vec{x}_i) + \|\vec{w}\|^2 \tag{2}\\
&= \|\vec{x}_i\|^2 - 2(\vec{w}\cdot\vec{x}_i)^2 + 1 \tag{3}
\end{align*}
using the fact that $\vec{w}$ is a unit vector, so $\|\vec{w}\|^2 = 1$. (This is the same trick used to compute distance matrices in the solution to the first homework; it's really just the Pythagorean theorem.) Add those residuals up across all the vectors:
\begin{align*}
RSS(\vec{w}) &= \sum_{i=1}^{n} \left(\|\vec{x}_i\|^2 - 2(\vec{w}\cdot\vec{x}_i)^2 + 1\right) \tag{4}\\
&= \left(n + \sum_{i=1}^{n}\|\vec{x}_i\|^2\right) - 2\sum_{i=1}^{n}(\vec{w}\cdot\vec{x}_i)^2 \tag{5}
\end{align*}
The term in the big parenthesis doesn't depend on $\vec{w}$, so it doesn't matter for trying to minimize the residual sum of squares. To make RSS small, what we must do is make the second sum big, i.e., we want to maximize
\[
\sum_{i=1}^{n}(\vec{w}\cdot\vec{x}_i)^2 \tag{6}
\]
Equivalently, since $n$ doesn't depend on $\vec{w}$, we want to maximize
\[
\frac{1}{n}\sum_{i=1}^{n}(\vec{w}\cdot\vec{x}_i)^2 \tag{7}
\]
which we can see is the sample mean of $(\vec{w}\cdot\vec{x}_i)^2$. The mean of a square is always equal to the square of the mean plus the variance:
\[
\frac{1}{n}\sum_{i=1}^{n}(\vec{w}\cdot\vec{x}_i)^2 = \left(\frac{1}{n}\sum_{i=1}^{n}\vec{x}_i\cdot\vec{w}\right)^2 + \mathrm{Var}\left[\vec{w}\cdot\vec{x}_i\right] \tag{8}
\]
Since we've just seen that the mean of the projections is zero, minimizing the residual sum of squares turns out to be equivalent to maximizing the variance of the projections.

(Of course in general we don't want to project on to just one vector, but on to multiple principal components. If those components are orthogonal and have the unit vectors $\vec{w}_1, \vec{w}_2, \ldots, \vec{w}_k$, then the image of $\vec{x}_i$ is its projection into the space spanned by these vectors,
\[
\sum_{j=1}^{k} (\vec{x}_i \cdot \vec{w}_j)\vec{w}_j \tag{9}
\]
The mean of the projection on to each component is still zero. If we go through the same algebra for the residual sum of squares, it turns out that the cross-terms between different components all cancel out, and we are left with trying to maximize the sum of the variances of the projections on to the components. Exercise: Do this algebra.)

1.2 Maximizing Variance

Accordingly, let's maximize the variance! Writing out all the summations grows tedious, so let's do our algebra in matrix form. If we stack our $n$ data vectors into an $n \times p$ matrix, $X$, then the projections are given by $Xw$, which is an $n \times 1$ matrix. The variance is
\begin{align*}
\sigma^2_{\vec{w}} &= \frac{1}{n}\sum_{i} (\vec{x}_i \cdot \vec{w})^2 \tag{10}\\
&= \frac{1}{n}(Xw)^T(Xw) \tag{11}\\
&= \frac{1}{n} w^T X^T X w \tag{12}\\
&= w^T \frac{X^T X}{n} w \tag{13}\\
&= w^T V w \tag{14}
\end{align*}
We want to choose a unit vector $\vec{w}$ so as to maximize $\sigma^2_{\vec{w}}$. To do this, we need to make sure that we only look at unit vectors; that is, we need to constrain the maximization. The constraint is that $\vec{w} \cdot \vec{w} = 1$, or $w^T w = 1$. This needs a brief excursion into constrained optimization.

We start with a function $f(w)$ that we want to maximize.
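Equation 14 says the projection variance is the quadratic form $w^T V w$, and the constrained optimization worked out next shows that the leading eigenvector of $V$ maximizes it over unit vectors. As a sketch of what that means in practice, the short R fragment below (again with made-up, illustrative data and names; nothing here comes from pca.R) compares the projection variance of a random unit vector with that of the leading eigenvector, and checks that this eigenvector matches, up to sign, the first column of prcomp()'s rotation matrix.

## sigma^2_w = t(w) %*% V %*% w as a function of the direction w;
## the leading eigenvector of V maximizes it over unit vectors, and
## matches (up to sign) the first PC direction returned by prcomp()
set.seed(2)
n <- 200; p <- 4
X <- scale(matrix(rnorm(n * p), n, p), center = TRUE, scale = FALSE)
V <- t(X) %*% X / n
proj.var <- function(w) drop(t(w) %*% V %*% w)
w.rand <- rnorm(p); w.rand <- w.rand / sqrt(sum(w.rand^2))  # a random unit vector
w.max  <- eigen(V)$vectors[, 1]                             # leading eigenvector of V
proj.var(w.rand) <= proj.var(w.max)                         # TRUE
max(abs(abs(w.max) - abs(prcomp(X)$rotation[, 1])))         # ~ 0: same direction, up to sign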

