Principal Components Mathematics Example Interpretation 36 350 Data Mining 18 September 2009 Reading Section 3 6 in the textbook Contents 1 Mathematics of Principal Components 1 1 Minimizing Projection Residuals 1 2 Maximizing Variance 1 3 More Geometry Back to the Residuals 1 2 3 5 2 Example Cars 2 1 A Recipe 6 8 3 PCA Cautions 10 At the end of the last lecture I set as our goal to find ways of reducing the dimensionality of our data by means of linear projections and of choosing projections which in some sense respect the structure of the data I further asserted that there was one way of doing this which was far more useful and important than others called principal components analysis where respecting structure means preserving variance This lecture will explain that explain how to do PCA show an example and describe some of the issues that come up in interpreting the results PCA has been rediscovered many times in many fields so it is also known as the Karhunen Loe ve transformation the Hotelling transformation the method of empirical orthogonal functions and singular value decomposition1 We will call it PCA 1 Mathematics of Principal Components We start with p dimensional feature vectors and want to summarize them by projecting down into a q dimensional subspace Our summary will be the pro1 Strictly speaking singular value decomposition is a matrix algebra trick which is used in the most common algorithm for PCA 1 jection of the original vectors on to q directions the principal components which span the sub space There are several equivalent ways of deriving the principal components mathematically The simplest one is by finding the projections which maximize the variance The first principal component is the direction in feature space along which projections have the largest variance The second principal component is the direction which maximizes variance among all directions orthogonal to the first The k th component is the variance maximizing direction orthogonal to the previous k 1 components There are p principal components in all Rather than maximizing variance it might sound more plausible to look for the projection with the smallest average mean squared distance between the original vectors and their projections on to the principal components this turns out to be equivalent to maximizing the variance Throughout assume that the data have been centered so that every feature has mean 0 If we write the centered data in a matrix X where rows are objects and columns are features then XT X nV where V is the covariance matrix of the data You should check that last statement 1 1 Minimizing Projection Residuals We ll start by looking for a one dimensional projection That is we have pdimensional feature vectors and we want to project them on to a line through the origin We can specify the line by a unit vector along it w and then the projection of a data vector x i on to the line is x i w which is a scalar Sanity check this gives us the right answer when we project on to one of the coordinate axes This is the distance of the projection from the origin the actual coordinate in p dimensional space is x i w w The mean of the projections will be zero because the mean of the vectors x i is zero n n 1X 1X x i w w xi w w 1 n i 1 n i 1 If we try to use our projected or image vectors instead of our original vectors there will be some error because in general the images do not coincide with the original vectors When do they coincide The difference is the error or residual of the projection How big is it For any one vector say x i it s 2 kx i w x i wk kx i k2 2 w x i w x i kwk 2 2 2 kx i k 2 w x i 1 2 3 This is the same trick used to compute distance matrices in the solution to the first homework it s really just the Pythagorean theorem Add those residuals up across all the vectors RSS w n X 2 kx i k2 2 w x i 1 i 1 2 4 n X n 2 kx i k 2 i 1 n X 2 w x i 5 i 1 The term in the big parenthesis doesn t depend on w so it doesn t matter for trying to minimize the residual sum of squares To make RSS small what we must do is make the second sum big i e we want to maximize n X 2 w x i 6 i 1 Equivalently since n doesn t depend on w we want to maximize n 1X 2 w x i n i 1 7 2 which we can see is the sample mean of w x i The mean of a square is always equal to the square of the mean plus the variance 2 n n 1X 1X 2 w x i x i w Var w x i 8 n i 1 n i 1 Since we ve just seen that the mean of the projections is zero minimizing the residual sum of squares turns out to be equivalent to maximizing the variance of the projections Of course in general we don t want to project on to just one vector but on to multiple principal components If those components are orthogonal and have the unit vectors w 1 w 2 w k then the image of xi is its projection into the space spanned by these vectors k X x i w j w j 9 j 1 The mean of the projection on to each component is still zero If we go through the same algebra for the residual sum of squares it turns out that the crossterms between different components all cancel out and we are left with trying to maximize the sum of the variances of the projections on to the components Exercise Do this algebra 1 2 Maximizing Variance Accordingly let s maximize the variance Writing out all the summations grows tedious so let s do our algebra in matrix form If we stack our n data vectors into an n p matrix X then the projections are given by Xw which is an n 1 matrix The variance is 1X 2 2 w x i w 10 n i 3 1 T Xw Xw n 1 T T w X Xw n XT X w wT n wT Vw 11 12 13 14 2 We want to chose a unit vector w so as to maximize w To do this we need to make sure that we only look at unit vectors we need to constrain the maximization The constraint is that w w 1 or wT w 1 This needs a brief excursion into constrained optimization We start with a function f w that we want to maximize Here that function is wT V w We also have an equality constraint g w c Here g w wT …
View Full Document
Unlocking...