Mathematics and Interpretation of Principal Components 36 350 Data Mining 24 September 2008 1 Mathematics of Principal Components There are several ways of deriving the principal components mathematically The simplest one as I mentioned last time is by finding the projection which maximizes the variance It might sound more plausible to look for the projection with the smallest average mean squared distance between the original vectors and their projections on to the principal components this turns out to be equivalent to maximizing the variance Throughout assume that the data have been centered so that every feature has mean 0 If we write the standardized data in a matrix X where rows are objects and columns are features then X T X nV where V is the covariance matrix of the data You should check that last statement 1 1 Minimizing Projection Residuals We ll start by looking for a one dimensional projection That is we have pdimensional feature vectors and we want to project them on to a line through the origin We can specify the line by a unit vector along it w and then the projection of a data vector x i on to the line is x i w which is a scalar Sanity check this gives us the right answer when we project on to one of the coordinate axes This is the distance of the projection from the origin the actual coordinate in p dimensional space is x i w w The mean of the projections will be zero because the mean of the vectors x i is zero n n 1X 1X x i w w xi w w 1 n i 1 n i 1 If we try to use our projected or image vectors instead of our original vectors there will be some error because in general the images do not coincide with the original vectors When do they coincide The difference is the error or residual of the projection How big is it For any one vector say x i it s 2 kx i w x i wk kx i k2 2 w x i w x i kwk 2 2 2 kx i k 2 w x i 1 1 2 3 This is the same trick used to compute distance matrices in the solution to the first homework it s really just the Pythagorean theorem Add those residuals up across all the vectors RSS w n X 2 kx i k2 2 w x i 1 4 i 1 n n X 2 2 kx i k n X 2 w x i 5 i 1 i 1 The term in the big parenthesis doesn t depend on w so it doesn t matter for trying to minimize the residual sum of squares To make RSS small what we must do is make the term we subtract from it big i e we want to maximize n X 2 w x i i 1 Equivalently since n doesn t depend on w we want to maximize n 1X 2 w x i n i 1 2 which we can see is the sample mean of w x i The mean of a square is always equal to the square of the mean plus the variance n 1X 2 w x i n i 1 1 x i w n 2 Var w x i 6 Since we ve just seen that the mean of the projections is zero minimizing the residual sum of squares turns out to be equivalent to maximizing the variance of the projections Of course in general we don t want to project on to just one vector but on to multiple principal components If those components are orthogonal and have the unit vectors w 1 w 2 w k then the image of xi is its projection into the space spanned by these vectors k X x i w j w j j 1 The mean of the projection on to each component is still zero If we go through the same algebra for the residual sum of squares it turns out that the crossterms between different components all cancel out and we are left with trying to maximize the sum of the variances of the projections on to the components Exercise Do this algebra 2 1 2 Maximizing Variance Accordingly let s maximize the variance Writing out all the summations grows tedious so let s do our algebra in matrix form If we stack our n data vectors into an n p matrix X then the projections are given by Xw which is an n 1 matrix The variance is 2 w 1X 2 x i w n i 1 T Xw Xw n 1 T T w X Xw n XT X w wT n wT Vw 7 8 9 10 11 2 We want to chose a unit vector w so as to maximize w To do this we need to make sure that we only look at unit vectors we need to constrain the maximization The constraint is that w w 1 or wT w 1 This needs a brief excursion into constrained optimization We start with a function f w that we want to maximize Here that function is wT V w We also have an equality constraint g w c Here g w wT w and c 1 We re arrange the constraint equation so its righthand side is zero g w c 0 We now add an extra variable to the problem the Lagrange multiplier and consider u w f w g w c This is our new objective function so we differentiate with respect to both arguments and set the derivatives equal to zero u w u 0 f g w w 12 0 g w c 13 That is maximizing with respect to gives us back our constraint equation g w c At the same time when we have the constraint satisfied our new objective function is the same as the old one If we had more than one constraint we would just need more Lagrange multipliers 1 For our projection problem u u w Vw wT Vw wT w 1 2Vw 2 w 0 w 14 15 16 1 To learn more about Lagrange multipliers read Boas 1983 or more compactly Klein 2001 3 Thus desired vector w is an eigenvector of the covariance matrix V and the maximizing vector will be the one associated with the largest eigenvalue This is good news because finding eigenvectors is something which can be done comparatively rapidly see Principles of Data Mining p 81 and because eigenvectors have many nice mathematical properties V is a p p matrix so it will have p different eigenvectors V is a covariance matrix so it is symmetric and linear algebra tells us that the eigenvectors must be orthogonal to one another The second principal component remember is the direction with the most variance which is orthogonal to the first principal component Thus the second principal compoent will be the eigenvector of V corresponding to the second largest eigenvalue and so on Because it is orthogonal to the first eigenvector their …
View Full Document
Unlocking...