Principal Component Analysis (CS498)

Today's lecture
• Adaptive feature extraction
• Principal Component Analysis
  – How, why, when, which

A dual goal
• Find a good representation
  – The features part
• Reduce redundancy in the data
  – A side effect of "proper" features

Example case
• Describe this input

What about now?

A "good feature"
• "Simplifies" the explanation of the input
  – Represents repeating patterns
  – When defined, makes the input simpler
• How do we define these abstract qualities?
  – On to the math …

Linear features
• Z = WX
  – X: input matrix (dimensions × samples)
  – W: feature matrix (features × dimensions)
  – Z: weight matrix (features × samples)

A 2D case
• Matrix representation of the data, and a scatter plot of the same data (axes x1, x2)
• Z = WX:
\[ \begin{bmatrix} \mathbf{z}_1^T \\ \mathbf{z}_2^T \end{bmatrix} = \begin{bmatrix} \mathbf{w}_1^T \\ \mathbf{w}_2^T \end{bmatrix} \begin{bmatrix} \mathbf{x}_1^T \\ \mathbf{x}_2^T \end{bmatrix} \]

Defining a goal
• Desirable features
  – Give "simple" weights
  – Avoid feature similarity
• How do we define these?

One way to proceed
• "Simple weights"
  – Minimize the relation between the two dimensions
  – Decorrelate: $\mathbf{z}_1^T \mathbf{z}_2 = 0$
• "Feature similarity"
  – Same thing!
  – Decorrelate: $\mathbf{w}_1^T \mathbf{w}_2 = 0$

Diagonalizing the covariance
• Covariance matrix:
\[ \mathrm{Cov}(\mathbf{z}_1,\mathbf{z}_2) = \frac{1}{N}\begin{bmatrix} \mathbf{z}_1^T\mathbf{z}_1 & \mathbf{z}_1^T\mathbf{z}_2 \\ \mathbf{z}_2^T\mathbf{z}_1 & \mathbf{z}_2^T\mathbf{z}_2 \end{bmatrix} \]
• Diagonalizing the covariance suppresses cross-dimensional co-activity
  – if z1 is high, z2 won't be, etc.
\[ \mathrm{Cov}(\mathbf{z}_1,\mathbf{z}_2) = \frac{1}{N}\begin{bmatrix} \mathbf{z}_1^T\mathbf{z}_1 & \mathbf{z}_1^T\mathbf{z}_2 \\ \mathbf{z}_2^T\mathbf{z}_1 & \mathbf{z}_2^T\mathbf{z}_2 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} = I \]

Problem definition
• For a given input X
• Find a feature matrix W
• So that the weights decorrelate:
\[ (WX)(WX)^T = N\,I \;\Rightarrow\; ZZ^T = N\,I \]

How do we solve this?
• Any ideas?
\[ (WX)(WX)^T = N\,I \]

Solving for diagonalization
\[ (WX)(WX)^T = N\,I \;\Rightarrow\; WXX^TW^T = N\,I \;\Rightarrow\; W\,\mathrm{Cov}(X)\,W^T = I \]

Solving for diagonalization
• Covariance matrices are symmetric and positive semi-definite
  – They therefore have orthogonal eigenvectors and real, non-negative eigenvalues
  – and are factorizable by: $U^T A U = \Lambda$
  – where U has the eigenvectors of A in its columns
  – and Λ = diag(λi), where the λi are the eigenvalues of A

Solving for diagonalization
• The solution is a function of the eigenvectors U and eigenvalues Λ of Cov(X):
\[ W\,\mathrm{Cov}(X)\,W^T = I \;\Rightarrow\; W = \begin{bmatrix} \sqrt{\lambda_1} & 0 \\ 0 & \sqrt{\lambda_2} \end{bmatrix}^{-1} U^T = \Lambda^{-1/2}\,U^T \]

So what does it do?
• Input data covariance:
\[ \mathrm{Cov}(X) \approx \begin{bmatrix} 14.9 & 0.05 \\ 0.05 & 1.08 \end{bmatrix} \]
• Extracted feature matrix:
\[ W \approx \begin{bmatrix} 0.12 & -30.4 \\ -8.17 & -0.03 \end{bmatrix} / \sqrt{N} \]
• Weights covariance:
\[ \mathrm{Cov}(WX) = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \]

Another solution
• This is not the only solution to the problem
• Consider this one:
\[ (WX)(WX)^T = N\,I \;\Rightarrow\; WXX^TW^T = N\,I \;\Rightarrow\; W = (XX^T)^{-1/2} \]
  – which can be computed as
\[ W = US^{-1/2}V^T, \qquad [U,S,V] = \mathrm{SVD}(XX^T) \]
• Similar, but out of scope for now

Decorrelation in pictures
• An implicit Gaussian assumption
  – N-D data has N directions of variance
[Figure: scatter plot of the input data]

Undoing the variance
• The decorrelating matrix W contains two vectors that normalize the input's variance
[Figure: input data with the variance-normalizing vectors]

Resulting transform
• The input gets scaled to a well-behaved Gaussian with unit variance in all dimensions
[Figure: input data and transformed data (feature weights)]

A more complex case
• Correlation between the two dimensions
  – We still find the directions of maximal variance
  – But we also rotate in addition to scaling
[Figure: input data and transformed data (feature weights)]

One more detail
• So far we considered zero-mean inputs
  – The transforming operation was a rotation
• If the input mean is not zero, bad things happen!
  – Make sure that your data is zero-mean!
[Figure: input data and transformed data (feature weights)]
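The whitening recipe derived above (eigendecompose Cov(X), then apply W = Λ^{-1/2}Uᵀ) is short enough to try in code. Below is a minimal NumPy sketch of that idea; the toy data and all variable names are illustrative assumptions, not from the lecture:

```python
import numpy as np

# Toy zero-mean 2-D data: 1000 samples with very different variance per dimension
# (dimensions x samples, matching the convention for X used in the slides).
rng = np.random.default_rng(0)
X = rng.normal(size=(2, 1000)) * np.array([[4.0], [1.0]])
X = X - X.mean(axis=1, keepdims=True)          # PCA assumes zero-mean input

N = X.shape[1]
cov_X = (X @ X.T) / N                          # Cov(X), a symmetric 2x2 matrix

# Eigendecomposition: Cov(X) = U diag(lam) U^T
lam, U = np.linalg.eigh(cov_X)

# Decorrelating (whitening) matrix: W = Lambda^(-1/2) U^T
W = np.diag(1.0 / np.sqrt(lam)) @ U.T

Z = W @ X                                      # feature weights
print(np.round((Z @ Z.T) / N, 3))              # ~ identity: decorrelated, unit variance
```

`np.linalg.eigh` is used rather than `eig` because the covariance matrix is symmetric, which guarantees real eigenvalues and orthogonal eigenvectors.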
Principal Component Analysis
• This transform is known as PCA
  – The features are the principal components
    • They are orthogonal to each other
    • And they produce orthogonal (white) weights
  – Major tool in statistics
    • Removes dependencies from multivariate data
• Also known as the KLT
  – Karhunen-Loève transform

A better way to compute PCA
• The Singular Value Decomposition way:
\[ [U,S,V] = \mathrm{SVD}(A) \;\Rightarrow\; A = USV^T \]
• Relationship to the eigendecomposition
  – In our case (covariance input A), U and S will hold the eigenvectors/eigenvalues of A
• Why the SVD?
  – More stable, more robust, fancy extensions

PCA through the SVD
• [Diagram: SVD of the input matrix (dimensions × samples) = feature matrix × √eigenvalue matrix × weight matrix]
• [Diagram: SVD of the input covariance (dimensions × dimensions) = feature matrix × eigenvalue matrix × weight matrix]

Dimensionality reduction
• PCA is great for high-dimensional data
• It allows us to perform dimensionality reduction
  – Helps us find relevant structure in the data
  – Helps us throw away things that won't matter

A simple example
• Two very correlated dimensions
  – e.g. size and weight of fruit
  – One effective variable
• The PCA matrix here is:
\[ W = \begin{bmatrix} -0.2 & -0.13 \\ -13.7 & 28.2 \end{bmatrix} \]
  – Large variance between the two components: about two orders of magnitude

A simple example
• The second principal component needs to be super-boosted to whiten the weights
  – maybe it is useless?
• Keep only the high-variance components
  – Throw away components with minor contributions

What is the number of dimensions?
• If the input was M-dimensional, how many dimensions do we keep?
  – No solid answer (estimators exist, but they are flaky)
• Look at the singular/eigenvalues
  – They show the variance of each component; at some point it becomes small

Example
• Eigenvalues of 1200-dimensional video data
  – Little variance after component 30
  – We don't need to keep the rest of the data
[Figure: eigenvalue magnitude vs. component index]

So where are the features?
• We strayed off-subject
  – What …
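To make the SVD route and the keep-only-high-variance idea concrete, here is a small NumPy sketch; the synthetic data, the cutoff k = 3, and the variable names are illustrative assumptions, not part of the lecture:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic high-dimensional data that really lives in a few directions:
# 50 observed dimensions, 500 samples, driven by 3 latent variables plus noise.
latent = rng.normal(size=(3, 500))
mixing = rng.normal(size=(50, 3))
X = mixing @ latent + 0.05 * rng.normal(size=(50, 500))   # dimensions x samples
X = X - X.mean(axis=1, keepdims=True)                     # zero-mean, as required

# SVD of the data matrix: the columns of U are the principal components,
# and S**2 / N gives the variance of each component (eigenvalues of Cov(X)).
U, S, Vt = np.linalg.svd(X, full_matrices=False)
variances = S**2 / X.shape[1]
print(np.round(variances[:6], 3))              # variance collapses after the first 3

# Dimensionality reduction: keep only the k high-variance components.
k = 3
W_reduced = U[:, :k].T                         # k x dimensions feature matrix
Z = W_reduced @ X                              # k x samples reduced representation
```

In practice the cutoff k is chosen by inspecting the singular values (a scree plot), exactly as the 1200-dimensional video example suggests.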