CS731 Spring 2011 Advanced Artificial Intelligence
Dimensionality Reduction
Lecturer: Xiaojin Zhu [email protected]

Contents: Principal Component Analysis (PCA); The Variance Preservation View; The Minimum Reconstruction Error View; Classical Multidimensional Scaling (MDS); Isomap; Locally Linear Embedding (LLE); Laplacian Eigenmaps.

Let $x_1, \ldots, x_n \in \mathbb{R}^D$. It is convenient to assume that the points are centered, $\sum_i x_i = 0$ (one can always center the data by subtracting the sample mean). We want to represent these points in some lower dimensional space $\mathbb{R}^d$ where typically $d \ll D$.

1 Principal Component Analysis (PCA)

PCA can be justified in several ways.

1.1 The Variance Preservation View

Let's consider a projection onto a line going through the origin. Such a line can be specified by a vector $w \in \mathbb{R}^D$. The projection of $x$ is
\[
\frac{w^\top x}{\|w\|}. \qquad (1)
\]
For simplicity, let us consider $w$ with unit length. The variance of the projected dataset is
\[
\frac{1}{n} \sum_{i=1}^n (w^\top x_i)^2 = w^\top S w, \qquad (2)
\]
where
\[
S = \frac{1}{n} \sum_i x_i x_i^\top \qquad (3)
\]
is the sample covariance matrix, since we assume the dataset is centered. The goal of PCA (in this 1D case) is to find the $w$ that maximizes the variance, in the hope that it maximally preserves the distinction among points. This leads to the following optimization problem:
\[
\max_w \; w^\top S w \qquad (4)
\]
\[
\text{s.t. } \|w\| = 1. \qquad (5)
\]
Let's solve it by forming the Lagrangian
\[
w^\top S w + \lambda (1 - w^\top w). \qquad (6)
\]
The gradient w.r.t. $w$ is
\[
\nabla = 2 S w - 2 \lambda w. \qquad (7)
\]
Setting it to zero, we find that
\[
S w = \lambda w, \qquad (8)
\]
i.e., the desired direction $w$ is an eigenvector of $S$! But which one? Recalling that the projected variance is
\[
w^\top S w = w^\top \lambda w = \lambda, \qquad (9)
\]
we see that we want $\lambda$ to be the largest eigenvalue of $S$ and $w$ the corresponding eigenvector. In other words, let $\lambda_1, \ldots, \lambda_D$ be the eigenvalues of $S$ in non-increasing order, and $\phi_1, \ldots, \phi_D$ the corresponding eigenvectors. Then $\phi_1$ is the maximum variance preserving direction, and the resulting variance is simply $\lambda_1$. This is PCA with $d = 1$: a $D$-dimensional point $x$ is projected to a scalar $\phi_1^\top x$. Note that when $S$'s top eigenvalue has multiplicity larger than one, i.e., $\lambda_1 = \lambda_2$, then PCA is not unique: any unit vector in $\mathrm{span}(\phi_1, \phi_2)$ can be the PCA direction.

If we want $d > 1$, it can be shown that we want to project $x$ onto the first $d$ eigenvectors:
\[
x \mapsto (\phi_1^\top x, \ldots, \phi_d^\top x)^\top. \qquad (10)
\]
Recall that one can view $\phi_1, \ldots, \phi_D$ as the $D$ major-to-minor axes of the ellipsoid represented by the sample covariance matrix (NB: this does not assume that the underlying distribution is Gaussian). Clearly, if $d = D$ then $\phi_1, \ldots, \phi_D$ is a basis for $\mathbb{R}^D$, and the PCA projection amounts to a rotation of the coordinate system (aligning it with the eigenvectors) without any loss of information.

1.2 The Minimum Reconstruction Error View

Using any orthonormal basis $u_1, \ldots, u_D$, a training point $x_i$ (recall it has been centered) can be written as
\[
x_i = \sum_{j=1}^D \alpha_{ij} u_j \qquad (11)
\]
where
\[
\alpha_{ij} = u_j^\top x_i. \qquad (12)
\]
Consider the $d$-term approximation to $x_i$:
\[
\hat{x}_i = \sum_{j=1}^d \alpha_{ij} u_j. \qquad (13)
\]
We want the approximation error to be small for all training points:
\[
\frac{1}{n} \sum_{i=1}^n \|\hat{x}_i - x_i\|^2 = \frac{1}{n} \sum_{i=1}^n \Big\| \sum_{j=d+1}^D \alpha_{ij} u_j \Big\|^2 = \frac{1}{n} \sum_{i=1}^n \sum_{j=d+1}^D \alpha_{ij}^2 \qquad (14)
\]
\[
= \frac{1}{n} \sum_{i=1}^n \sum_{j=d+1}^D u_j^\top x_i x_i^\top u_j = \sum_{j=d+1}^D u_j^\top S u_j. \qquad (15)
\]
If $d = D - 1$, i.e., we need to remove a single dimension, it is easy to see that $u_D = \phi_D$, because $\phi_D^\top S \phi_D = \lambda_D$ is the smallest value of $u^\top S u$ among all unit vectors $u$. Similarly, the other dimensions to remove are, in turn, the eigenvectors corresponding to the smallest eigenvalues.

Finally, it should be pointed out that PCA is an unsupervised technique. It could go terribly wrong as a preprocessing step for classification, e.g., when the decision boundary is orthogonal to the smallest eigenvector.
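To make the two views above concrete, here is a minimal numerical sketch (not part of the original notes): it forms the sample covariance matrix of Eq. (3), projects onto its top eigenvectors as in Eq. (10), and checks that the reconstruction error of Eq. (15) equals the sum of the discarded eigenvalues. The toy data, the variable names, and the use of NumPy are assumptions of this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: n points in D dimensions (illustrative, not from the notes).
n, D, d = 200, 5, 2
X = rng.normal(size=(n, D)) @ rng.normal(size=(D, D))  # correlated features
X = X - X.mean(axis=0)                                 # center the data

# Sample covariance matrix S = (1/n) sum_i x_i x_i^T, as in Eq. (3).
S = X.T @ X / n

# Eigen-decomposition; sort eigenvalues in non-increasing order.
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
lam, phi = eigvals[order], eigvecs[:, order]

# Variance view: project onto the top-d eigenvectors, Eq. (10).
Z = X @ phi[:, :d]
print("projected variances:", Z.var(axis=0), "top eigenvalues:", lam[:d])

# Reconstruction view: the mean squared error of the d-term approximation
# equals the sum of the discarded eigenvalues, Eq. (15).
X_hat = Z @ phi[:, :d].T
mse = np.mean(np.sum((X_hat - X) ** 2, axis=1))
print("reconstruction error:", mse, "sum of discarded eigenvalues:", lam[d:].sum())
```

Setting d = D in this sketch makes the projection a pure rotation, and the printed reconstruction error drops to (numerically) zero, matching the remark at the end of Section 1.1.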
2 Classical Multidimensional Scaling (MDS)

MDS is not a dimensionality reduction method, but rather an embedding method. We have $n$ items but we do not know their feature representation. Instead, we are given their pairwise squared distances
\[
\|x_i - x_j\|^2. \qquad (16)
\]
Our goal is to embed these items in $\mathbb{R}^d$, i.e., to find a feature representation $x_1, \ldots, x_n$ that satisfies the given squared distances.

We introduce the $n \times n$ centering matrix
\[
P = I_{n \times n} - \frac{1}{n} U \qquad (17)
\]
where $U$ is the $n \times n$ all-1 matrix. This matrix is called the centering matrix because when it is applied to an $n \times d$ matrix $X$ whose rows are data points, the result
\[
P X = [x_i - \mu] \qquad (18)
\]
subtracts $\mu$, the mean of $X$, from each row. In addition,
\[
P \mathbf{1} = 0, \qquad (19)
\]
where $\mathbf{1}$ here is the all-1 vector. Now consider
\[
P X X^\top P = (P X)(P X)^\top. \qquad (20)
\]
We see that the $ij$-th entry is $(x_i - \mu)^\top (x_j - \mu)$, i.e., the inner product between the centered $i$-th and $j$-th items. Furthermore, with this result,
\[
P [\|x_i - x_j\|^2] P = P [x_i^\top x_i + x_j^\top x_j - 2 x_i^\top x_j] P \qquad (21)
\]
\[
= P [x_i^\top x_i] P + P [x_j^\top x_j] P - 2 P [x_i^\top x_j] P \qquad (22)
\]
\[
= -2 P [x_i^\top x_j] P \qquad (23)
\]
\[
= -2 [(x_i - \mu)^\top (x_j - \mu)], \qquad (24)
\]
where we used the fact $P [a_i]_{n \times n} P = P [a_i]_{n \times 1} \mathbf{1}_{1 \times n} P = P [a_i]_{n \times 1} \mathbf{0}^\top = 0$. This suggests that given a squared distance matrix $[\|x_i - x_j\|^2]$, the matrix
\[
\bar{A} = -\frac{1}{2} P [\|x_i - x_j\|^2] P \qquad (26)
\]
gives us the centered inner product matrix $[(x_i - \mu)^\top (x_j - \mu)]$.

Theorem 1. Consider the class of symmetric matrices $A \in S^n$ such that $A_{ij} \ge 0$ and $A_{ii} = 0$, $\forall i, j$. Then $\bar{A} = -\frac{1}{2} P A P$ is positive semi-definite if and only if $A$ is a squared distance matrix. The minimal embedding dimension $d$ is the rank of $\bar{A}$.

Now that we have $\bar{A}$, we need to actually find an embedding. To this end, we perform the eigen-decomposition
\[
\bar{A} = \sum_{k=1}^n \lambda_k \phi_k \phi_k^\top, \qquad (27)
\]
where the $\lambda_k$ are sorted in decreasing order and the first $d$ of them are nonzero. Now consider the $n \times d$ matrix
\[
Y = [\phi_1 | \cdots | \phi_d] \Lambda^{1/2} \qquad (28)
\]
where $\Lambda^{1/2}$ is the $d \times d$ diagonal matrix whose $k$-th diagonal element is $\sqrt{\lambda_k}$. It is easy to see that
\[
\bar{A} = Y Y^\top. \qquad (29)
\]
We take the rows of $Y$ to be the embedding of the $n$ items.

The minimum embedding dimension $d$ is the dimension for which there is no distortion. One can ask for an even smaller dimensional embedding $d' < d$, where one simply uses the top $d'$ eigenvectors/eigenvalues in (28).

Classical MDS has been generalized to non-metric MDS, where the input is not squared distances but rather some general notion of dissimilarities. Ordinal MDS is an example of non-metric MDS where the input is the ranking of the pairwise distances, but not the distances themselves.

3 Isomap

PCA and MDS both assume that the data lives in a Euclidean (sub)space. Isomap assumes that the data lives on a low dimensional manifold embedded in a Euclidean space. In this case, we are given the feature representation $x_1, \ldots, x_n \in \mathbb{R}^D$. But instead of computing the Euclidean distances
\[
\|x_i - x_j\|^2, \qquad (30)
\]
we should rather use the geodesic distance along the manifold.
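The preview ends at this point. As a hedged illustration of the classical MDS recipe from Section 2 (which, in the standard Isomap algorithm, would be applied to approximated geodesic distances such as graph shortest-path distances rather than Euclidean ones; that step is not shown in this preview), here is a minimal sketch. The function name, the toy data, and the use of NumPy are assumptions of this illustration, not part of the notes.

```python
import numpy as np

def classical_mds(sq_dist, d):
    """Embed items in R^d from an n x n squared-distance matrix, following
    Eqs. (17), (26)-(28): double-center, eigendecompose, and scale the
    top-d eigenvectors by the square roots of their eigenvalues."""
    n = sq_dist.shape[0]
    P = np.eye(n) - np.ones((n, n)) / n          # centering matrix, Eq. (17)
    A_bar = -0.5 * P @ sq_dist @ P               # centered Gram matrix, Eq. (26)
    eigvals, eigvecs = np.linalg.eigh(A_bar)
    order = np.argsort(eigvals)[::-1]            # eigenvalues in decreasing order
    lam, phi = eigvals[order[:d]], eigvecs[:, order[:d]]
    return phi * np.sqrt(np.maximum(lam, 0.0))   # Y = [phi_1|...|phi_d] Lambda^{1/2}, Eq. (28)

# Usage: recover a 2-D configuration (up to rotation/translation) from its distances.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))
sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
Y = classical_mds(sq_dist, d=2)
recovered = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
print(np.allclose(recovered, sq_dist))           # True: pairwise distances are preserved
```

Because the toy points are genuinely 2-dimensional, the recovered pairwise distances match the inputs exactly, consistent with Theorem 1's statement that the minimal embedding dimension is the rank of $\bar{A}$.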

