DRAFT — a final version will be posted shortly

COS 424: Interacting with Data
Lecturer: Dave Blei
Lecture #22
Scribe: CJ Bell and Ana Pop
April 24, 2008

1 Principal Component Analysis (PCA)

PCA is one method for reducing the number of features used to represent data. The benefits of this dimensionality reduction include a simpler representation of the data, reduced memory use, and faster classification. We accomplish this by projecting data from a higher dimension onto a lower-dimensional manifold such that the error incurred by reconstructing the data in the higher dimension is minimized.

Figure 1: A plot of x's in 2D (R^p) space and an example 1D (R^q) space (dashed line) onto which the data can be projected.

An example of this is given by Figure 1, where 2D data can be projected onto the 1D space represented by the dashed line with reasonably small error. In general, we want to map x ∈ R^p to x̃ ∈ R^q, where q < p.

1.1 Idea Behind PCA

• Draw some lower-dimensional space. In Figure 1, this is the dashed line.
• Represent each data point by its projection onto that space, i.e. along the line.

In Figure 1, the free parameter is the slope. We draw the line so as to minimize the distances from the points to the line. Note that in regression the distance to the line is vertical, not perpendicular, as shown in Figure 2.

1.2 PCA Interpretation

PCA can be interpreted in three different ways:

• Maximize the variance of the projection along each component.
• Minimize the reconstruction error, i.e. the squared distance between the original data and its "estimate".
• Maximum likelihood estimation of a parameter in a probabilistic model.

Figure 2: Projecting x to R^1. The vertical line is the regression mapping and the perpendicular line is the PCA projection.

1.3 PCA Details

We are given data points x_1, x_2, ..., x_N ∈ R^p. We define the reconstruction of data from R^q back into R^p as

    $f(\lambda) = \mu + V_q \lambda$    (1)

In this rank-q model, the mean is µ ∈ R^p and V_q is a p × q matrix whose q columns are orthogonal unit vectors. Finally, λ ∈ R^q is the low-dimensional representation of a data point.

Creating a good low-dimensional representation of the data requires that we carefully choose µ, V_q, and the λ_n. One way to do this is by minimizing the reconstruction error

    $\min_{\mu,\, \lambda_{1 \ldots N},\, V_q} \; \sum_{n=1}^{N} \| x_n - \mu - V_q \lambda_n \|^2$    (2)

In Equation 2, µ is the intercept of the lower-dimensional space in the higher-dimensional space. Next, λ_n is the R^q coordinate of x_n, i.e. where x_n lies on the line in Figure 1. The plane in R^p is defined by V_q and µ. Last, the quantity inside the sum is the distance between the original data point and the reconstruction of its low-dimensional representation in the original space (the L2 distance between the original data and its projection).

Figure 3: Projecting R^3 data to R^2.

We next present an example using images of the handwritten digit three, shown in Figure 4. Each image is a data point in R^256, where each pixel is a dimension that varies between white and black. When reducing to two dimensions, the low-dimensional coordinates of an image are λ_1 and λ_2, and we can reconstruct an R^256 data point from an R^2 point using

    $\hat{f}(\lambda) = \bar{x} + \lambda_1 \cdot \hat{v}_1 + \lambda_2 \cdot \hat{v}_2$    (3)

where x̄ is the mean image and v̂_1, v̂_2 are the first two principal components (themselves vectors in R^256, i.e. images).

Figure 4: 130 samples of handwritten threes in a variety of writing styles.

Substituting the optimal µ and λ_n back into Equation 2 gives the equivalent objective

    $\min_{V_q} \; \sum_{n=1}^{N} \| x_n - V_q V_q^T x_n \|^2$    (4)

so fitting PCA (Equation 4) is the same as minimizing the reconstruction error of Equation 2; it is also equivalent to maximizing the variance of the projections. The optimal intercept is the sample mean, µ* = x̄. Without loss of generality, assume µ* = 0 and replace each x_n with x_n − µ*. The projection of x_n onto V_q is then λ_n = V_q^T x_n. What remains is to find the principal components V_q: the directions onto which we project the data so that it can be reconstructed with minimum error.
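To make Equations 1–4 concrete, here is a minimal NumPy sketch (not from the lecture; the random data, the shapes N = 130 and p = 256, and the choice q = 2 are illustrative). It centers the data, obtains V_q from the SVD that the next subsection introduces, computes the coordinates λ_n = V_q^T x_n, and evaluates the reconstruction error of Equation 2.

    import numpy as np

    # Toy data: N points in R^p (values are random, purely illustrative).
    rng = np.random.default_rng(0)
    N, p, q = 130, 256, 2
    X = rng.normal(size=(N, p))

    # The optimal intercept is the sample mean mu* = x-bar; center the data.
    mu = X.mean(axis=0)
    Xc = X - mu

    # SVD of the centered data matrix (see Section 1.4).
    # The top q right singular vectors give the p x q matrix V_q.
    U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
    Vq = Vt[:q].T                    # columns are orthonormal directions

    # Low-dimensional coordinates lambda_n = V_q^T x_n, one row per point.
    Lam = Xc @ Vq                    # N x q

    # Reconstruction f(lambda) = mu + V_q lambda, and the error of Equation 2.
    X_hat = mu + Lam @ Vq.T
    print(np.sum((X - X_hat) ** 2))

Among all p × q matrices with orthonormal columns, this V_q minimizes Equation 4; that claim is justified by the SVD discussion that follows.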
We obtain the solution for V_q using the singular value decomposition (SVD).

1.4 SVD

Consider

    $X = U D V^T$    (5)

where

• X is the N × p (centered) data matrix.
• U is an N × p orthogonal matrix; its columns are orthonormal and hence linearly independent.
• D is a positive p × p diagonal matrix with d_11 ≥ d_22 ≥ ... ≥ d_pp.
• V^T is a p × p orthogonal matrix.

We can represent each data point as a linear combination of the rows v̄_j of V^T:

    $x_1 = u_{11} d_1 \bar{v}_1 + u_{12} d_2 \bar{v}_2 + \ldots + u_{1p} d_p \bar{v}_p$
    $x_2 = u_{21} d_1 \bar{v}_1 + u_{22} d_2 \bar{v}_2 + \ldots + u_{2p} d_p \bar{v}_p$
    ...

Figure 5:

We can embed x into an orthogonal space via rotation: V rotates, D scales, and in the coordinates of U the data is sphered (a perfect circle).

PCA cuts off the SVD at q dimensions. In Figure 6, U is a low-dimensional representation. The handwritten-threes example of Section 1.3 (Equation 3) uses q = 2 and N = 130. D reflects the variance, so we cut off the dimensions with low variance (remember d_11 ≥ d_22 ≥ ...). Lastly, the columns of V are the principal components.

Figure 6:

2 Factor Analysis

Figure 7: The hidden variable is the point on the hyperplane (line). The observed value is x, which is dependent on the hidden variable.

Factor analysis is another dimensionality-reduction technique. The low-dimensional representation of the higher-dimensional space is a hyperplane drawn through the high-dimensional space. For each data point, we select a point on the hyperplane and then draw the observed data from a Gaussian around that point. The drawn points are observable, whereas the points on the hyperplane are latent.

2.1 Multivariate Gaussian

This is a Gaussian for p-vectors, characterized by

• a mean µ, which is a p-vector, and
• a covariance matrix Σ, which is p × p, positive-definite, and symmetric, with entries

    $\sigma_{ij} = E[x_i x_j] - E[x_i] E[x_j]$    (6)

Some observations:

• A data point x = <x_1, ..., x_p> is a vector, which is also a random variable.
• If x_i and x_j are independent, then σ_ij = 0.
• σ_ij is the covariance between components i and j.
• σ_ii = E[x_i^2] − E[x_i]^2 = var(x_i).

The density function is over vectors of length p:

    $p(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)$    (7)

Note that |Σ| = det(Σ) and that (x − µ) is a p-vector.

We now define contours of constant probability density via f(x) = ½ (x − µ)^T Σ^{-1} (x − µ). These are the points where the multivariate Gaussian density is the same; they lie on an ellipse.

2.2 MLE

The maximum likelihood estimate of the mean, µ̂, is a p-vector, and Σ̂ captures how often two components are large together or small together (for positive covariances):

    $\hat{\mu} = \frac{1}{N} \sum_{n=1}^{N} x_n$    (8)

    $\hat{\Sigma} = \frac{1}{N} \sum_{n=1}^{N} (x_n - \hat{\mu})(x_n - \hat{\mu})^T$    (9)

2.3 Factor Analysis

The parameters are Λ, a p × q matrix whose columns span a q-dimensional subspace of the p-dimensional space, and Ψ, a diagonal and positive p × p matrix.

For each data point,

• z_n ∼ N_q(0⃗, I): the latent point has mean 0 and each component is an independent Gaussian;
• x_n ∼ N_p(Λz_n, Ψ): the observation has mean Λz_n and diagonal covariance matrix Ψ.

In PCA, x = z_1 λ_1 + z_2 λ_2 + ... + z_q λ_q.
In FA, x ∼ N(z_1 λ_1 + ... + z_q λ_q, Ψ), where λ_1, ..., λ_q are the columns of Λ.

Fit FA with maximum likelihood; because the z_n are latent, this is typically done with the EM algorithm.
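As a small illustration of the factor analysis generative model just described, here is a minimal NumPy sketch (not the lecture's code; the sizes N = 1000, p = 5, q = 2, the random loadings Lambda, and the noise variances in Psi are all made-up choices). It draws z_n ∼ N_q(0, I) and then x_n ∼ N_p(Λz_n, Ψ).

    import numpy as np

    # Illustrative sizes and parameters (not taken from the lecture).
    rng = np.random.default_rng(0)
    N, p, q = 1000, 5, 2
    Lambda = rng.normal(size=(p, q))            # p x q factor loadings
    Psi = np.diag(rng.uniform(0.1, 0.5, p))     # diagonal, positive p x p covariance

    # Generative process: z_n ~ N_q(0, I), then x_n ~ N_p(Lambda z_n, Psi).
    Z = rng.normal(size=(N, q))                 # latent points on the hyperplane
    noise = rng.multivariate_normal(np.zeros(p), Psi, size=N)
    X = Z @ Lambda.T + noise                    # observed data, one row per x_n

    # Integrating out z gives the marginal x ~ N_p(0, Lambda Lambda^T + Psi),
    # so the sample covariance should be close to that matrix.
    print(np.cov(X, rowvar=False).round(2))
    print((Lambda @ Lambda.T + Psi).round(2))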

