Machine Learning 10-701/15-781, Fall 2006
Dimensionality Reduction II: Factor Analysis and Metric Learning
Eric Xing
Lecture 20, November 22, 2006
Reading: Chap., C.B book

Outline
- Probabilistic PCA (brief)
- Factor analysis (in somewhat more detail)
- ICA (will skip)
- Distance metric learning from very little side information (a very cool method)

Recap of PCA
- Popular dimensionality reduction technique.
- Project data onto directions of greatest variation:
    u^* = \arg\max_u \frac{1}{m}\sum_{i=1}^m (u^T y_i)^2
        = \arg\max_u u^T \Big(\frac{1}{m}\sum_{i=1}^m y_i y_i^T\Big) u
        = \arg\max_u u^T \mathrm{Cov}(y)\, u
- The low-dimensional representation is the projection onto the top q eigenvectors,
    x_i = U_q^T y_i = [u_1^T y_i, \ldots, u_q^T y_i]^T \in R^q,  with reconstruction  y_i \approx U_q x_i.
- Consequence: the x_i are uncorrelated, so their covariance matrix \frac{1}{m}\sum_i x_i x_i^T is diagonal.
- Truncation error: keeping only the top q terms of \Sigma_y = \sum_{k=1}^K \lambda_k u_k u_k^T leaves an error governed by the discarded eigenvalues \lambda_{q+1}, \ldots, \lambda_K.

Recap of PCA (cont.)
- Popular dimensionality reduction technique.
- Project data onto directions of greatest variation.
- Useful tool for visualising patterns and clusters within the data set, but:
  - it needs centering, and
  - it does not explicitly model data noise.

Probabilistic Interpretation
[Figure: graphical models with a continuous X node and a continuous Y node, drawn by analogy to regression.]

Probabilistic PCA
- PCA can be cast as a probabilistic model:
    y_n = \mu + W x_n + \epsilon_n,   \epsilon_n \sim N(0, \sigma^2 I),
  with q-dimensional latent variables x_n \sim N(0, I).
- The resulting data distribution is y_n \sim N(\mu, W W^T + \sigma^2 I).
- The maximum likelihood solution is equivalent to PCA:
    \mu_{ML} = \frac{1}{N}\sum_n y_n,   W_{ML} = U_q (\Lambda_q - \sigma^2 I)^{1/2},
  where the diagonal matrix \Lambda_q contains the top q sample-covariance eigenvalues and U_q contains the associated eigenvectors.
  (Tipping and Bishop, J. Royal Stat. Soc., 61(3):611, 1999)

Factor analysis
- An unsupervised linear regression model (graphical model X -> Y):
    p(x) = N(x; 0, I)
    p(y | x) = N(y; \mu + \Lambda x, \Psi)
  where \Lambda is called the factor loading matrix and \Psi is diagonal.
- Geometric interpretation: to generate data, first generate a point within the low-dimensional manifold, then add noise; the coordinates of the point are the components of the latent variable.

Relationship between PCA and FA
- Probabilistic PCA is equivalent to factor analysis with equal noise in every dimension, i.e. \epsilon_n drawn from an isotropic Gaussian N(0, \sigma^2 I).
- In factor analysis, \epsilon_n \sim N(0, \Psi) for a diagonal covariance matrix \Psi.
- An iterative algorithm (e.g. EM) is required to find the parameters if the precisions are not known in advance.

Marginal data distribution
- A marginal Gaussian (e.g. p(x)) times a conditional Gaussian (e.g. p(y | x)) is a joint Gaussian.
- Any marginal (e.g. p(y)) of a joint Gaussian (e.g. p(x, y)) is also a Gaussian.
- Since the marginal is Gaussian, we can determine it by just computing its mean and variance (assume the noise is uncorrelated with the data):
    E[y] = E[\mu + \Lambda x + w],   where w \sim N(0, \Psi)
         = \mu + \Lambda E[x] + E[w] = \mu + 0 + 0 = \mu
    Var[y] = E[(y - \mu)(y - \mu)^T]
           = E[(\Lambda x + w)(\Lambda x + w)^T]
           = \Lambda E[x x^T] \Lambda^T + E[w w^T]
           = \Lambda \Lambda^T + \Psi

FA = constrained-covariance Gaussian
- Marginal density for factor analysis (y is p-dimensional, x is k-dimensional):
    p(y) = N(y; \mu, \Lambda \Lambda^T + \Psi)
- So the effective covariance is the low-rank outer product of two long skinny matrices plus a diagonal matrix.
- In other words, factor analysis is just a constrained Gaussian model. If \Psi were not diagonal, then we could model any Gaussian and the model would be pointless.
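To make the closed-form ML solution for probabilistic PCA (and the "constrained Gaussian" view of its covariance) concrete, here is a minimal NumPy sketch. It is not from the lecture: the function name `fit_ppca` and the synthetic data are illustrative, and the noise-variance estimate \sigma^2_{ML} (the average of the discarded eigenvalues) is taken from the cited Tipping and Bishop paper rather than from the slide text.

```python
import numpy as np

def fit_ppca(Y, q):
    """Closed-form ML fit of probabilistic PCA: y ~ N(mu, W W^T + sigma^2 I).

    Y : (N, p) data matrix, one observation per row.
    q : number of latent dimensions.
    Returns (mu, W, sigma2) with W of shape (p, q).
    """
    N, p = Y.shape
    mu = Y.mean(axis=0)                       # mu_ML = (1/N) sum_n y_n
    S = np.cov(Y, rowvar=False, bias=True)    # sample covariance, (p, p)

    # Eigen-decomposition of S, sorted by decreasing eigenvalue.
    evals, evecs = np.linalg.eigh(S)
    order = np.argsort(evals)[::-1]
    evals, evecs = evals[order], evecs[:, order]

    # sigma^2_ML: average of the discarded eigenvalues (Tipping & Bishop, 1999).
    sigma2 = evals[q:].mean()

    # W_ML = U_q (Lambda_q - sigma^2 I)^{1/2}: scale each kept eigenvector.
    W = evecs[:, :q] * np.sqrt(evals[:q] - sigma2)
    return mu, W, sigma2

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic data: a 2-dimensional latent space embedded in 5 dimensions plus noise.
    W_true = rng.normal(size=(5, 2))
    X = rng.normal(size=(1000, 2))
    Y = X @ W_true.T + 0.1 * rng.normal(size=(1000, 5))

    mu, W, sigma2 = fit_ppca(Y, q=2)
    # The implied marginal covariance W W^T + sigma^2 I is a low-rank-plus-spherical
    # (constrained) Gaussian fit to the sample covariance.
    C = W @ W.T + sigma2 * np.eye(5)
    print(np.round(C - np.cov(Y, rowvar=False, bias=True), 3))
```

Replacing the spherical noise \sigma^2 I with a diagonal \Psi gives the factor analysis model above, for which no closed-form ML solution exists and an iterative algorithm such as EM is used instead.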
Review: a primer on the multivariate Gaussian
- Multivariate Gaussian density:
    p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\big(-\tfrac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\big)
- A joint Gaussian over a partitioned vector x = [x_1; x_2]:
    p([x_1; x_2]) = N([x_1; x_2]; [\mu_1; \mu_2], [\Sigma_{11}, \Sigma_{12}; \Sigma_{21}, \Sigma_{22}])
- How do we write down p(x_1), p(x_1 | x_2) or p(x_2 | x_1) using the block elements in \mu and \Sigma?
- Formulas to remember:
    marginal:     p(x_2) = N(x_2; \mu_2^m, V_2^m),  with  \mu_2^m = \mu_2,  V_2^m = \Sigma_{22}
    conditional:  p(x_1 | x_2) = N(x_1; \mu_{1|2}, V_{1|2}),  with
                  \mu_{1|2} = \mu_1 + \Sigma_{12} \Sigma_{22}^{-1} (x_2 - \mu_2)
                  V_{1|2} = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}

Review: some matrix algebra
- Trace (definition): tr[A] = \sum_i a_{ii}
- Cyclical permutations: tr[ABC] = tr[CAB] = tr[BCA]
- Derivatives of traces:
    \frac{\partial}{\partial A} tr[BA] = B^T
    \frac{\partial}{\partial A} tr[x^T A x] = \frac{\partial}{\partial A} tr[x x^T A] = x x^T
- Determinants and derivatives:
    \frac{\partial}{\partial A} \log|A| = A^{-T}

FA joint distribution
- Model:
    p(x) = N(x; 0, I)
    p(y | x) = N(y; \mu + \Lambda x, \Psi)
- Covariance between x and y (assume the noise is uncorrelated with the data and the latent variables):
    Cov(x, y) = E[(x - 0)(y - \mu)^T] = E[x (\Lambda x + w)^T]
              = E[x x^T] \Lambda^T + E[x w^T] = \Lambda^T
- Hence the joint distribution of x and y is
    p([x; y]) = N([x; y]; [0; \mu], [I, \Lambda^T; \Lambda, \Lambda \Lambda^T + \Psi])

Inference in factor analysis
- Apply the Gaussian conditioning formulas to the joint distribution derived above, where
    \Sigma_{11} = I,  \Sigma_{12} = \Lambda^T,  \Sigma_{21} = \Lambda,  \Sigma_{22} = \Lambda \Lambda^T + \Psi.
  We can now derive the posterior of the latent variable x given the observation y,
  p(x | y) = N(x; m_{1|2}, V_{1|2}), where
    m_{1|2} = \mu_1 + \Sigma_{12} \Sigma_{22}^{-1} (y - \mu_2) = \Lambda^T (\Lambda \Lambda^T + \Psi)^{-1} (y - \mu)
    V_{1|2} = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} = I - \Lambda^T (\Lambda \Lambda^T + \Psi)^{-1} \Lambda
- Applying the matrix inversion lemma:
    V_{1|2} = (I + \Lambda^T \Psi^{-1} \Lambda)^{-1}
    m_{1|2} = V_{1|2} \Lambda^T \Psi^{-1} (y - \mu)
  Here we only need to invert a matrix of size |x| \times |x| instead of |y| \times |y|.

Geometric interpretation: inference is a linear projection
- The posterior is p(x | y) = N(x; m_{1|2}, V_{1|2}), with
    V_{1|2} = (I + \Lambda^T \Psi^{-1} \Lambda)^{-1},   m_{1|2} = V_{1|2} \Lambda^T \Psi^{-1} (y - \mu).
- The posterior covariance does not depend on the observed data y.
- Computing the posterior mean is just a linear operation.

EM for factor analysis
- Incomplete-data log likelihood function (the marginal density of y):
    l(\theta; D) = -\frac{N}{2} \log|\Lambda \Lambda^T + \Psi| - \frac{1}{2} \sum_{n=1}^N (y_n - \mu)^T (\Lambda \Lambda^T + \Psi)^{-1} (y_n - \mu)
                 = -\frac{N}{2} \log|\Lambda \Lambda^T + \Psi| - \frac{1}{2} tr\big[(\Lambda \Lambda^T + \Psi)^{-1} S\big],
    where S = \sum_n (y_n - \mu)(y_n - \mu)^T.
- Estimating \mu is trivial: \mu_{ML} = \frac{1}{N} \sum_n y_n.
- The parameters \Lambda and \Psi are coupled nonlinearly in the log likelihood.
- Complete log likelihood:
    l_c(\theta; D) = \sum_n \log p(x_n, y_n) = \sum_n \big[\log p(x_n) + \log p(y_n | x_n)\big]
                   = -\frac{N}{2} \log|I| - \frac{1}{2} \sum_n x_n^T x_n - \frac{N}{2} \log|\Psi| - \frac{1}{2} \sum_n (y_n - …
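As a small illustration of the inference step above, which also serves as the E-step of the EM procedure the last slide begins to derive, here is a minimal NumPy sketch. It is our own example rather than lecture code (the name `fa_posterior` is hypothetical); it evaluates the posterior p(x | y) = N(m_{1|2}, V_{1|2}) in the matrix-inversion-lemma form, so only a k x k matrix is inverted.

```python
import numpy as np

def fa_posterior(y, mu, Lam, Psi_diag):
    """Posterior p(x | y) = N(m, V) for the factor analysis model
       x ~ N(0, I),  y | x ~ N(mu + Lam x, Psi)  with diagonal Psi.

    Uses the matrix-inversion-lemma form from the slides:
       V = (I + Lam^T Psi^{-1} Lam)^{-1},   m = V Lam^T Psi^{-1} (y - mu),
    so only a k x k matrix is inverted (k = latent dimension), not p x p.
    """
    p, k = Lam.shape
    Psi_inv = 1.0 / Psi_diag                      # inverse of a diagonal matrix, shape (p,)
    LtPinv = Lam.T * Psi_inv                      # Lam^T Psi^{-1}, shape (k, p)
    V = np.linalg.inv(np.eye(k) + LtPinv @ Lam)   # posterior covariance: independent of y
    m = V @ LtPinv @ (y - mu)                     # posterior mean: a linear function of y
    return m, V

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    p, k = 6, 2
    Lam = rng.normal(size=(p, k))
    mu = rng.normal(size=p)
    Psi_diag = rng.uniform(0.1, 0.5, size=p)

    # Draw one observation from the model, then infer its latent coordinates.
    x_true = rng.normal(size=k)
    y = mu + Lam @ x_true + np.sqrt(Psi_diag) * rng.normal(size=p)
    m, V = fa_posterior(y, mu, Lam, Psi_diag)
    print("E[x | y] =", np.round(m, 3), "   true x =", np.round(x_true, 3))
```

In EM for factor analysis, the posterior moments E[x_n | y_n] = m and E[x_n x_n^T | y_n] = V + m m^T computed this way are the expected sufficient statistics used to re-estimate \Lambda and \Psi; the M-step itself falls outside the truncated portion of the slides.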