Eric Xing
Machine Learning 10-701/15-781, Fall 2006
Dimensionality Reduction II: Factor Analysis and Metric Learning
Lecture 20, November 22, 2006
Reading: Chap. ??, C.B. book

Outline
- Probabilistic PCA (brief)
- Factor analysis (in some detail)
- ICA (will skip)
- Distance metric learning from very little side info (a very cool method)

Recap of PCA
- A popular dimensionality reduction technique: project the data onto the directions of greatest variation. The first principal direction is
      u_1 = argmax_{‖u‖=1} (1/m) Σ_{i=1}^m (u^T y_i)^2 = argmax_{‖u‖=1} u^T [ (1/m) Σ_{i=1}^m y_i y_i^T ] u = argmax_{‖u‖=1} u^T Cov(y) u,
  and in general the q-dimensional projection is
      x_i = U_q^T y_i = [u_1^T y_i, ..., u_q^T y_i]^T ∈ R^q.
- Consequence: the components of x_i are uncorrelated, so their covariance matrix is diagonal, diag(γ_1, ..., γ_q).
- Truncation error: keeping only the top q of the K eigenvalue/eigenvector pairs approximates the covariance,
      Cov(y) = Σ_{k=1}^K γ_k u_k u_k^T ≈ Σ_{k=1}^q γ_k u_k u_k^T.
- PCA is a useful tool for visualising patterns and clusters within the data set, but ...
  - it needs centering of the data, and
  - it does not explicitly model data noise.

Probabilistic interpretation?
[Diagram: a two-node graphical model X → Y with both variables continuous, labeled "regression", next to the same X → Y structure labeled "?" — what is the analogous probabilistic model behind PCA?]

Probabilistic PCA
- PCA can be cast as a probabilistic model with q-dimensional latent variables x_n:
      y_n = Λ x_n + µ + ε_n,   x_n ~ N(0, I),   ε_n ~ N(0, σ²I).
- The resulting data distribution is
      y_n ~ N(µ, ΛΛ^T + σ²I).
- The maximum likelihood solution is equivalent to PCA:
      µ_ML = (1/N) Σ_n y_n,   Λ_ML = U_q (Γ_q − σ²I)^{1/2},
  where the diagonal matrix Γ_q contains the top q sample-covariance eigenvalues and U_q contains the associated eigenvectors.
  [Tipping and Bishop, J. Royal Stat. Soc. B 61, 611 (1999)]

Factor analysis
- An unsupervised linear regression model (latent X → observed Y). Λ is called the factor loading matrix, and Ψ is diagonal.
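The closed-form ML solution for probabilistic PCA above can be checked numerically. The following is a minimal sketch assuming NumPy; the function name `ppca_ml` and the fixed (known) noise variance `sigma2` are illustrative choices, not from the lecture:

```python
import numpy as np

def ppca_ml(Y, q, sigma2):
    """ML estimates for probabilistic PCA with known noise variance sigma2.

    Y: (N, p) data matrix. Returns (mu, Lam), Lam = U_q (Gamma_q - sigma2 I)^{1/2}.
    """
    N, p = Y.shape
    mu = Y.mean(axis=0)                      # mu_ML = (1/N) sum_n y_n
    S = (Y - mu).T @ (Y - mu) / N            # sample covariance
    evals, evecs = np.linalg.eigh(S)         # eigenvalues in ascending order
    top = np.argsort(evals)[::-1][:q]        # indices of the top-q eigenvalues
    U_q = evecs[:, top]                      # associated eigenvectors
    Gamma_q = np.diag(evals[top])            # top-q eigenvalues on a diagonal
    Lam = U_q @ np.sqrt(Gamma_q - sigma2 * np.eye(q))
    return mu, Lam
```

As a sanity check on the model y_n ~ N(µ, ΛΛ^T + σ²I): when q = p and every eigenvalue exceeds σ², the reconstruction ΛΛ^T + σ²I recovers the sample covariance exactly.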
- The model is
      p(x) = N(x; 0, I),   p(y|x) = N(y; µ + Λx, Ψ).
- Geometric interpretation: to generate data, first generate a point within the (low-dimensional) manifold, then add noise. The coordinates of the point are the components of the latent variable.

Relationship between PCA and FA
- Probabilistic PCA is equivalent to factor analysis with equal noise in every dimension, i.e., isotropic Gaussian noise ε_n ~ N(0, σ²I).
- In factor analysis the noise has a general diagonal covariance matrix: ε_n ~ N(0, Ψ).
- An iterative algorithm (e.g., EM) is required to find the parameters if the precisions are not known in advance.

Marginal data distribution
- A marginal Gaussian (e.g., p(x)) times a conditional Gaussian (e.g., p(y|x)) is a joint Gaussian.
- Any marginal (e.g., p(y)) of a joint Gaussian (e.g., p(x, y)) is also a Gaussian.
- Since the marginal is Gaussian, we can determine it by just computing its mean and variance. Writing Y = µ + ΛX + W with W ~ N(0, Ψ), and assuming the noise is uncorrelated with the data (E[XW^T] = 0):
      E[Y] = E[µ + ΛX + W] = µ + Λ E[X] + E[W] = µ,
      Var[Y] = E[(Y − µ)(Y − µ)^T] = E[(ΛX + W)(ΛX + W)^T] = Λ E[XX^T] Λ^T + E[WW^T] = ΛΛ^T + Ψ.

FA = constrained-covariance Gaussian
- The marginal density for factor analysis (y is p-dimensional, x is k-dimensional) is
      p(y | θ) = N(y; µ, ΛΛ^T + Ψ).
- So the effective covariance is the low-rank outer product of two long skinny matrices (p × k times k × p) plus a diagonal matrix.
- In other words, factor analysis is just a constrained Gaussian model.
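The marginal-covariance derivation above can be verified by simulation: sample from the FA generative model and compare the empirical covariance of y against ΛΛ^T + Ψ. A minimal sketch assuming NumPy; all dimensions and parameter values below are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(0)
p, k, N = 5, 2, 200_000                      # y is p-dim, x is k-dim

Lam = rng.normal(size=(p, k))                # factor loading matrix (p x k)
psi = rng.uniform(0.1, 0.5, size=p)          # diagonal entries of Psi
mu = rng.normal(size=p)

X = rng.normal(size=(N, k))                  # x_n ~ N(0, I)
W = rng.normal(size=(N, p)) * np.sqrt(psi)   # w_n ~ N(0, Psi), Psi diagonal
Y = mu + X @ Lam.T + W                       # y_n = mu + Lam x_n + w_n

emp_cov = np.cov(Y, rowvar=False)            # empirical covariance of y
model_cov = Lam @ Lam.T + np.diag(psi)       # predicted: Lam Lam^T + Psi
max_err = np.max(np.abs(emp_cov - model_cov))
```

With N large, `max_err` shrinks toward zero, illustrating that the marginal of y is N(µ, ΛΛ^T + Ψ).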
  (If Ψ were not diagonal, we could model any Gaussian and the model would be pointless.)

Review: a primer on the multivariate Gaussian
- Multivariate Gaussian density:
      p(x | µ, Σ) = (2π)^{−n/2} |Σ|^{−1/2} exp{ −(1/2)(x − µ)^T Σ^{−1} (x − µ) }.
- A joint Gaussian over a partitioned vector:
      p(x1, x2 | µ, Σ) = N( [x1; x2] | [µ1; µ2], [Σ11 Σ12; Σ21 Σ22] ).
- How do we write down p(x2), p(x1|x2), or p(x2|x1) using the block elements of µ and Σ?
- Formulas to remember:
      marginal:    p(x2) = N(x2 | m2, V2),  with  m2 = µ2  and  V2 = Σ22;
      conditional: p(x1|x2) = N(x1 | m_{1|2}, V_{1|2}),  with
                   m_{1|2} = µ1 + Σ12 Σ22^{−1} (x2 − µ2),
                   V_{1|2} = Σ11 − Σ12 Σ22^{−1} Σ21.

Review: some matrix algebra
- Trace (of a square matrix):  tr[A] = Σ_i a_ii.
- Cyclical permutations:  tr[ABC] = tr[CAB] = tr[BCA].
- Derivatives:  ∂tr[BA]/∂A = B^T,  and  ∂[x^T A x]/∂A = ∂tr[xx^T A]/∂A = xx^T.
- Determinants:  ∂log|A|/∂A = A^{−T}.

FA joint distribution
- Model:  p(x) = N(x; 0, I),  p(y|x) = N(y; µ + Λx, Ψ);  assume the noise W is uncorrelated with the data and the latent variables.
- Covariance between x and y:
      Cov[X, Y] = E[(X − 0)(Y − µ)^T] = E[X(µ + ΛX + W − µ)^T] = E[XX^T]Λ^T + E[XW^T] = Λ^T.
- Hence the joint distribution of x and y:
      p([x; y]) = N( [x; y] | [0; µ], [I Λ^T; Λ ΛΛ^T + Ψ] ).

Inference in factor analysis
- Apply the Gaussian conditioning formulas to the joint distribution we derived above, where
      Σ11 = I,  Σ12 = Λ^T,  Σ21 = Λ,  Σ22 = ΛΛ^T + Ψ.
  We can now derive the posterior of the latent variable x given an observation y, p(x|y) = N(x; m_{1|2}, V_{1|2}), where
      m_{1|2} = µ1 + Σ12 Σ22^{−1} (y − µ2) = Λ^T (ΛΛ^T + Ψ)^{−1} (y − µ),
      V_{1|2} = Σ11 − Σ12 Σ22^{−1} Σ21 = I − Λ^T (ΛΛ^T + Ψ)^{−1} Λ.
- Applying the matrix inversion lemma, (G + FEF^T)^{−1} = G^{−1} − G^{−1}F(E^{−1} + F^T G^{−1}F)^{−1}F^T G^{−1}, we get
      V_{1|2} = (I + Λ^T Ψ^{−1} Λ)^{−1},
      m_{1|2} = V_{1|2} Λ^T Ψ^{−1} (y − µ).
- Here we only need to invert a matrix of size |x| × |x|, instead of |y| × |y|.

Geometric interpretation: inference is a linear projection
- The posterior is
      p(x|y) = N(x; m_{1|2}, V_{1|2}),  with  V_{1|2} = (I + Λ^T Ψ^{−1} Λ)^{−1}  and  m_{1|2} = V_{1|2} Λ^T Ψ^{−1} (y − µ).
- The posterior covariance does not depend on the observed data y!
- Computing the posterior mean is just a linear operation on y.

EM for factor analysis
- The incomplete-data log likelihood (from the marginal density of y) is given below.
- Estimating µ is trivial (the sample mean), but the parameters Λ and Ψ are coupled nonlinearly in the log likelihood, so we turn to EM.
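Before deriving the EM updates, the two posterior-inference identities from the matrix inversion lemma above can be verified numerically. A minimal sketch assuming NumPy; dimensions and parameter draws are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(1)
p, k = 6, 2                                   # |y| = p, |x| = k

Lam = rng.normal(size=(p, k))                 # factor loading matrix
psi = rng.uniform(0.2, 1.0, size=p)           # diagonal entries of Psi
mu = rng.normal(size=p)
y = rng.normal(size=p)                        # one observation

# Direct Gaussian conditioning: inverts the p x p matrix Lam Lam^T + Psi
S22_inv = np.linalg.inv(Lam @ Lam.T + np.diag(psi))
V_direct = np.eye(k) - Lam.T @ S22_inv @ Lam  # I - Lam^T (Lam Lam^T + Psi)^{-1} Lam
m_direct = Lam.T @ S22_inv @ (y - mu)

# After the matrix inversion lemma: only a k x k inverse is needed
Psi_inv = np.diag(1.0 / psi)
V = np.linalg.inv(np.eye(k) + Lam.T @ Psi_inv @ Lam)   # (I + Lam^T Psi^{-1} Lam)^{-1}
m = V @ Lam.T @ Psi_inv @ (y - mu)
```

Both routes give the same posterior moments; the second only inverts a |x| × |x| matrix (plus the trivially invertible diagonal Ψ), which is the computational point of the slide.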
- Incomplete-data log likelihood:
      ℓ(θ; D) = Σ_n log N(y_n; µ, ΛΛ^T + Ψ)
              = −(N/2) log|ΛΛ^T + Ψ| − (1/2) Σ_n (y_n − µ)^T (ΛΛ^T + Ψ)^{−1} (y_n − µ)
              = −(N/2) log|ΛΛ^T + Ψ| − (N/2) tr[(ΛΛ^T + Ψ)^{−1} S],
  where S = (1/N) Σ_n (y_n − µ)(y_n − µ)^T, and
      µ̂_ML = (1/N) Σ_n y_n.
- Complete-data log likelihood (up to constants):
      ℓ_c(θ; D) = Σ_n log p(x_n, y_n) = Σ_n log p(x_n) + Σ_n log p(y_n | x_n)
                = −(1/2) Σ_n x_n^T x_n − (N/2) log|Ψ| − (1/2) Σ_n (y_n − Λx_n)^T Ψ^{−1} (y_n − Λx_n)
                = −(N/2) tr[S1] − (N/2) log|Ψ| − (N/2) tr[Ψ^{−1} S2],
  where S1 = (1/N) Σ_n x_n x_n^T and S2 = (1/N) Σ_n (y_n − Λx_n)(y_n − Λx_n)^T.

E-step for factor analysis
- Compute the expected complete-data log likelihood ⟨ℓ_c(θ; D)⟩ under the posterior p(x|y):
      ⟨ℓ_c(θ; D)⟩ = −(N/2) tr[⟨S1⟩] − (N/2) log|Ψ| − (N/2) tr[Ψ^{−1} ⟨S2⟩].
- Recall the posterior p(x_n | y_n) derived above: the required expected sufficient statistics are ⟨x_n⟩ and ⟨x_n x_n^T⟩, entering through the terms y_n y_n^T, y_n ⟨x_n⟩^T, ⟨x_n⟩ y_n^T, and ⟨x_n x_n^T⟩.
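The E-step above, combined with the posterior formulas from the inference slide, gives a complete EM loop. This is a minimal sketch assuming NumPy; note that the excerpt stops at the E-step, so the M-step updates below are the standard textbook ones, filled in as an assumption rather than taken from these slides, and the function name `fa_em` is illustrative:

```python
import numpy as np

def fa_em(Y, k, n_iter=50, seed=0):
    """A minimal EM sketch for factor analysis.

    Y: (N, p) data; k: latent dimension. Returns (mu, Lam, psi) with psi the
    diagonal of Psi. The M-step is the standard update, not derived above.
    """
    rng = np.random.default_rng(seed)
    N, p = Y.shape
    mu = Y.mean(axis=0)                        # mu_ML: trivial, estimated once
    Yc = Y - mu                                # work with centered data
    Lam = rng.normal(size=(p, k))              # random initialization
    psi = np.ones(p)
    for _ in range(n_iter):
        # E-step: p(x_n | y_n) = N(m_n, V); V is shared across all n
        Psi_inv = np.diag(1.0 / psi)
        V = np.linalg.inv(np.eye(k) + Lam.T @ Psi_inv @ Lam)
        M = Yc @ Psi_inv @ Lam @ V             # row n is the posterior mean m_n
        Exx = N * V + M.T @ M                  # sum_n <x_n x_n^T>
        # M-step (standard updates, assumed): maximize <l_c> w.r.t. Lam, Psi
        Lam = Yc.T @ M @ np.linalg.inv(Exx)
        psi = np.diag(Yc.T @ Yc - Lam @ (M.T @ Yc)) / N
    return mu, Lam, psi
```

On data generated from the FA model, the fitted ΛΛ^T + Ψ approaches the true marginal covariance (Λ itself is only identifiable up to a rotation of the latent space).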