10-701 Recitation 10: PCA and Topic Models

Latent Space Methods
- PCA and Topic Models are latent space methods.
- They are unsupervised and reduce the data to fewer dimensions.
- The result is easier to visualize and understand.

Principal Component Analysis
- Key idea: find vectors that capture most of the variability in the data X.
- These vectors are the principal components.

Principal Component Analysis
- Say the data X has n rows and p columns, so every row x_i is a p-dimensional data point. We assume X has zero mean.
- What quantity represents the variability of X?
- Answer: the p-by-p covariance matrix X^T X, which equals the sum of outer products sum_i x_i x_i^T.

Principal Component Analysis
- What does it mean for a vector to capture variability in the data?
- Let C = X^T X be the covariance matrix. We want a unit-length vector u that maximizes u^T C u.
- Why do we want this? Intuition: let v = Cu, so we are maximizing u^T v, which is large when the magnitude of v is large and the angle between u and v is small.
- In other words, we want a u such that C makes u longer but does not change its direction.
- u^T C u is maximized when u is the principal eigenvector of C: Cu = λu, where λ is the principal (largest) eigenvalue of C.
- Graphically, the principal eigenvector gives the direction of highest variability in the data.

Principal Component Analysis
- We found the first principal component u_1, the principal eigenvector. How do we find the other principal components?
- Again find a unit-length u that maximizes u^T C u, this time subject to u being perpendicular to u_1. The solution is the second eigenvector u_2.
- Next maximize u^T C u subject to u perpendicular to both u_1 and u_2, which gives the third eigenvector u_3, and so on.

Finding the eigenvectors
- MATLAB code to find the top k eigenvectors:
    [V, D] = eigs(X' * X, k);
- V is p-by-k and contains the top k eigenvectors as its columns.
- D is k-by-k and contains the top k eigenvalues on its diagonal.

Principal Component Analysis
- So far we have talked about eigenvectors of C. What is the connection with a latent space?
- Project the data points onto the 1st PC (the principal eigenvector): notice how little the points move.
- If we pick the top k PCs and project the data X onto them, we get a lower-dimensional latent space representation of the data. Note that k << p, the original data dimensionality.
- By picking the top k PCs, we ensure the latent space distorts the data as little as possible.

Principal Component Analysis
- How do we project the data onto the top k PCs? MATLAB code:
    [V, D] = eigs(X' * X, k);
    W = X * V;
- W is n-by-k and is the projection of X onto the top k PCs.
- W_ij represents how much x_i depends on the j-th PC.

Principal Component Analysis
- The projection W can be thought of as a compressed version of X. In some cases we can even interpret the columns of W as concepts.
- How do we reconstruct the data from W? MATLAB code:
    [V, D] = eigs(X' * X, k);
    W = X * V;
    Y = W * V';
- Y is n-by-p and is the reconstruction of X from the top k PCs.
- If the top k PCs capture most of the data variability, then Y will be similar to X.

Summary of PCA
- PCA projects p-dimensional data X onto a k-dimensional latent space W, where k << p.
- Properties of the latent space W:
  - Captures most of the variability in X.
  - Easier to visualize and understand than X.
  - Takes less storage than X.
  - We can reconstruct X from W with minimal error.
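To tie the snippets above together, here is a minimal MATLAB sketch of the whole PCA recipe on synthetic data. The data and the sizes n, p, k are made up for illustration, and the explicit centering step enforces the zero-mean assumption from the slides.

    % Minimal PCA sketch: center, find top-k PCs, project, reconstruct.
    n = 500; p = 10; k = 2;                 % made-up sizes
    X = randn(n, p) * randn(p, p);          % correlated synthetic data
    Xc = X - repmat(mean(X, 1), n, 1);      % center the data (PCA assumes zero mean)
    [V, D] = eigs(Xc' * Xc, k);             % top k eigenvectors/eigenvalues of the covariance
    W = Xc * V;                             % n-by-k latent representation
    Y = W * V';                             % n-by-p reconstruction from the top k PCs
    err = norm(Xc - Y, 'fro') / norm(Xc, 'fro');
    fprintf('Relative reconstruction error with k = %d PCs: %.3f\n', k, err);

Increasing k toward p drives the reconstruction error to zero, which is the sense in which keeping the top PCs distorts the data as little as possible.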
Topic Models
- Setting: we want to organize documents, which are represented as high-dimensional bags of words.
- Key idea: find K topics (collections of words) so that we can describe documents as combinations of topics.
- Note that topics may overlap.

Topic Models
- A Topic Model is a Bayesian generative model: the observed data depends on hidden variables, which can depend on other hidden variables, and so on.
- It can be depicted graphically as a Bayes Net.
- You already know another generative model: the K-Gaussians Mixture Model.

Topic Models
- The Topic Model generative process describes the model in compact form (writing the topic vocabularies as φ_k, the document-topic vectors as θ_i, and the Dirichlet parameters as α and β):
  - Draw topic vocabularies, k = 1..K: φ_k ~ Dirichlet(β).
  - Draw document-topic vectors, i = 1..N: θ_i ~ Dirichlet(α).
  - For each document i = 1..N, draw words j = 1..M_i:
    - z_ij ~ Multinomial(θ_i)  (the word-topic indicator),
    - w_ij ~ Multinomial(φ_{z_ij})  (the word itself).

Graphical Model Illustration
- [Plate diagram: nodes φ_k, θ_i, z_ij, and w_ij, enclosed in plates of sizes K, N, and M_i, shown alongside the generative process above.]

Reading Plate Notation
- The diagram distinguishes non-random parameters (the Dirichlet parameters), hidden random variables (φ_k, θ_i, z_ij), and observed random variables (the words w_ij).
- A plate is a box around part of the graph: the variables inside the plate are duplicated (K times in this case), and the dependency arrows are duplicated along with them.
- For example, drawing φ_1, φ_2, ..., φ_K as separate nodes is the same as drawing a single node φ_k inside a plate labeled K.

Correspondence with Generative Process
- [A sequence of slides highlights each line of the generative process and the corresponding nodes of the graphical model.]

From Generative Process to Inference
- We just saw the topic model generative process and its corresponding graphical model.
- Given just the parameters α and β, we could generate random data from the topic model. But that is not what we want: we want to represent documents in terms of topics.
- So we need to find the topic vocabularies φ and the document-topic vectors θ.

From Generative Process to Inference
- We find φ and θ via probabilistic inference. In other words, we find the distribution of φ and θ conditioned on the observed words w and the parameters α and β.
- This is the same principle as Viterbi for HMMs and variable elimination for Bayes Nets.
- I won't go into the details here; that's for HW5.

From Generative Process to Inference
- Topic model inference isn't easy: we need to marginalize (sum out) the word-topic indicators z, but there are exponentially …
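For concreteness, here is a minimal MATLAB sketch of running the generative process forward (the "generate random data" direction mentioned above). All sizes are made up, every document gets the same length M, and symmetric Dirichlet(1) priors stand in for the general Dirichlet(α) and Dirichlet(β); a Dirichlet(1,...,1) draw can be obtained by normalizing independent exponential samples, so no toolbox is needed.

    % Forward sampling from the topic model generative process (a sketch).
    K = 3; vocabSize = 20; N = 5; M = 50;   % made-up: topics, vocabulary, documents, words per doc
    phi = zeros(K, vocabSize);              % topic vocabularies; row k is a distribution over words
    for k = 1:K
        g = -log(rand(1, vocabSize));       % exponential draws ...
        phi(k, :) = g / sum(g);             % ... normalized: phi_k ~ Dirichlet(1,...,1)
    end
    docs = zeros(N, M);                     % generated word ids, one row per document
    for i = 1:N
        g = -log(rand(1, K));
        theta = g / sum(g);                 % theta_i ~ Dirichlet(1,...,1): document-topic vector
        for j = 1:M
            z = find(rand <= cumsum(theta), 1);               % z_ij ~ Multinomial(theta_i)
            docs(i, j) = find(rand <= cumsum(phi(z, :)), 1);  % w_ij ~ Multinomial(phi_z)
        end
    end

Inference goes the other way: starting from the observed word ids in docs, recover phi and theta, which is what the inference slides above are about.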