19. Mixture Models, Latent Variables and the EM Algorithm

36-402, Advanced Data Analysis

31 March 2011

Contents

1 Two Routes to Mixture Models
  1.1 From Factor Analysis to Mixture Models
  1.2 From Kernel Density Estimates to Mixture Models
  1.3 Mixture Models
  1.4 Geometry
  1.5 Identifiability
  1.6 Probabilistic Clustering
2 Estimating Parametric Mixture Models
  2.1 More about the EM Algorithm
  2.2 Further Reading on and Applications of EM
  2.3 Topic Models and Probabilistic LSA
3 Non-parametric Mixture Modeling
4 R
5 Exercises

1 Two Routes to Mixture Models

1.1 From Factor Analysis to Mixture Models

In factor analysis, the origin myth is that we have a fairly small number, q, of real variables which happen to be unobserved ("latent"), and the much larger number p of variables we do observe arise as linear combinations of these factors, plus noise. The mythology is that it's possible for us (or for Someone) to continuously adjust the latent variables, and the distribution of observables likewise changes continuously. What if the latent variables are not continuous but ordinal, or even categorical? The natural idea would be that each value of the latent variable would give a different distribution of the observables.

1.2 From Kernel Density Estimates to Mixture Models

We have also previously looked at kernel density estimation, where we approximate the true distribution by sticking a small (weight 1/n) copy of a kernel pdf at each observed data point and adding them up. With enough data, this comes arbitrarily close to any (reasonable) probability density, but it does have some drawbacks. Statistically, it labors under the curse of dimensionality. Computationally, we have to remember all of the data points, which is a lot. We saw similar problems when we looked at fully non-parametric regression, and then saw that both could be ameliorated by using things like additive models, which impose more constraints than, say, unrestricted kernel smoothing. Can we do something like that with density estimation?

Additive modeling for densities is not as common as it is for regression (it's harder to think of times when it would be natural and well-defined [1]), but we can do things to restrict density estimation. For instance, instead of putting a copy of the kernel at every point, we might pick a small number K ≪ n of points, which we feel are somehow typical or representative of the data, and put a copy of the kernel at each one (with weight 1/K). This uses less memory, but it ignores the other data points, and lots of them are probably very similar to those points we're taking as prototypes. The differences between prototypes and many of their neighbors are just matters of chance or noise. Rather than remembering all of those noisy details, why not collapse those data points, and just remember their common distribution? Different regions of the data space will have different shared distributions, but we can just combine them.

[1] Remember that the integral of a probability density over all space must be 1, while the integral of a regression function doesn't have to be anything in particular. If we had an additive density, f(x) = ∑_j f_j(x_j), ensuring normalization is going to be very tricky; we'd need ∑_j ∫ f_j(x_j) dx_1 dx_2 ... dx_p = 1. It would be easier to ensure normalization while making the log-density additive, but that assumes the features are independent of each other.
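The prototype idea in the last paragraph is easy to try out. The following is a minimal R sketch, not from the notes: it compares a full kernel density estimate (one Gaussian kernel per data point, weight 1/n) with a density built from only K = 10 prototype points, each carrying weight 1/K. Using kmeans() centers as the prototypes, and the fixed bandwidth of 0.3, are purely illustrative choices.

set.seed(42)
# n = 1000 points from two clumps, just to have something to estimate
x <- c(rnorm(500, mean = -2, sd = 0.5), rnorm(500, mean = 2, sd = 1))

# Full KDE: a kernel at every observed point, each with weight 1/n
full.kde <- density(x, bw = 0.3)

# Restricted version: K prototype points, each kernel with weight 1/K
K <- 10
protos <- kmeans(x, centers = K)$centers      # K x 1 matrix of prototype locations
grid <- seq(min(x), max(x), length.out = 512)
proto.dens <- rowMeans(sapply(protos, function(m) dnorm(grid, mean = m, sd = 0.3)))

plot(full.kde, main = "Full KDE vs. K = 10 prototype kernels")
lines(grid, proto.dens, lty = 2)
legend("topright", lty = 1:2,
       legend = c("kernel at every point (weight 1/n)",
                  "kernels at K prototypes (weight 1/K)"))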
1.3 Mixture Models

More formally, we say that a distribution f is a mixture of K component distributions f_1, f_2, ... f_K if

    f(x) = ∑_{k=1}^{K} λ_k f_k(x)                                    (1)

with the λ_k being the mixing weights, λ_k > 0, ∑_k λ_k = 1. Eq. 1 is a complete stochastic model, so it gives us a recipe for generating new data points: first pick a distribution, with probabilities given by the mixing weights, and then generate one observation according to that distribution. Symbolically,

    Z ~ Mult(λ_1, λ_2, ... λ_K)                                      (2)
    X | Z ~ f_Z                                                      (3)

where I've introduced the discrete random variable Z which says which component X is drawn from.

I haven't said what kind of distribution the f_k's are. In principle, we could make these completely arbitrary, and we'd still have a perfectly good mixture model. In practice, a lot of effort is given over to parametric mixture models, where the f_k are all from the same parametric family, but with different parameters; for instance they might all be Gaussians with different centers and variances, or all Poisson distributions with different means, or all power laws with different exponents. (It's not strictly necessary that they all be of the same kind.) We'll write the parameter, or parameter vector, of the kth component as θ_k, so the model becomes

    f(x) = ∑_{k=1}^{K} λ_k f(x; θ_k)                                 (4)

The over-all parameter vector of the mixture model is thus θ = (λ_1, λ_2, ... λ_K, θ_1, θ_2, ... θ_K).

Let's consider two extremes. When K = 1, we have a simple parametric distribution, of the usual sort, and density estimation reduces to estimating the parameters, by maximum likelihood or whatever else we feel like. On the other hand, when K = n, the number of observations, we have gone back towards kernel density estimation. If K is fixed as n grows, we still have a parametric model, and avoid the curse of dimensionality, but a mixture of (say) ten Gaussians is more flexible than a single Gaussian, though it may still be the case that the true distribution just can't be written as a ten-Gaussian mixture. So we have our usual bias-variance or accuracy-precision trade-off: using many components in the mixture lets us fit many distributions very accurately, with low approximation error or bias, but means we have more parameters and so we can't fit any one of them as precisely, and there's more variance in our estimates. (A short R sketch of the sampling recipe in Eqs. 2-3 and the density in Eq. 4 appears below.)

1.4 Geometry

Two lectures ago, we looked at principal components
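To make Section 1.3 concrete, here is a minimal R sketch, not part of the notes: rmix() follows the recipe in Eqs. 2-3, first drawing the component label Z from a multinomial with the mixing weights and then drawing X from that component, and dmix() evaluates the mixture density of Eq. 4. The function names, and the three-component Gaussian mixture with these particular weights, means and standard deviations, are made up for illustration.

lambda <- c(0.5, 0.3, 0.2)   # mixing weights lambda_k, must sum to 1
mu     <- c(-2, 0, 3)        # Gaussian component means (part of theta_k)
sigma  <- c(0.5, 1.0, 0.7)   # Gaussian component standard deviations (part of theta_k)

# Eqs. 2-3: draw Z ~ Mult(lambda), then X | Z = k ~ N(mu_k, sigma_k^2)
rmix <- function(n, lambda, mu, sigma) {
  z <- sample(seq_along(lambda), size = n, replace = TRUE, prob = lambda)
  rnorm(n, mean = mu[z], sd = sigma[z])
}

# Eq. 4: f(x) = sum_k lambda_k f(x; theta_k), here with Gaussian components
dmix <- function(x, lambda, mu, sigma) {
  per.comp <- mapply(function(l, m, s) l * dnorm(x, mean = m, sd = s),
                     lambda, mu, sigma)
  rowSums(matrix(per.comp, nrow = length(x)))
}

x <- rmix(1000, lambda, mu, sigma)
hist(x, breaks = 40, freq = FALSE,
     main = "1000 draws from a three-component Gaussian mixture")
curve(dmix(x, lambda, mu, sigma), add = TRUE)   # overlay the true mixture density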

