Dirichlet Process Mixtures: A Gentle Tutorial
Graphical Models – 10-708
Khalid El-Arini
Carnegie Mellon University
November 6th, 2006

Motivation
We are given a data set, and are told that it was generated from a mixture of Gaussians. Unfortunately, no one has any idea how many Gaussians produced the data.

What to do?
- We can guess the number of clusters, run EM for Gaussian mixture models, look at the results, and then try again…
- We can do hierarchical agglomerative clustering, and cut the tree at a visually appealing level…
- We want to cluster the data in a statistically principled manner, without resorting to hacks.

Review: Dirichlet Distribution
Let (θ1, …, θm) be a random vector with θi ≥ 0 and θ1 + … + θm = 1. We write:
  (θ1, …, θm) ~ Dirichlet(α1, …, αm)
- This is a distribution over possible parameter vectors for a multinomial distribution, and is in fact the conjugate prior for the multinomial.
- The Beta distribution is the special case of a Dirichlet for 2 dimensions.
- Samples from the distribution lie in the (m − 1)-dimensional simplex.
- Thus, it is in fact a "distribution over distributions."

Dirichlet Process
A Dirichlet process is also a distribution over distributions. We write:
  G ~ DP(α, G0)
- G0 is a base distribution.
- α is a positive scaling parameter.
- G has the same support as G0.

Dirichlet Process
(Figure: sample draws of G ~ DP(α, G0) for a Gaussian base distribution G0.)

Dirichlet Process
G ~ DP(α, G0)
- G0 is continuous, so the probability that any two samples are equal is precisely zero.
- However, G is a discrete
distribution, made up of a countably infinite number of point masses [Blackwell and MacQueen, 1973].
- Therefore, there is always a non-zero probability of two samples colliding.

Dirichlet Process
(Figure: draws of G ~ DP(α1, G0) and G ~ DP(α2, G0).) The value of α determines how close G is to G0.

Sampling from a DP
  G ~ DP(α, G0)
  Xn | G ~ G for n = 1, …, N (i.i.d.)
Marginalizing out G introduces dependencies between the Xn variables.

Sampling from a DP
Assume we view these variables in a specific order, and are interested in the behavior of Xn given the previous n − 1 observations. Let there be K unique values x1*, …, xK* among those variables, with xk* occurring nk times. Then
  Xn | X1, …, Xn−1 ~ (α / (n − 1 + α)) G0 + Σk (nk / (n − 1 + α)) δ(xk*)

Sampling from a DP
By the chain rule, the joint distribution factors as P(partition) × P(draws). Notice that this formulation of the joint does not depend on the order in which we consider the variables: the Xn are exchangeable. We can arrive at a mixture model by assuming exchangeability and applying De Finetti's theorem (1935).

Chinese Restaurant Process
With K unique values for the variables, we can rewrite the predictive distribution as:
  P(Xn = xk* | X1, …, Xn−1) = nk / (n − 1 + α)
  P(Xn = a new draw from G0 | X1, …, Xn−1) = α / (n − 1 + α)

Chinese Restaurant Process
Consider a restaurant with infinitely many tables, where the Xn's represent the patrons of the restaurant. From the above conditional probability distribution, we can see that a customer is more likely to sit at a table if many people are already sitting there. However, with probability proportional to α, the customer will sit at a new table. This is also known as the "clustering effect," and can be seen in the setting of social clubs [Aldous].

Dirichlet Process Mixture
(Graphical model: α, G0 → G → ηn → yn, with ηn and yn in a plate of size N.)
- G is made up of a countably infinite number of point masses.
- We draw from G N times to get parameters for the different mixture components.
- If the ηn were drawn from, e.g.,
a Gaussian, no two values would be the same; but since they are drawn from a distribution that was itself drawn from a Dirichlet process, we expect a clustering of the ηn.
- The number of unique values for ηn = the number of mixture components.

CRP Mixture
(Figure: the same mixture viewed through the Chinese restaurant process; tables correspond to mixture components, customers to data points.)

Stick Breaking
So far, we've just mentioned properties of a distribution G drawn from a Dirichlet process. In 1994, Sethuraman developed a constructive way of forming G, known as "stick breaking."

Stick Breaking
1. Draw η1* from G0.
2. Draw v1 from Beta(1, α).
3. Set π1 = v1.
4. Draw η2* from G0.
5. Draw v2 from Beta(1, α).
6. Set π2 = v2(1 − v1).
…
In general, πk = vk · (1 − v1) ⋯ (1 − vk−1), and G = Σk πk δ(ηk*).

Formal Definition
Let α be a positive, real-valued scalar, and let G0 be a non-atomic probability distribution over support set A. We say G ~ DP(α, G0) if, for all natural numbers k and all k-partitions {A1, …, Ak} of A,
  (G(A1), …, G(Ak)) ~ Dirichlet(αG0(A1), …, αG0(Ak)).

Inference in a DPM
- EM is generally used for inference in a mixture model, but G is nonparametric, making EM difficult.
- Markov chain Monte Carlo techniques [Neal 2000]
- Variational inference [Blei and Jordan 2006]

Gibbs Sampling [Neal 2000]
Algorithm 1:
- We marginalize out G from our model, and sample each ηn given everything else.
- Define Hi to be the single-observation posterior, i.e., the posterior over η under prior G0 after seeing only yi.
- SLOW TO CONVERGE!

Gibbs Sampling [WAS 22-DAL 19]

Gibbs Sampling [Neal 2000]
Algorithm 2: Equivalently, replace G and the ηn with per-datum cluster assignments cn and a countably infinite collection of cluster parameters ηc, so that yn depends on η(cn). [Grenager 2005]

Gibbs Sampling [Neal 2000]
Algorithm 2 (cont.): We sample from the distribution over an individual cluster assignment cn given yn and all the other cluster assignments.
1. Initialize cluster assignments c1, …, cN.
2. For i = 1, …, N, draw ci from:
     P(ci = c | c−i, yi) ∝ n−i,c · p(yi | ηc)        if c = cj for some j ≠ i
     P(ci = c | c−i, yi) ∝ α · ∫ p(yi | η) dG0(η)    otherwise
   where n−i,c is the number of j ≠ i with cj = c.
3.
For all c, draw ηc | {yi : ci = c}, i.e., resample each cluster's parameter from its posterior given the observations currently assigned to that cluster.

Conclusion
- We now have a statistically principled mechanism for solving our original problem.
- This was intended as a general and fairly shallow overview of Dirichlet processes.

Acknowledgments
Much thanks goes to David Blei. Some material for this presentation was inspired by slides from Teg Grenager and Zoubin Ghahramani.

References
- Blei, David M. and Michael I. Jordan. "Variational Inference for Dirichlet Process Mixtures." Bayesian Analysis 1(1), 2006.
- Neal, Radford M. "Markov Chain Sampling Methods for Dirichlet Process Mixture Models." Journal of Computational and Graphical Statistics, 9:249–265, 2000.
- Ghahramani, Zoubin. "Non-parametric Bayesian Methods." UAI Tutorial, July 2005.
- Grenager, Teg. "Chinese Restaurants and Stick Breaking: An Introduction to the Dirichlet Process."
- Blackwell, David and James B. MacQueen. "Ferguson Distributions via Pólya Urn Schemes." The Annals of Statistics 1(2), 1973, 353–355.
- Ferguson, Thomas S. "A Bayesian Analysis of Some Nonparametric Problems." The Annals of Statistics 1(2), 1973.
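As a concrete companion to the Dirichlet-distribution review: a draw from Dirichlet(α1, …, αm) can be generated by normalizing independent Gamma(αi, 1) draws, and the result always lies on the (m − 1)-simplex. A minimal sketch in pure Python (the function name is illustrative, not from the talk):

```python
import random

def dirichlet_sample(alphas, seed=None):
    """Draw theta ~ Dirichlet(alphas) by normalizing Gamma(alpha_i, 1) draws."""
    rng = random.Random(seed)
    gammas = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(gammas)
    return [g / total for g in gammas]

theta = dirichlet_sample([1.0, 2.0, 3.0], seed=0)
print(theta)  # a point on the 2-simplex: nonnegative entries summing to 1
```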
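The Chinese restaurant process seating rule described above (join an occupied table with probability proportional to its occupancy, open a new table with probability proportional to α) can be simulated directly. A minimal pure-Python sketch, with illustrative names:

```python
import random

def crp_assignments(n_customers, alpha, seed=0):
    """Seat customers one at a time: the (i+1)-th customer joins table k
    with probability counts[k] / (i + alpha), and opens a new table
    with probability alpha / (i + alpha)."""
    rng = random.Random(seed)
    counts = []        # counts[k] = number of customers at table k
    assignments = []
    for i in range(n_customers):
        r = rng.uniform(0.0, i + alpha)
        acc = 0.0
        table = len(counts)          # default: open a new table
        for k, cnt in enumerate(counts):
            acc += cnt
            if r < acc:
                table = k
                break
        if table == len(counts):
            counts.append(0)
        counts[table] += 1
        assignments.append(table)
    return assignments, counts

tables, sizes = crp_assignments(1000, alpha=2.0, seed=42)
print(len(sizes))  # number of occupied tables
```

Note the rich-get-richer dynamic: large tables attract more customers, which is exactly the clustering effect the slides describe.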
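Sethuraman's stick-breaking construction can likewise be sketched by truncating the infinite sum at a fixed number of atoms. This is a sketch under illustrative assumptions (G0 is taken to be a standard Gaussian; names are not from the talk):

```python
import random

def stick_breaking(alpha, g0_sample, n_atoms, seed=0):
    """Truncated Sethuraman construction: atoms eta_k* ~ G0 and
    weights pi_k = v_k * prod_{j<k} (1 - v_j), with v_k ~ Beta(1, alpha)."""
    rng = random.Random(seed)
    atoms, weights = [], []
    remaining = 1.0                  # length of stick not yet broken off
    for _ in range(n_atoms):
        v = rng.betavariate(1.0, alpha)
        atoms.append(g0_sample(rng))
        weights.append(v * remaining)
        remaining *= (1.0 - v)
    return atoms, weights

# illustrative base distribution G0 = N(0, 1)
atoms, weights = stick_breaking(alpha=5.0,
                                g0_sample=lambda r: r.gauss(0.0, 1.0),
                                n_atoms=200)
print(sum(weights))  # close to 1: the leftover stick mass is negligible
```

The returned pair represents a (truncated) draw G = Σk πk δ(ηk*), which is exactly the discrete, countably-atomic object the earlier slides describe.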
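The Algorithm 2 sweep can be sketched for the simplest conjugate case: 1-D Gaussian components with known variance σ², and base distribution G0 = N(0, τ²), so that p(yi | ηc) and the marginal ∫ p(yi | η) dG0(η) = N(yi; 0, σ² + τ²) are both available in closed form. The model choice and all names here are illustrative assumptions, not the talk's notation:

```python
import math
import random

def dpm_gibbs(y, alpha, sigma2=1.0, tau2=10.0, n_sweeps=20, seed=0):
    """Sketch of a Neal (2000) Algorithm-2-style Gibbs sweep for a DP
    mixture of 1-D Gaussians with known variance and G0 = N(0, tau2)."""
    rng = random.Random(seed)
    n = len(y)
    c = [0] * n                      # all points start in a single cluster
    eta = {0: 0.0}                   # cluster id -> cluster mean

    def norm_pdf(x, mean, var):
        return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

    def posterior_draw(ys):
        # conjugate update: eta | ys ~ Normal(mean, 1/prec)
        prec = 1.0 / tau2 + len(ys) / sigma2
        mean = (sum(ys) / sigma2) / prec
        return rng.gauss(mean, math.sqrt(1.0 / prec))

    for _ in range(n_sweeps):
        # resample each assignment c_i given c_{-i} and y_i
        for i in range(n):
            old = c[i]
            c[i] = -1
            if old not in c:         # cluster emptied: drop its parameter
                eta.pop(old, None)
            ks = sorted(eta)
            # existing cluster c: proportional to n_{-i,c} * p(y_i | eta_c)
            weights = [c.count(k) * norm_pdf(y[i], eta[k], sigma2) for k in ks]
            # new cluster: proportional to alpha * N(y_i; 0, sigma2 + tau2)
            weights.append(alpha * norm_pdf(y[i], 0.0, sigma2 + tau2))
            r = rng.uniform(0.0, sum(weights))
            acc, choice = 0.0, len(ks)
            for j, w in enumerate(weights):
                acc += w
                if r < acc:
                    choice = j
                    break
            if choice == len(ks):    # open a new cluster, eta from H_i
                new_k = max(eta, default=-1) + 1
                eta[new_k] = posterior_draw([y[i]])
                c[i] = new_k
            else:
                c[i] = ks[choice]
        # resample each cluster parameter from its posterior
        for k in set(c):
            eta[k] = posterior_draw([y[i] for i in range(n) if c[i] == k])
    return c, eta

data = [0.1, -0.3, 0.2, 5.1, 4.8, 5.3]
assignments, params = dpm_gibbs(data, alpha=1.0, n_sweeps=30, seed=1)
print(sorted(set(assignments)))
```

Note that the number of clusters is never fixed in advance: assignments open and close clusters as the sampler runs, which is the point of using a DP mixture for the motivating problem.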