Unsupervised Learning Techniques
9.520 Class 07, 1 March 2006
Andrea Caponnetto

About this class

Goal: to introduce some methods for unsupervised learning: Gaussian Mixtures, K-Means, ISOMAP, HLLE, Laplacian Eigenmaps.

Unsupervised learning

Only u i.i.d. samples {x_1, x_2, ..., x_u} drawn on X from the unknown marginal distribution p(x).

The goal is to infer properties of this probability density. In low dimension many nonparametric methods allow direct estimation of p(x) itself. Owing to the curse of dimensionality, these methods fail in high dimension, and one must settle for estimation of crude global models.

Unsupervised learning (cont.)

Different types of simple descriptive statistics characterize different aspects of p(x):

• mixture modelling: representation of p(x) by a mixture of simple densities representing different types or classes of observations [e.g. Gaussian mixtures]
• combinatorial clustering: attempt to find multiple regions of X that contain modes of p(x) [e.g. K-Means]
• dimensionality reduction: attempt to identify low-dimensional manifolds in X that represent high data density [e.g. ISOMAP, HLLE, Laplacian Eigenmaps]
• manifold learning: attempt to determine very specific geometrical or topological invariants of p(x) [e.g. homology learning]

Limited formalization

With supervised and semi-supervised learning there is a clear measure of the effectiveness of different methods: the expected loss I[f_S] of the various estimators can be estimated on a validation set. In the context of unsupervised learning, it is difficult to find such a direct measure of success. This situation has led to a proliferation of proposed methods.

Mixture Modelling

Assumption: the data is i.i.d. sampled from some probability distribution p(x). p(x) is modelled as a mixture of component density functions, each component corresponding to a cluster or mode. The free parameters of the model are fit to the data by maximum likelihood.

Gaussian Mixtures

We first choose a parametric model P_\theta for the unknown density p(x), then maximize the likelihood of our data relative to the parameters \theta.

Example: two-component Gaussian mixture model with parameters \theta = (\pi, \mu_1, \Sigma_1, \mu_2, \Sigma_2).

The model:

  P_\theta(x) = (1-\pi)\, G_{\Sigma_1}(x-\mu_1) + \pi\, G_{\Sigma_2}(x-\mu_2)

Maximize the log-likelihood

  \ell(\theta \mid \{x_1, \dots, x_u\}) = \sum_{i=1}^u \log P_\theta(x_i)

The EM algorithm

Maximization of \ell(\theta \mid \{x_1, \dots, x_u\}) is a difficult problem. Iterative maximization strategies, such as the EM algorithm, can be used in practice to find local maxima.

1. Expectation: compute the responsibilities

   \gamma_i = \frac{\pi\, G_{\Sigma_2}(x_i-\mu_2)}{(1-\pi)\, G_{\Sigma_1}(x_i-\mu_1) + \pi\, G_{\Sigma_2}(x_i-\mu_2)}

2. Maximization: compute means and covariances

   \mu_2 = \frac{\sum_i \gamma_i x_i}{\sum_i \gamma_i}, \qquad \Sigma_2 = \frac{\sum_i \gamma_i (x_i-\mu_2)(x_i-\mu_2)^T}{\sum_i \gamma_i}, \quad \text{etc.}

   and the mixing probability \pi = \frac{1}{u}\sum_i \gamma_i.

3. Iterate until convergence.
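The two-component updates above translate directly into code. Below is a minimal NumPy/SciPy sketch (not part of the original notes; the function name, the initialization scheme, and the small ridge term added for numerical stability are illustrative choices). The "etc." in the maximization step is made explicit by updating component 1 with the complementary weights 1 - \gamma_i.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_two_gaussians(x, n_iter=100, seed=0):
    """Minimal EM sketch for a two-component Gaussian mixture.
    x: (u, D) array of samples."""
    rng = np.random.default_rng(seed)
    u, D = x.shape
    # crude initialization: two random data points as means, the sample
    # covariance for both components, and mixing probability pi = 1/2
    mu1, mu2 = x[rng.choice(u, size=2, replace=False)]
    S1 = S2 = np.cov(x, rowvar=False) + 1e-6 * np.eye(D)
    pi = 0.5
    for _ in range(n_iter):
        # Expectation: responsibility gamma_i of component 2 for each sample
        p1 = (1 - pi) * multivariate_normal.pdf(x, mean=mu1, cov=S1)
        p2 = pi * multivariate_normal.pdf(x, mean=mu2, cov=S2)
        gamma = p2 / (p1 + p2)
        # Maximization: weighted means and covariances; component 1 uses the
        # complementary weights (the "etc." in the notes)
        w2, w1 = gamma, 1 - gamma
        mu2 = (w2[:, None] * x).sum(0) / w2.sum()
        mu1 = (w1[:, None] * x).sum(0) / w1.sum()
        d2, d1 = x - mu2, x - mu1
        S2 = (w2[:, None] * d2).T @ d2 / w2.sum() + 1e-6 * np.eye(D)
        S1 = (w1[:, None] * d1).T @ d1 / w1.sum() + 1e-6 * np.eye(D)
        # mixing probability
        pi = gamma.mean()
    return pi, mu1, S1, mu2, S2
```

As the notes observe, EM only reaches local maxima of the log-likelihood, so in practice one compares runs started from several random initializations and keeps the best.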
Combinatorial Clustering

Algorithms in this class work on the data without any reference to an underlying probability model. The goal is to assign each data point x_i to a cluster k belonging to a predefined set {1, 2, ..., K}:

  C(i) = k, \qquad i = 1, 2, \dots, u.

The optimal encoder C^* minimizes the overall dissimilarities d(x_i, x_j) between points x_i, x_j assigned to the same cluster:

  W(C) = \frac{1}{2} \sum_{k=1}^K \sum_{C(i)=k} \sum_{C(j)=k} d(x_i, x_j)

The simplest choice for the dissimilarity d(\cdot,\cdot) is the squared Euclidean distance in X.

Combinatorial Clustering (cont.)

The minimization of the within-cluster point scatter W(C) is straightforward in principle, but the number of distinct assignments grows exponentially with the number of data points u:

  S(u, K) = \frac{1}{K!} \sum_{k=1}^K (-1)^{K-k} \binom{K}{k} k^u

Already S(19, 4) \simeq 10^{10}! In practice, clustering algorithms look for good suboptimal solutions. The most popular algorithms are based on iterative descent strategies, which converge to local optima.

K-Means

If d(x_i, x_j) = \|x_i - x_j\|^2, introducing the mean vector \bar{x}_k and the size N_k of the k-th cluster, the within-cluster point scatter W(C) can be rewritten as

  W(C) = \frac{1}{2} \sum_{k=1}^K \sum_{C(i)=k} \sum_{C(j)=k} \|x_i - x_j\|^2 = \sum_{k=1}^K N_k \sum_{C(i)=k} \|x_i - \bar{x}_k\|^2.

Exploiting this representation, one can easily verify that the optimal encoder C^* is the solution of the enlarged minimization problem

  \min_{C,\,(m_1, \dots, m_K)} \sum_{k=1}^K N_k \sum_{C(i)=k} \|x_i - m_k\|^2.

K-Means (cont.)

K-Means attempts the minimization of the enlarged problem by an iterative alternating procedure. Each of steps 1 and 2 reduces the objective function, so convergence is assured.

1. minimization with respect to (m_1, \dots, m_K), giving m_k = \bar{x}_k
2. minimization with respect to C, giving C(i) = \arg\min_{1 \le k \le K} \|x_i - m_k\|
3. iterate until C does not change

One should compare solutions derived from different initial random means, and choose the best local minimum.

Voronoi tessellation

[Figure: the assignment step C(i) = \arg\min_k \|x_i - m_k\| partitions X into the Voronoi cells of the means m_1, \dots, m_K.]

Dimensionality reduction

Often reducing the dimensionality of a problem is an effective preliminary step toward the actual solution of a regression or classification problem. We look for a mapping \Phi from the high-dimensional space \mathbb{R}^D to the low-dimensional space \mathbb{R}^d which preserves some relevant geometrical structure of our problem.

Principal Component Analysis (PCA)

We try to approximate the data \{x_1, \dots, x_u\} in \mathbb{R}^D by a d-dimensional hyperplane

  H = \{c + V\theta \mid \theta \in \mathbb{R}^d\}

where c is a vector in \mathbb{R}^D, \theta is a coordinate vector in \mathbb{R}^d, and V = (v_1, \dots, v_d) is a D \times d matrix whose columns \{v_i\} form an orthonormal system of vectors in \mathbb{R}^D.

Problem: find the H which minimizes the sum of squared distances of the data points x_i from H:

  H^* = \arg\min_H \sum_{i=1}^u \|x_i - P_H(x_i)\|^2

Linear approximation

[Figure: the data points and their projections P_H(x_i) onto the best-fitting hyperplane H.]

PCA: the algorithm

1. center the data points: \sum_{i=1}^u x_i = 0
2. define the u \times D matrix X = (x_1, \dots, x_u)^T
3. construct the singular value decomposition X = U \Sigma W^T, with
   • W the D \times D matrix (w_1, \dots, w_D), whose columns \{w_i\} are the right singular vectors of X
   • U the u \times D matrix (u_1, \dots, u_D), whose columns \{u_i\} are the left singular vectors of X
   • \Sigma the D \times D matrix \mathrm{diag}(\sigma_1, \dots, \sigma_D), with \sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_D \ge 0 the singular values of X
4. answer: V = (w_1, \dots, w_d)

(A numerical sketch of this procedure is given at the end of these notes.)

Sketch of proof

• Rewrite the minimization problem as

  \min_{c, V, \{\theta_i\}} \sum_{i=1}^u \|x_i - c - V\theta_i\|^2

• Centering and minimizing with respect to c and \theta_i gives

  c = 0, \qquad \theta_i = V^T x_i

• Plugging into the minimization problem,

  \arg\min_V \sum_{i=1}^u \|x_i - VV^T x_i\|^2 = \arg\max_V \sum_{i=1}^u x_i^T VV^T x_i = \arg\max_V \sum_{j=1}^d v_j^T X^T X\, v_j,

  hence (v_1, \dots, v_d) are the first d eigenvectors of X^T X, that is, (w_1, \dots, w_d).

Mercer's Theorem

Consider a positive definite kernel K(x, x') on X \times X, and the probability distribution p(x) on X. Define the integral operator L_K by

  (L_K f)(x) = \int_X K(x, x') f(x')\, dp(x').

Mercer's Theorem states that

  K(x, x') = \sum_i \lambda_i \phi_i(x) \phi_i(x')

where (\lambda_i, \phi_i)_i is the eigensystem of L_K.

Feature Map

From Mercer's Theorem, the mapping \Phi defined over X by

  \Phi(x) = (\sqrt{\lambda_1}\,\phi_1(x), \sqrt{\lambda_2}\,\phi_2(x), \dots)

is such that

  K(x, x') = \langle \Phi(x), \Phi(x') \rangle_{\ell_2}.
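The defining property of a feature map is exactly this inner-product identity. As a small illustration (not from the original notes, and using an explicit polynomial feature map rather than the Mercer eigenfunction map), the sketch below checks numerically that the homogeneous polynomial kernel K(x, x') = (x \cdot x')^2 on \mathbb{R}^2 equals the inner product of the feature vectors \Phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2).

```python
import numpy as np

def K(x, xp):
    """Homogeneous degree-2 polynomial kernel on R^2."""
    return np.dot(x, xp) ** 2

def Phi(x):
    """An explicit feature map for K, mapping R^2 into R^3."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

rng = np.random.default_rng(0)
x, xp = rng.standard_normal(2), rng.standard_normal(2)
# the kernel value and the inner product of the feature vectors coincide
assert np.isclose(K(x, xp), Phi(x) @ Phi(xp))
```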
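Closing illustration, referenced from the PCA algorithm above: a minimal NumPy sketch (not part of the original notes; the function name and the reconstruction step are illustrative) that centers the data, takes the thin SVD, and keeps the first d right singular vectors.

```python
import numpy as np

def pca(x, d):
    """PCA via SVD: return the center c, the D x d matrix V of the top-d right
    singular vectors of the centered data, and the coordinates theta_i = V^T (x_i - c)."""
    c = x.mean(axis=0)                                   # step 1: center the data
    Xc = x - c                                           # step 2: u x D data matrix
    U, s, Wt = np.linalg.svd(Xc, full_matrices=False)    # step 3: thin SVD, X = U S W^T
    V = Wt[:d].T                                         # step 4: V = (w_1, ..., w_d)
    theta = Xc @ V                                       # low-dimensional coordinates
    return c, V, theta

# usage: project 5-dimensional points onto the best-fitting 2-dimensional hyperplane
x = np.random.default_rng(0).standard_normal((100, 5))
c, V, theta = pca(x, d=2)
reconstruction = c + theta @ V.T                         # the projections P_H(x_i) onto H
```

The columns of V span the best-fitting hyperplane H, and c + V\theta_i recovers the projections P_H(x_i) appearing in the minimization problem above.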