Unsupervised Learning Techniques
9.520 Class 07, 1 March 2006
Andrea Caponnetto

About this class

Goal: to introduce some methods for unsupervised learning: Gaussian Mixtures, K-Means, ISOMAP, HLLE, Laplacian Eigenmaps.

Unsupervised learning

Only u i.i.d. samples {x_1, x_2, ..., x_u} drawn on X from the unknown marginal distribution p(x).

The goal is to infer properties of this probability density. In low dimension many nonparametric methods allow direct estimation of p(x) itself. Owing to the curse of dimensionality, these methods fail in high dimension, and one must settle for estimation of crude global models.

Unsupervised learning (cont.)

Different types of simple descriptive statistics characterize different aspects of p(x):

• mixture modelling: representation of p(x) by a mixture of simple densities representing different types or classes of observations [e.g. Gaussian mixtures]
• combinatorial clustering: attempt to find multiple regions of X that contain modes of p(x) [e.g. K-Means]
• dimensionality reduction: attempt to identify low-dimensional manifolds in X that represent high data density [e.g. ISOMAP, HLLE, Laplacian Eigenmaps]
• manifold learning: attempt to determine very specific geometrical or topological invariants of p(x) [e.g. homology learning]

Limited formalization

With supervised and semi-supervised learning there is a clear measure of the effectiveness of different methods: the expected loss I[f_S] of the various estimators can be estimated on a validation set. In the context of unsupervised learning, it is difficult to find such a direct measure of success. This situation has led to a proliferation of proposed methods.

Mixture Modelling

Assumption: the data is i.i.d. sampled from some probability distribution p(x). p(x) is modelled as a mixture of component density functions, each component corresponding to a cluster or mode. The free parameters of the model are fit to the data by maximum likelihood.

Gaussian Mixtures

We first choose a parametric model P_\theta for the unknown density p(x), then maximize the likelihood of our data relative to the parameters \theta.

Example: two-component Gaussian mixture model with parameters \theta = (\pi, \mu_1, \Sigma_1, \mu_2, \Sigma_2).

The model:

  P_\theta(x) = (1-\pi)\, G_{\Sigma_1}(x-\mu_1) + \pi\, G_{\Sigma_2}(x-\mu_2)

Maximize the log-likelihood

  \ell(\theta \mid \{x_1, \dots, x_u\}) = \sum_{i=1}^u \log P_\theta(x_i)

The EM algorithm

Maximization of \ell(\theta \mid \{x_1, \dots, x_u\}) is a difficult problem. Iterative maximization strategies, such as the EM algorithm, can be used in practice to find local maxima.

1. Expectation: compute the responsibilities

   \gamma_i = \frac{\pi\, G_{\Sigma_2}(x_i-\mu_2)}{(1-\pi)\, G_{\Sigma_1}(x_i-\mu_1) + \pi\, G_{\Sigma_2}(x_i-\mu_2)}

2. Maximization: compute means and covariances

   \mu_2 = \frac{\sum_i \gamma_i x_i}{\sum_i \gamma_i}, \qquad \Sigma_2 = \frac{\sum_i \gamma_i (x_i-\mu_2)(x_i-\mu_2)^T}{\sum_i \gamma_i}, \quad \text{etc.}

   and the mixing probability \pi = \frac{1}{u}\sum_i \gamma_i.

3. Iterate until convergence.
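The two-component updates above translate directly into code. Below is a minimal NumPy/SciPy sketch (not part of the original notes; the function name, the initialization scheme, and the small ridge term added for numerical stability are illustrative choices). The "etc." in the maximization step is made explicit by updating component 1 with the complementary weights 1 - \gamma_i.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_two_gaussians(x, n_iter=100, seed=0):
    """Minimal EM sketch for a two-component Gaussian mixture.
    x: (u, D) array of samples."""
    rng = np.random.default_rng(seed)
    u, D = x.shape
    # crude initialization: two random data points as means, the sample
    # covariance for both components, and mixing probability pi = 1/2
    mu1, mu2 = x[rng.choice(u, size=2, replace=False)]
    S1 = S2 = np.cov(x, rowvar=False) + 1e-6 * np.eye(D)
    pi = 0.5
    for _ in range(n_iter):
        # Expectation: responsibility gamma_i of component 2 for each sample
        p1 = (1 - pi) * multivariate_normal.pdf(x, mean=mu1, cov=S1)
        p2 = pi * multivariate_normal.pdf(x, mean=mu2, cov=S2)
        gamma = p2 / (p1 + p2)
        # Maximization: weighted means and covariances; component 1 uses the
        # complementary weights (the "etc." in the notes)
        w2, w1 = gamma, 1 - gamma
        mu2 = (w2[:, None] * x).sum(0) / w2.sum()
        mu1 = (w1[:, None] * x).sum(0) / w1.sum()
        d2, d1 = x - mu2, x - mu1
        S2 = (w2[:, None] * d2).T @ d2 / w2.sum() + 1e-6 * np.eye(D)
        S1 = (w1[:, None] * d1).T @ d1 / w1.sum() + 1e-6 * np.eye(D)
        # mixing probability
        pi = gamma.mean()
    return pi, mu1, S1, mu2, S2
```

As the notes observe, EM only reaches local maxima of the log-likelihood, so in practice one compares runs started from several random initializations and keeps the best.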
Combinatorial Clustering

Algorithms in this class work on the data without any reference to an underlying probability model. The goal is to assign each data point x_i to a cluster k belonging to a predefined set {1, 2, ..., K}:

  C(i) = k, \qquad i = 1, 2, \dots, u.

The optimal encoder C^* minimizes the overall dissimilarities d(x_i, x_j) between points x_i, x_j assigned to the same cluster:

  W(C) = \frac{1}{2} \sum_{k=1}^K \sum_{C(i)=k} \sum_{C(j)=k} d(x_i, x_j)

The simplest choice for the dissimilarity d(\cdot,\cdot) is the squared Euclidean distance in X.

Combinatorial Clustering (cont.)

The minimization of the within-cluster point scatter W(C) is straightforward in principle, but the number of distinct assignments grows exponentially with the number of data points u:

  S(u, K) = \frac{1}{K!} \sum_{k=1}^K (-1)^{K-k} \binom{K}{k} k^u

Already S(19, 4) \simeq 10^{10}! In practice, clustering algorithms look for good suboptimal solutions. The most popular algorithms are based on iterative descent strategies, which converge to local optima.

K-Means

If d(x_i, x_j) = \|x_i - x_j\|^2, introducing the mean vector \bar{x}_k and the size N_k of the k-th cluster, the within-cluster point scatter W(C) can be rewritten as

  W(C) = \frac{1}{2} \sum_{k=1}^K \sum_{C(i)=k} \sum_{C(j)=k} \|x_i - x_j\|^2 = \sum_{k=1}^K N_k \sum_{C(i)=k} \|x_i - \bar{x}_k\|^2.

Exploiting this representation, one can easily verify that the optimal encoder C^* is the solution of the enlarged minimization problem

  \min_{C,\,(m_1, \dots, m_K)} \sum_{k=1}^K N_k \sum_{C(i)=k} \|x_i - m_k\|^2.

K-Means (cont.)

K-Means attempts the minimization of the enlarged problem by an iterative alternating procedure. Each of steps 1 and 2 reduces the objective function, so convergence is assured.

1. minimization with respect to (m_1, \dots, m_K), giving m_k = \bar{x}_k
2. minimization with respect to C, giving C(i) = \arg\min_{1 \le k \le K} \|x_i - m_k\|
3. iterate until C does not change

One should compare solutions derived from different initial random means, and choose the best local minimum.

Voronoi tessellation

[Figure: the assignment step C(i) = \arg\min_k \|x_i - m_k\| partitions X into the Voronoi cells of the means m_1, \dots, m_K.]

Dimensionality reduction

Often reducing the dimensionality of a problem is an effective preliminary step toward the actual solution of a regression or classification problem. We look for a mapping \Phi from the high-dimensional space \mathbb{R}^D to the low-dimensional space \mathbb{R}^d which preserves some relevant geometrical structure of our problem.

Principal Component Analysis (PCA)

We try to approximate the data \{x_1, \dots, x_u\} in \mathbb{R}^D by a d-dimensional hyperplane

  H = \{c + V\theta \mid \theta \in \mathbb{R}^d\}

where c is a vector in \mathbb{R}^D, \theta is a coordinate vector in \mathbb{R}^d, and V = (v_1, \dots, v_d) is a D \times d matrix whose columns \{v_i\} form an orthonormal system of vectors in \mathbb{R}^D.

Problem: find the H which minimizes the sum of squared distances of the data points x_i from H:

  H^* = \arg\min_H \sum_{i=1}^u \|x_i - P_H(x_i)\|^2

Linear approximation

[Figure: the data points and their projections P_H(x_i) onto the best-fitting hyperplane H.]

PCA: the algorithm

1. center the data points: \sum_{i=1}^u x_i = 0
2. define the u \times D matrix X = (x_1, \dots, x_u)^T
3. construct the singular value decomposition X = U \Sigma W^T, with
   • W the D \times D matrix (w_1, \dots, w_D), whose columns \{w_i\} are the right singular vectors of X
   • U the u \times D matrix (u_1, \dots, u_D), whose columns \{u_i\} are the left singular vectors of X
   • \Sigma the D \times D matrix \mathrm{diag}(\sigma_1, \dots, \sigma_D), with \sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_D \ge 0 the singular values of X
4. answer: V = (w_1, \dots, w_d)

(A numerical sketch of this procedure is given at the end of these notes.)

Sketch of proof

• Rewrite the minimization problem as

  \min_{c, V, \{\theta_i\}} \sum_{i=1}^u \|x_i - c - V\theta_i\|^2

• Centering and minimizing with respect to c and \theta_i gives

  c = 0, \qquad \theta_i = V^T x_i

• Plugging into the minimization problem,

  \arg\min_V \sum_{i=1}^u \|x_i - VV^T x_i\|^2 = \arg\max_V \sum_{i=1}^u x_i^T VV^T x_i = \arg\max_V \sum_{j=1}^d v_j^T X^T X\, v_j,

  hence (v_1, \dots, v_d) are the first d eigenvectors of X^T X, that is, (w_1, \dots, w_d).

Mercer's Theorem

Consider a positive definite kernel K(x, x') on X \times X, and the probability distribution p(x) on X. Define the integral operator L_K by

  (L_K f)(x) = \int_X K(x, x') f(x')\, dp(x').

Mercer's Theorem states that

  K(x, x') = \sum_i \lambda_i \phi_i(x) \phi_i(x')

where (\lambda_i, \phi_i)_i is the eigensystem of L_K.

Feature Map

From Mercer's Theorem, the mapping \Phi defined over X by

  \Phi(x) = (\sqrt{\lambda_1}\,\phi_1(x), \sqrt{\lambda_2}\,\phi_2(x), \dots)

is such that

  K(x, x') = \langle \Phi(x), \Phi(x') \rangle_{\ell_2}.
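The defining property of a feature map is exactly this inner-product identity. As a small illustration (not from the original notes, and using an explicit polynomial feature map rather than the Mercer eigenfunction map), the sketch below checks numerically that the homogeneous polynomial kernel K(x, x') = (x \cdot x')^2 on \mathbb{R}^2 equals the inner product of the feature vectors \Phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2).

```python
import numpy as np

def K(x, xp):
    """Homogeneous degree-2 polynomial kernel on R^2."""
    return np.dot(x, xp) ** 2

def Phi(x):
    """An explicit feature map for K, mapping R^2 into R^3."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

rng = np.random.default_rng(0)
x, xp = rng.standard_normal(2), rng.standard_normal(2)
# the kernel value and the inner product of the feature vectors coincide
assert np.isclose(K(x, xp), Phi(x) @ Phi(xp))
```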
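Closing illustration, referenced from the PCA algorithm above: a minimal NumPy sketch (not part of the original notes; the function name and the reconstruction step are illustrative) that centers the data, takes the thin SVD, and keeps the first d right singular vectors.

```python
import numpy as np

def pca(x, d):
    """PCA via SVD: return the center c, the D x d matrix V of the top-d right
    singular vectors of the centered data, and the coordinates theta_i = V^T (x_i - c)."""
    c = x.mean(axis=0)                                   # step 1: center the data
    Xc = x - c                                           # step 2: u x D data matrix
    U, s, Wt = np.linalg.svd(Xc, full_matrices=False)    # step 3: thin SVD, X = U S W^T
    V = Wt[:d].T                                         # step 4: V = (w_1, ..., w_d)
    theta = Xc @ V                                       # low-dimensional coordinates
    return c, V, theta

# usage: project 5-dimensional points onto the best-fitting 2-dimensional hyperplane
x = np.random.default_rng(0).standard_normal((100, 5))
c, V, theta = pca(x, d=2)
reconstruction = c + theta @ V.T                         # the projections P_H(x_i) onto H
```

The columns of V span the best-fitting hyperplane H, and c + V\theta_i recovers the projections P_H(x_i) appearing in the minimization problem above.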