##
This **preview** shows page *1-2-3*
out of 9 **pages**.

*View Full Document*

End of preview. Want to read all 9 pages?

Upload your study docs or become a GradeBuddy member to access this document.

View Full Document**Unformatted text preview:**

Information Theoretical Clustering via Semidefinite ProgrammingMeihong Wang Fei ShaDepartment of Computer ScienceU. of Southern CaliforniaLos Angeles, CA [email protected] of Computer ScienceU. of Southern CaliforniaLos Angeles, CA [email protected] propose techniques of convex optimiza-tion for information theoretical clustering.The clustering objective is to maximize themutual information between data points andcluster assignments. We formulate this prob-lem first as an instance o f max k cut onweighted graphs. We then a pply the tech-nique of semidefinite programming (SDP) re -laxation to obtain a convex SDP pr oblem.We show how the solution of the SDP prob-lem can be further improved with a low-rank re finement heuristic. The low-rank so-lution reveals more clearly the cluster struc-ture of the data. Empirical studies on severaldatasets demonstrate the effectiveness of ourapproach. In particular, the approach out-performs several other clustering algorithmswhen compared on standard evaluation met-rics.1 INTRODUCTIONClustering is an imp ortant problem in machine learn-ing and data mining. The basic setup is to group datapoints into disjoint partitions that optimize some crite-ria. For instance, the technique of K- means minimizesthe sum of pairwise distances between data points inthe same partition. The algorithm iterates betweentwo steps: computing the centroids of existing parti-tions and re-assigning every data point to the partitionwith the nearest centroid. Both steps monotonic allydecrease the optimality measure, and the algorithmconverges to a local optimum.Appearing in Proceedings of the 14thInternational Con-ference on Artificial Intelligence an d Statistics (AISTATS)2011, Fort Lauderdale, FL, USA . Volume 15 of JMLR:W&CP 15. Copyright 2011 by the authors.Many variants to K-means e xist. If data lie on a low-dimensional submanifold, then we can use (geodesic)distances on the manifold instead of E uc lide an dis-tances in the e mbedding space. This leads to the tech-nique of spectral clustering (Ng et al., 2001).It is alsoeasy to see how kernel tricks can be applied to formu-late distances with inner products in nonlinear featurespaces, resulting kernelized K-means.Information theoretic clustering (ITC) has recentlybeen investigated by Faivishevsky a nd Goldberger(2010) as an alternative criterion. The criterion max-imizes the mutual informa tion (MI) between datapoints and their cluster memberships. To over-come the difficulty of estimating MI between high-dimensional variables, ITC uses pairwise distancesbased non-parametric statistics (Wang et al., 2009;Kozachenko and Leonenko, 1987). Maximizing themutual information criterion, however, still remainschallenging as it is a NP-hard combinator optimiza-tion. The earlier work uses a local search procedure,sequentially and greedily re-assigning a data pointfrom its current cluster to a new one. While seeminglyeffective, there is no established theoretical propertieson how well this procedure can achieve.In this paper, we prop ose a new optimization proce-dure for ITC. We first identify the problem as an in-stance o f max k cut on weights graphs. We thenapply s e midefinite programming (SDP) relaxation tofind a pproximate solutions. The relaxed problem isconvex and can be solved efficiently. Furthermore, theSDP-based solutions have a strong theoretical guaran-tee in approximation factors. Empirical studies alsoshow that our approach yields much higher clusteringquality than the heuristic procedure.The rest of the paper is organized as follows. In sec-tion 2, we describe nonparametric techniques of esti-mating entropy and mutual information. In section 3,we describe the idea of information theoretical clus-tering and the process of relaxing it as an instance ofSDP. We discuss rela ted work in section 4. Ex perimen-Information Theoretical Clustering via Semide finite Programmingtal results are prese nted in section 5. We summariz ein section 6 and discuss future directions for research.2 ESTIMATION OF ENTROPYEntropy plays an important role in forming many in-formation theoretical quantities. In this paper, we areinterested in how it is related to I(X; Y ), the mutualinformation between a random variable X and its clus-ter membership Y . Concretely,I(X; Y ) = H(X) − H(X|Y ), (1)where the right-hand-side is the differenc e between theentropy and the conditional entropy of X.For high-dimensional X, estimating entropies fromsamples is a challenging problem. Standard ap-proaches include quantizing/binning X or construct-ing density es tima tors of X. They often do not workwell due to the “curse of dimensio nality”, where thenumber of data points grows exponentially in order toobtain an accurate estimation.There has been a growing interest in using nonpa-rameteric statistics to estimate entropies. Specifically,it was shown that, g iven N samples D = {x1, ...xN}where xi∈ RD, the entropy of X can be estimated byˆHk(X) =DNXilog kxi− x(k)ik22+ const, (2)where x(k)iis the k-th nearest neighbor of xiin D (Wang et al., 2009; Kozachenko and Leo ne nko,1987). The estimator approaches H(X) with a c on-vergence rate of O(1/√N).AveragingˆHk(X) ove r all possible k from 1 to (N −1)leads to a simplified estimator,ˆH(X) =1(N − 1)N−1Xk=1ˆHk(X)=DN(N − 1)Xi6=jlog kxi− xjk22+ const. (3)This estimator was first investigatedin (Faivishevsky a nd Go ldberger, 2009, 2010) and canbe understood intuitively as follows.To estimate the entropy, one would need to obtain anunbiased estimator of −log p(xi) such thatH(X) ≈1NXi−log p(xi). (4)For one-dimensional X, we can approximate p(xi) asa uniform distributio n between x and xj, which givesrise to− log p(xi) ≈ −log1|xi− xj|. (5)Averaging this estimator over all possible xj6=xi, we obtain an estimator in the form ofeq. (3). A detailed derivation of this result is givenin (Faivishevsky and Goldberger, 2010).ˆH(X) is more computationally convenient thanˆHk(X)as it does not need to identify nearest neighbor s. Thus,we fo cus onˆH(X) in the rest of the paper.For the conditional entropy H(X|Y ), we estimate itwith the followingˆH(X|Y ) =Xyˆp(Y = y)ˆH(X, Y = y), (6)where ˆp(Y = y) is the (empirical) prior distributionandˆH(X, Y = y) is the entropy of data samples whosecorres ponding Y is y.Spec ific ally, in the context of clustering, Y stands forcluster memberships. We assume that there are Kclusters, e ach with Nkdata points. The