MSU CSE 802 - Boosting Clustering by Pairwise Constraints

BoostCluster: Boosting Clustering by Pairwise Constraints

Yi Liu, Rong Jin, and Anil K. Jain
Department of Computer Science and Engineering
Michigan State University
East Lansing, MI 48824, USA
{liuyi3, rongjin, jain}@cse.msu.edu

ABSTRACT

Data clustering is an important task in many disciplines. A large number of studies have attempted to improve clustering by using side information that is often encoded as pairwise constraints. However, these studies focus on designing special clustering algorithms that can effectively exploit the pairwise constraints. We present a boosting framework for data clustering, termed BoostCluster, that is able to iteratively improve the accuracy of any given clustering algorithm by exploiting the pairwise constraints. The key challenge in designing a boosting framework for data clustering is how to influence an arbitrary clustering algorithm with the side information, since clustering algorithms are by definition unsupervised. The proposed framework addresses this problem by dynamically generating new data representations at each iteration that are, on the one hand, adapted to the clustering results produced by the given algorithm at previous iterations and, on the other hand, consistent with the given side information.
Our empirical study shows that the proposed boosting framework is effective in improving the performance of a number of popular clustering algorithms (K-means, partitional SingleLink, spectral clustering), and its performance is comparable to the state-of-the-art algorithms for data clustering with side information.

Categories and Subject Descriptors
I.5.3 [Clustering]: Algorithms; H.3.3 [Information Search and Retrieval]: Clustering

General Terms
Algorithms

Keywords
Boosting, Data clustering, Semi-supervised learning, Pairwise constraints

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
KDD'07, August 12-15, 2007, San Jose, California, USA.
Copyright 200X ACM X-XXXXX-XX-X/XX/XX ...$5.00.

1. INTRODUCTION

Data clustering, also called unsupervised learning, is one of the key techniques in data mining, used to understand and mine the structure of unlabeled data. The idea of improving clustering with side information, sometimes called semi-supervised clustering or constrained data clustering, has received a significant amount of attention in recent studies on data clustering. Often, the side information is presented in the form of pairwise constraints: must-link pairs, whose data points should belong to the same cluster, and cannot-link pairs, whose data points should belong to different clusters. There are two major approaches to semi-supervised clustering: the constraint-based approach and the approach based on distance metric learning. The first approach employs the side information to restrict the solution space, and finds only those solutions that are consistent with the pairwise constraints.
The second approach first learns a distance metric from the given pairwise constraints, and computes the pairwise similarity using the learned distance metric. The computed similarity matrix is then used for data clustering.

Although a large number of studies have been devoted to semi-supervised clustering, most of them focus on designing special clustering algorithms that can effectively exploit the pairwise constraints. For instance, the algorithms in [4, 5, 23] are designed to improve probabilistic models for data clustering by incorporating the pairwise constraints into the priors of the probabilistic models; the constrained K-means algorithm [27] exploits the pairwise constraints by adjusting the cluster memberships to be consistent with the given constraints. However, it is often the case that we have a specific clustering algorithm, specially designed for the target domain, and we are interested in improving the accuracy of this clustering algorithm using the available side information. To this end, we propose a boosting framework for data clustering, termed BoostCluster, that is able to improve any given clustering algorithm with the pairwise constraints. It is important to note that the proposed framework does not make any assumption about the underlying clustering algorithm, and is therefore applicable to any clustering algorithm.

Although a number of boosting algorithms (e.g., [10]) have been successfully applied to supervised learning, extending boosting algorithms to data clustering is significantly more challenging. The key difficulty is how to influence an arbitrary clustering algorithm with the given pairwise constraints.
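To make the constraint-based approach concrete, the sketch below shows a constraint-respecting assignment step in the spirit of constrained K-means: each point is assigned to the nearest centroid whose choice violates no must-link or cannot-link constraint among already-assigned points. This is an illustrative simplification on 1-D toy data, not the exact algorithm of [27]; the function names are our own.

```python
# Illustrative sketch of constraint-respecting cluster assignment, in the
# spirit of constrained K-means (not the exact algorithm of [27]).

def violates(i, c, labels, must_link, cannot_link):
    """Would assigning point i to cluster c break any constraint
    against an already-assigned partner?"""
    for a, b in must_link:
        if i in (a, b):
            other = b if i == a else a
            if labels[other] not in (None, c):  # must-link partner elsewhere
                return True
    for a, b in cannot_link:
        if i in (a, b):
            other = b if i == a else a
            if labels[other] == c:              # cannot-link partner in c
                return True
    return False

def constrained_assign(points, centers, must_link, cannot_link):
    """Assign each 1-D point to the nearest feasible centroid."""
    labels = [None] * len(points)
    for i, x in enumerate(points):
        # Try centroids from nearest to farthest; take the first feasible one.
        for c in sorted(range(len(centers)), key=lambda c: abs(x - centers[c])):
            if not violates(i, c, labels, must_link, cannot_link):
                labels[i] = c
                break
    return labels
```

For example, with points [0.1, 9.9, 0.2] and centers [0.0, 10.0], a cannot-link constraint between points 0 and 2 forces point 2 into the far cluster even though it lies next to the first centroid.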
[Figure 1: An example illustrating the idea of iterative data projections. (a) shows the original data distribution, projected onto the space spanned by its two principal components; (b)-(d) show the data distributions based on the new representations in the projected subspaces generated in iterations 1, 2, and 7.]

To overcome this challenge, we propose to encode the side information into the data representation that serves as the input to the clustering algorithm. More specifically, we first find the subspace in which the data points of must-link pairs are close to each other while the data points of cannot-link pairs are far apart. A new data representation is then generated by projecting all the data points into this subspace, and is used by the given clustering algorithm to find the appropriate cluster assignments. Furthermore, the procedure for identifying the appropriate subspace based on the remaining unsatisfied constraints and the procedure for clustering the data points using the newly generated representation alternate iteratively until most of the constraints are satisfied. Although
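The subspace step described above can be sketched as follows. This is only an illustration of the general idea, not BoostCluster's actual projection (whose details are beyond this preview): we build scatter matrices from the constraint pairs and keep the top eigenvectors of their difference, so that cannot-link pairs spread apart while must-link pairs are compressed.

```python
import numpy as np

def constraint_projection(X, must_link, cannot_link, dim=2):
    """Illustrative sketch: project X onto directions that pull must-link
    pairs together and push cannot-link pairs apart (non-empty pair lists)."""
    d = X.shape[1]
    S_m = np.zeros((d, d))  # scatter of must-link differences
    S_c = np.zeros((d, d))  # scatter of cannot-link differences
    for i, j in must_link:
        diff = X[i] - X[j]
        S_m += np.outer(diff, diff)
    for i, j in cannot_link:
        diff = X[i] - X[j]
        S_c += np.outer(diff, diff)
    # Top eigenvectors of (S_c - S_m) span a subspace where cannot-link
    # pairs have large separation and must-link pairs have small separation.
    w, V = np.linalg.eigh(S_c - S_m)      # eigenvalues in ascending order
    P = V[:, np.argsort(w)[::-1][:dim]]   # keep the `dim` largest
    return X @ P
```

On a toy set where must-link pairs differ along one axis and cannot-link pairs along another, a 1-D projection recovers the cannot-link axis and collapses the must-link one.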

