Constrained K-means Clustering with Background Knowledge

Kiri Wagstaff ([email protected])
Claire Cardie ([email protected])
Department of Computer Science, Cornell University, Ithaca, NY 14853 USA

Seth Rogers ([email protected])
Stefan Schroedl ([email protected])
DaimlerChrysler Research and Technology Center, 1510 Page Mill Road, Palo Alto, CA 94304 USA

Proceedings of the Eighteenth International Conference on Machine Learning, 2001, pp. 577–584.

Abstract

Clustering is traditionally viewed as an unsupervised method for data analysis. However, in some cases information about the problem domain is available in addition to the data instances themselves. In this paper, we demonstrate how the popular k-means clustering algorithm can be profitably modified to make use of this information. In experiments with artificial constraints on six data sets, we observe improvements in clustering accuracy. We also apply this method to the real-world problem of automatically detecting road lanes from GPS data and observe dramatic increases in performance.

1. Introduction

Clustering algorithms are generally used in an unsupervised fashion. They are presented with a set of data instances that must be grouped according to some notion of similarity. The algorithm has access only to the set of features describing each object; it is not given any information (e.g., labels) as to where each of the instances should be placed within the partition.

However, in real application domains, it is often the case that the experimenter possesses some background knowledge (about the domain or the data set) that could be useful in clustering the data. Traditional clustering algorithms have no way to take advantage of this information even when it does exist.

We are therefore interested in ways to integrate background information into clustering algorithms. We have previously had success with a modified version of COBWEB (Fisher, 1987) that uses background information about pairs of instances to constrain their cluster placement (Wagstaff & Cardie, 2000). K-means is another popular clustering algorithm that has been used in a variety of application domains, such as image segmentation (Marroquin & Girosi, 1993) and information retrieval (Bellot & El-Beze, 1999). Due to its widespread use, we believe that developing a modified version that can make use of background knowledge can be of significant use to the clustering community.

The major contributions of the current work are twofold. First, we have developed a k-means variant that can incorporate background knowledge in the form of instance-level constraints, thus demonstrating that this approach is not limited to a single clustering algorithm. In particular, we present our modifications to the k-means algorithm and demonstrate its performance on six data sets.

Second, while our previous work with COBWEB was restricted to testing with random constraints, we demonstrate the power of this method applied to a significant real-world problem (see Section 6).

In the next section, we provide some background on the k-means algorithm. Section 3 examines in detail the constraints we propose using and presents our modified k-means algorithm. Next, we describe our evaluation methods in Section 4. We present experimental results in Sections 5 and 6. Finally, Section 7 compares our work to related research and Section 8 summarizes our contributions.

2. K-means Clustering

K-means clustering (MacQueen, 1967) is a method commonly used to automatically partition a data set into k groups. It proceeds by selecting k initial cluster centers and then iteratively refining them as follows:

1. Each instance d_i is assigned to its closest cluster center.
2. Each cluster center C_j is updated to be the mean of its constituent instances.

The algorithm converges when there is no further change in the assignment of instances to clusters.

In this work, we initialize the clusters using instances chosen at random from the data set. The data sets we used are composed solely of either numeric features or symbolic features. For numeric features, we use a Euclidean distance metric; for symbolic features, we compute the Hamming distance.
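To make the loop above concrete, here is a minimal sketch in Python with NumPy, assuming numeric features and Euclidean distance (for symbolic features, the norm would be replaced by a Hamming distance). The function name, signature, and the choice to leave an empty cluster's center unchanged are our own illustration; the paper does not prescribe an implementation.

import numpy as np

def kmeans(data, k, rng=None, max_iter=100):
    """Plain k-means: random initial centers, Euclidean distance."""
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(rng)
    # Initialize centers with k instances chosen at random from the data set.
    centers = data[rng.choice(len(data), size=k, replace=False)]
    assignment = np.full(len(data), -1)
    for _ in range(max_iter):
        # Step 1: assign each instance to its closest cluster center.
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        new_assignment = dists.argmin(axis=1)
        # Converged: no further change in assignment of instances to clusters.
        if np.array_equal(new_assignment, assignment):
            break
        assignment = new_assignment
        # Step 2: update each center to the mean of its constituent instances.
        for j in range(k):
            members = data[assignment == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers, assignment

For example, kmeans(X, k=3, rng=0) on an (n, d) array X returns the final centers and one cluster index per instance.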
The final issue is how to choose k. For data sets where the optimal value of k is already known (i.e., all of the UCI data sets), we make use of it; for the real-world problem of finding lanes in GPS data, we use a wrapper search to locate the best value of k. More details can be found in Section 6.

3. Constrained K-means Clustering

We now proceed to a discussion of our modifications to the k-means algorithm. In this work, we focus on background knowledge that can be expressed as a set of instance-level constraints on the clustering process. After a discussion of the kind of constraints we are using, we describe the constrained k-means clustering algorithm.

3.1 The Constraints

In the context of partitioning algorithms, instance-level constraints are a useful way to express a priori knowledge about which instances should or should not be grouped together. Consequently, we consider two types of pairwise constraints:

- Must-link constraints specify that two instances have to be in the same cluster.
- Cannot-link constraints specify that two instances must not be placed in the same cluster.

The must-link constraints define a transitive binary relation over the instances. Consequently, when making use of a set of constraints (of both kinds), we take a transitive closure over the constraints.[1] The full set of derived constraints is then presented to the clustering algorithm. In general, constraints may be derived from partially labeled data or from background knowledge about the domain or data set. A concrete sketch of both procedures in Table 1 follows the table.

[1] Although only the must-link constraints are transitive, the closure is performed over both kinds because, e.g., if d_i must link to d_j, which cannot link to d_k, then we also know that d_i cannot link to d_k.

Table 1. Constrained K-means Algorithm

COP-KMEANS(data set D, must-link constraints Con= ⊆ D × D, cannot-link constraints Con≠ ⊆ D × D)

1. Let C_1, ..., C_k be the initial cluster centers.
2. For each point d_i in D, assign it to the closest cluster C_j such that VIOLATE-CONSTRAINTS(d_i, C_j, Con=, Con≠) is false. If no such cluster exists, fail (return {}).
3. For each cluster C_i, update its center by averaging all of the points d_j that have been assigned to it.
4. Iterate between (2) and (3) until convergence.
5. Return {C_1, ..., C_k}.

VIOLATE-CONSTRAINTS(data point d, cluster C, must-link constraints Con= ⊆ D × D, cannot-link constraints Con≠ ⊆ D × D)

1. For each (d, d=) ∈ Con=: if d= ∉ C, return true.
2. For each (d, d≠) ∈ Con≠: if d≠ ∈ C, return true.
3. Otherwise, return false.
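As one possible reading of Table 1 in code, the sketch below implements COP-KMEANS in Python with NumPy. The transitive closure over the constraints is computed with a union-find pass over the must-link pairs, after which cannot-link pairs are lifted to the resulting groups, in the spirit of footnote 1. The helper names (transitive_closure, violates, cop_kmeans) and the detail of checking constraints against points already assigned in the current pass are our own assumptions, not code from the paper.

import numpy as np

def transitive_closure(n, must_link, cannot_link):
    """Union-find over must-link pairs; lift cannot-link pairs to group roots."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in must_link:
        parent[find(a)] = find(b)
    cl = set()
    for a, b in cannot_link:
        ra, rb = find(a), find(b)
        if ra == rb:
            raise ValueError("inconsistent constraints: pair is both must- and cannot-linked")
        cl.add((ra, rb))
        cl.add((rb, ra))
    root = [find(i) for i in range(n)]
    return root, cl

def violates(i, j, assignment, root, cl):
    """VIOLATE-CONSTRAINTS: would placing point i in cluster j break a constraint?"""
    for other in range(len(assignment)):
        c = assignment[other]
        if c == -1 or other == i:
            continue  # skip unassigned points and i itself
        if root[other] == root[i] and c != j:
            return True  # must-link: a group-mate of i sits in a different cluster
        if (root[i], root[other]) in cl and c == j:
            return True  # cannot-link: a forbidden group already occupies cluster j
    return False

def cop_kmeans(data, k, must_link=(), cannot_link=(), rng=None, max_iter=100):
    """COP-KMEANS (Table 1): k-means constrained by instance-level constraints."""
    data = np.asarray(data, dtype=float)
    n = len(data)
    root, cl = transitive_closure(n, must_link, cannot_link)
    rng = np.random.default_rng(rng)
    centers = data[rng.choice(n, size=k, replace=False)]
    assignment = np.full(n, -1)
    for _ in range(max_iter):
        new_assignment = np.full(n, -1)
        for i in range(n):
            # Step 2: try clusters nearest-first; take the first legal one.
            order = np.argsort(np.linalg.norm(centers - data[i], axis=1))
            for j in order:
                if not violates(i, j, new_assignment, root, cl):
                    new_assignment[i] = j
                    break
            else:
                return None  # fail: no legal cluster exists for point i
        if np.array_equal(new_assignment, assignment):
            break  # converged
        assignment = new_assignment
        # Step 3: update each center by averaging its assigned points.
        for j in range(k):
            members = data[assignment == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers, assignment

Here cop_kmeans returns None where Table 1 returns {}: because the assignment step is greedy and order-dependent, a point may find no cluster it can legally join even when a consistent partition exists.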