corrclus-k

Home> Academic Documents> corrclus-k

DOC PREVIEW

corrclus-k

This preview shows page 1-2-3-4-5 out of 15 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 15 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 15 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 15 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 15 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 15 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 15 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

IntroductionPrevious and related workOur resultsNP-hardness of MinDisAgree and MaxAgreePTAS for maximizing agreement with k clustersOverview.Performance analysis of MaxAg(k,) algorithm.PTAS for minimizing disagreements with k clustersIdea behind the algorithm.Performance analysis of the algorithm.Complexity on general graphsCorrelation Clustering with a Fixed Number of Clusters∗Ioannis Giotis†Venkatesan Guruswami∗‡AbstractWe continue the investigation of problems concerning correlation clustering or clustering withqualitative information, which is a clustering formulation that has been studied recently [5, 7,8, 3]. The basic setup here is that we are given as input a complete graph on n nodes (whichcorrespond to nodes to be clustered) whose edges are labeled + (for similar pairs of items) and −(for dissimilar pairs of items). Thus we have only as input qualitative information on similarityand no quantitative distance measure between items. The quality of a clustering is measured interms of its number of agreements, which is simply the number of edges it correctly classifies,that is the sum of number of − edges whose endpoints it places in different clusters plus thenumber of + edges both of whose endpoints it places within the same cluster.In this paper, we study the problem of finding clusterings that maximize the number ofagreements, and the complementary minimization version where we seek clusterings that min-imize the number of disagreements. We focus on the situation when the number of clusters isstipulated to be a small constant k. Our main result is that for every k, there is a polynomialtime approximation scheme for both maximizing agreements and minimizing disagreeme nts.(The problems are NP-hard for every k ≥ 2.) The main technical work is for the minimizationversion, as the PTAS for maximizing agreements follows along the lines of the property testerfor Max k-CUT from [13].In contrast, when the number of clusters is not specified, the problem of minimizing dis-agreements was shown to be APX-hard [7], even though the maximization version admits aPTAS.1 IntroductionIn this work, we investigate problems concerning an appealing formulation of clustering calledcorrelation clustering or clustering using qualitative information that has been studied recently inseveral works, including [6, 17, 5, 7, 8, 3, 2, 4]. The basic setup here is to cluster a collection ofn items given as input only qualitative information concerning similarity between pairs of items;specifically for every pair of items, we are given a (Boolean) label as to whether those items aresimilar or dissimilar. We are not provided with any quantitative information on how differe nt pairsof elements are, as is typically assumed in most clustering formulations. These formulations take asinput a metric on the items and then aim to optimize some function of the pairwise distances of theitems within and across clusters. The objective in our formulation is to produce a partitioning into∗A version of this paper will appear in the Proceedings of the 17th Annual ACM-SIAM Symposium on DiscreteAlgorithms (SODA), January 2006.†Department of Computer Science and Engineering, University of Washington, Seattle, WA 98195.{giotis,venkat}@cs.washington.edu‡Research supported in part by NSF Career Award CCF-0343672.1clusters that places similar objects in the same cluster and dissimilar objects in different clusters,to the e xtent possible.An obvious graph-theoretic formulation of the problem is the following: given a complete graphon n nodes with each edge labeled either “+” (similar) or “−” (dissimilar), find a partitioning ofthe vertices into clusters that agrees as much as possible with the edge labels. The maximizationversion, call it MaxAgree, seeks to maximize the number of agreements: the number of + edgesinside clusters plus the number of − edges across clusters. The minimization version, denotedMinDisAgree, aims to minimize the number of disagreements: the number of − edges withinclusters plus the number of + edges between clusters.In this paper, we study the above problems w hen the maximum number of clusters that weare allowed to use is stipulated to be a fixed constant k. We denote the variants of the aboveproblems that have this constraint as MaxAgree[k] and MinDisAgree[k]. We note that, unlikemost clustering formulations, the MaxAgree and MinDisAgree problems are not trivialized ifwe do not specify the number of clusters k as a parameter — whether the best clustering uses fewor many clusters is automatically dictated by the edge labels. However, the variants we study arealso interesting formulations, which are well-motivated in settings where the numb er of clustersmight be an external constraint that has to be met, even if there are “better” cluste rings (i.e., onewith more agreements) with a different number of clusters. Moreover, the existing algorithms for,say MinDisAgree, cannot be modified in any easy way to output a quality solution with at mostk clusters. Therefore k-clustering variants pose new, non-trivial challenges that require differenttechniques for their solutions.In the above description, we have assumed that every pair of items is labeled as + or − in theinput. In a more general variant, intended to capture situations where the c lass ifier providing theinput might be unable to label certain pairs of elements are similar or dissimilar, the input is anarbitrary graph G together with ± labels on its edges. We can again study the above problemsMaxAgree[k] (resp. MinDisAgree[k]) with the objective being to maximize (resp. minimize)the number of agreements (resp. disagreements) on edges of E (that is, we do not count non-edges of G as either agreements or disagreements). I n situations where we study this more generalvariant, we will refer to these problems as MaxAgree[k] on general graphs and MinDisAgree[k]on general graphs. When we don’t qualify with the phrase “on general graphs”, we will alwaysmean the problems on com plete graphs.Our main result in this paper is a polynomial time approximation scheme (PTAS) for MaxAgree[k]as well as MinDisAgree[k] for k ≥ 2. We now discuss prior work on these problems, followed bya m ore detailed description of results in this paper.1.1 Previous and related workThe above problem seems to have been first considered by Ben-Dor et al. [6] motivated by some com-putational biology questions. Later, Shamir et al. [17]


School:
Email:
New Password:
Confirm Password:

This preview shows page 1-2-3-4-5 out of 15 pages.

Please select your school