BU CS 565 - Clustering III

Clustering III

Lecture outline
• Soft (model-based) clustering and the EM algorithm
• Clustering aggregation [A. Gionis, H. Mannila, P. Tsaparas: Clustering aggregation, ICDE 2004]
• Impossibility theorem for clustering [Jon Kleinberg: An impossibility theorem for clustering, NIPS 2002]

Expectation-maximization algorithm
• Iterative procedure to compute the Maximum Likelihood (ML) estimate, even in the presence of missing or hidden data
• EM consists of two steps:
– Expectation step: the (missing) data are estimated given the observed data and the current estimates of the model parameters
– Maximization step: the likelihood function is maximized under the assumption that the (missing) data are known

EM algorithm for mixture of Gaussians
• What is a mixture of K Gaussians?
  p(x) = \sum_{k=1}^{K} \pi_k F(x \mid \Theta_k), with \sum_{k=1}^{K} \pi_k = 1,
  where F(x \mid \Theta) is the Gaussian distribution with parameters \Theta = \{\mu, \Sigma\}

EM algorithm for mixture of Gaussians
• If all points x \in X are generated by a mixture of K Gaussians, then
  p(X) = \prod_{i=1}^{n} p(x_i) = \prod_{i=1}^{n} \sum_{k=1}^{K} \pi_k F(x_i \mid \Theta_k)
• Goal: find \pi_1, \ldots, \pi_K and \Theta_1, \ldots, \Theta_K such that p(X) is maximized
• Equivalently, maximize ln p(X):
  L(\Theta) = \sum_{i=1}^{n} \ln \sum_{k=1}^{K} \pi_k F(x_i \mid \Theta_k)

Mixtures of Gaussians -- notes
• Every point x_i is probabilistically assigned to (generated by) the k-th Gaussian
• The probability that point x_i is generated by the k-th Gaussian is
  w_{ik} = \frac{\pi_k F(x_i \mid \Theta_k)}{\sum_{j=1}^{K} \pi_j F(x_i \mid \Theta_j)}

Mixtures of Gaussians -- notes
• Every Gaussian (cluster) C_k has an effective number of points assigned to it
  N_k = \sum_{i=1}^{n} w_{ik}
• with mean
  \mu_k = \frac{1}{N_k} \sum_{i=1}^{n} w_{ik} x_i
• and covariance
  \Sigma_k = \frac{1}{N_k} \sum_{i=1}^{n} w_{ik} (x_i - \mu_k)(x_i - \mu_k)^T

EM for Gaussian Mixtures
• Initialize the means \mu_k, covariances \Sigma_k (\Theta_k = (\mu_k, \Sigma_k)) and mixing coefficients \pi_k, and evaluate the initial value of the log-likelihood
• Expectation step: evaluate the weights
  w_{ik} = \frac{\pi_k F(x_i \mid \Theta_k)}{\sum_{j=1}^{K} \pi_j F(x_i \mid \Theta_j)}

EM for Gaussian Mixtures
• Maximization step: re-evaluate the parameters
  \mu_k^{new} = \frac{1}{N_k} \sum_{i=1}^{n} w_{ik} x_i
  \Sigma_k^{new} = \frac{1}{N_k} \sum_{i=1}^{n} w_{ik} (x_i - \mu_k^{new})(x_i - \mu_k^{new})^T
  \pi_k^{new} = \frac{N_k}{N}
• Evaluate L(\Theta^{new}) and stop if converged
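As a minimal sketch of the E- and M-steps above (not part of the lecture): it uses NumPy and SciPy for the Gaussian density, the variable names (X, mu, sigma, pi, w), the iteration cap, and the small covariance regularization term are assumptions made for the example.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, tol=1e-6, seed=0):
    """Minimal EM for a mixture of K Gaussians with full covariances.
    X is an (n, d) array of data points."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Initialization: K random data points as means, identity covariances, uniform mixing weights
    mu = X[rng.choice(n, size=K, replace=False)].copy()
    sigma = np.stack([np.eye(d) for _ in range(K)])
    pi = np.full(K, 1.0 / K)
    prev_ll = -np.inf

    for _ in range(n_iter):
        # E-step: responsibilities w[i, k] = pi_k F(x_i | Theta_k) / sum_j pi_j F(x_i | Theta_j)
        dens = np.column_stack([
            pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=sigma[k])
            for k in range(K)
        ])
        w = dens / dens.sum(axis=1, keepdims=True)

        # Log-likelihood L(Theta) = sum_i ln sum_k pi_k F(x_i | Theta_k); stop if converged
        ll = np.log(dens.sum(axis=1)).sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll

        # M-step: N_k, then mu_k, Sigma_k, pi_k as in the update formulas above
        N = w.sum(axis=0)                      # N_k = sum_i w_ik
        mu = (w.T @ X) / N[:, None]            # mu_k = (1/N_k) sum_i w_ik x_i
        for k in range(K):
            diff = X - mu[k]
            # Sigma_k = (1/N_k) sum_i w_ik (x_i - mu_k)(x_i - mu_k)^T, plus a tiny ridge for stability
            sigma[k] = (w[:, k, None] * diff).T @ diff / N[k] + 1e-6 * np.eye(d)
        pi = N / n                             # pi_k = N_k / n

    return pi, mu, sigma, w
```

Usage would be along the lines of `pi, mu, sigma, w = em_gmm(X, K=3)`, after which `w[i]` holds the soft assignment of point i to the three clusters.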
Lecture outline
• Soft (model-based) clustering and the EM algorithm
• Clustering aggregation [A. Gionis, H. Mannila, P. Tsaparas: Clustering aggregation, ICDE 2004]
• Impossibility theorem for clustering [Jon Kleinberg: An impossibility theorem for clustering, NIPS 2002]

Clustering aggregation
• Many different clusterings exist for the same dataset!
– Different objective functions
– Different algorithms
– Different numbers of clusters
• Which clustering is the best?
– Aggregation: we do not need to decide; instead, we find a reconciliation between the different outputs

The clustering-aggregation problem
• Input
– n objects X = {x1, x2, ..., xn}
– m clusterings of the objects, C1, ..., Cm
• partition: a collection of disjoint sets that cover X
• Output
– a single partition C that is as close as possible to all input partitions
• How do we measure closeness of clusterings?
– disagreement distance

Disagreement distance
• For an object x and a clustering C, C(x) is the index of the set in the partition that contains x
• For two partitions P and Q and objects x, y in X define
  I_{P,Q}(x,y) = 1 if (P(x) = P(y) and Q(x) \neq Q(y)) or (P(x) \neq P(y) and Q(x) = Q(y)), and 0 otherwise
  D(P,Q) = \sum_{(x,y)} I_{P,Q}(x,y)
• If I_{P,Q}(x,y) = 1 we say that x, y create a disagreement between partitions P and Q
• Example:
  U    C  P
  x1   1  1
  x2   1  2
  x3   2  1
  x4   3  3
  x5   3  4
  D(P,Q) = 4

Metric property for the disagreement distance
• For every clustering C: D(C,C) = 0
• D(C,C') ≥ 0 for every pair of clusterings C, C'
• D(C,C') = D(C',C)
• Triangle inequality?
• It is sufficient to show that for each pair of points x, y in X:
  I_{x,y}(C1,C3) ≤ I_{x,y}(C1,C2) + I_{x,y}(C2,C3)
• I_{x,y} takes values 0/1; the triangle inequality can only be violated when
– I_{x,y}(C1,C3) = 1 and I_{x,y}(C1,C2) = 0 and I_{x,y}(C2,C3) = 0
– Is this possible?

Clustering aggregation
• Given partitions C1, ..., Cm, find C such that the aggregation cost
  D(C) = \sum_{i=1}^{m} D(C, C_i)
  is minimized
• Example:
  U    C1  C2  C3  C
  x1   1   1   1   1
  x2   1   2   2   2
  x3   2   1   1   1
  x4   2   2   2   2
  x5   3   3   3   3
  x6   3   4   3   3

Why clustering aggregation?
• Clustering categorical data
– each categorical attribute induces a partition of the objects, so the two problems are equivalent
  U    City         Profession  Nationality
  x1   New York     Doctor      U.S.
  x2   New York     Teacher     Canada
  x3   Boston       Doctor      U.S.
  x4   Boston       Teacher     Canada
  x5   Los Angeles  Lawyer      Mexican
  x6   Los Angeles  Actor       Mexican

Why clustering aggregation?
• Identify the correct number of clusters
– the optimization function does not require an explicit number of clusters
• Detect outliers
– outliers are defined as points for which there is no consensus

Why clustering aggregation?
• Improve the robustness of clustering algorithms
– different algorithms have different weaknesses
– combining them can produce a better result

Why clustering aggregation?
• Privacy-preserving clustering
– different companies have data for the same users; they can compute an aggregate clustering without sharing the actual data

Complexity of clustering aggregation
• The clustering aggregation problem is NP-hard
– the median partition problem [Barthelemy and Leclerc 1995]
• Look for heuristics and approximate solutions: ALG(I) ≤ c · OPT(I)

A simple 2-approximation algorithm
• The disagreement distance D(C,P) is a metric
• The algorithm BEST: select among the input clusterings the clustering C* that minimizes D(C*)
– this is a 2-approximate solution. Why?
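A minimal sketch of the disagreement distance, the aggregation cost, and the BEST rule above, assuming clusterings are represented as dicts mapping each object to its cluster label; the function names and the toy clusterings are mine, not from the paper.

```python
from itertools import combinations

def disagreement_distance(P, Q):
    """D(P, Q): number of pairs (x, y) that one partition puts together
    and the other separates. P and Q map each object to a cluster label."""
    return sum(
        1
        for x, y in combinations(P, 2)
        if (P[x] == P[y]) != (Q[x] == Q[y])
    )

def aggregation_cost(C, clusterings):
    """D(C) = sum_i D(C, C_i): total disagreement of candidate C with all inputs."""
    return sum(disagreement_distance(C, Ci) for Ci in clusterings)

def best(clusterings):
    """BEST: return the input clustering minimizing the aggregation cost (2-approximation)."""
    return min(clusterings, key=lambda C: aggregation_cost(C, clusterings))

# Toy usage with three hypothetical input clusterings over objects a..d
C1 = {"a": 1, "b": 1, "c": 2, "d": 2}
C2 = {"a": 1, "b": 2, "c": 2, "d": 2}
C3 = {"a": 1, "b": 1, "c": 2, "d": 3}
print(best([C1, C2, C3]))
```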
A 3-approximation algorithm
• The BALLS algorithm (a code sketch appears at the end of this preview):
– Select a point x and look at the set B of points within distance 1/2 of x
– If the average distance of x to B is less than 1/4, create the cluster B ∪ {x}
– Otherwise, create the singleton cluster {x}
– Repeat until all points are exhausted
• Theorem: the BALLS algorithm has a worst-case approximation factor of 3

Other algorithms
• AGGLO:
– Start with all points in singleton clusters
– Merge the two clusters with the smallest average inter-cluster edge weight
– Repeat until the average weight is more than 1/2
• LOCAL:
– Start with a random partition of the points
– Remove a point from a cluster and try to merge it into another cluster, or make it a singleton, to improve the cost of aggregation
– Repeat until no further improvements are possible

Clustering Robustness

Lecture outline
• Soft (model-based) clustering and the EM algorithm
• Clustering aggregation [A. Gionis, H. Mannila, P. Tsaparas: Clustering aggregation, ICDE 2004]
• Impossibility theorem for clustering [Jon Kleinberg: An impossibility theorem for clustering, NIPS 2002]

General form of impossibility results
• Define a set of simple axioms (properties) that a computational task should satisfy
• Prove that there does not exist an algorithm that can simultaneously satisfy all the axioms → impossibility

Computational task: clustering
• A clustering function operates on a set X
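Returning to the BALLS heuristic from the clustering-aggregation section, here is a minimal sketch. It assumes the pairwise distance d(x, y) is the fraction of input clusterings that place x and y in different clusters, as in the cited paper; the function and variable names and the scan order for picking the next center x are choices made for the example.

```python
def disagreement_fraction(x, y, clusterings):
    """d(x, y): fraction of input clusterings that put x and y in different clusters
    (assumed to be the distance the BALLS algorithm operates on)."""
    return sum(C[x] != C[y] for C in clusterings) / len(clusterings)

def balls(objects, clusterings):
    """Sketch of the BALLS 3-approximation: pick a point x, gather the points B
    within distance 1/2 of x, and open the cluster B ∪ {x} if their average
    distance to x is below 1/4; otherwise x becomes a singleton."""
    remaining = list(objects)
    clusters = []
    while remaining:
        x = remaining[0]                      # arbitrary choice of the next center
        ball = [y for y in remaining[1:]
                if disagreement_fraction(x, y, clusterings) <= 0.5]
        if ball:
            avg = sum(disagreement_fraction(x, y, clusterings) for y in ball) / len(ball)
        else:
            avg = float("inf")
        cluster = [x] + ball if avg < 0.25 else [x]
        clusters.append(cluster)
        remaining = [y for y in remaining if y not in cluster]
    return clusters

# Toy usage with two hypothetical input clusterings
C1 = {"a": 1, "b": 1, "c": 2}
C2 = {"a": 1, "b": 2, "c": 2}
print(balls(["a", "b", "c"], [C1, C2]))
```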

