BU CS 565 - Clustering V


Slide titles (full deck): Outline; Cluster Validity; Clusters found in Random Data; Use the objective function F; Evaluating clusters; Evaluating clusterings using cluster cohesiveness; Cluster separation; Silhouette Coefficient; Final Comment on Cluster Validity; Examples; Classical methods – Hypothesis testing; Classical methods: testing independence; Testing independence using χ2; Problems with classical methods; Randomization methods; Randomization for testing independence; Randomization methods for other tasks; Columnwise randomization: no global view of the data; Questions; What is a good way of randomizing the data?; Swap randomization; Basic idea; Fixed margins; Example; Swap randomization and clustering

Clustering V

Outline
•Validating clustering results
•Randomization tests

Cluster Validity
•All clustering algorithms, given a set of points, output a clustering.
•How do we evaluate the “goodness” of the resulting clusters?
•Tricky, because “clusters are in the eye of the beholder”!
•Then why do we want to evaluate them?
–To compare clustering algorithms
–To compare two sets of clusters
–To compare two clusters
–To decide whether there is noise in the data

Clusters found in Random Data
[Figure: four scatter plots of the same uniformly random points on the unit square — Random Points, K-means, DBSCAN, Complete Link — each algorithm “finds” clusters even in random data.]

Use the objective function F
•Dataset X, objective function F
•Algorithms: A1, A2, …, Ak
•Question: which algorithm is best for this objective function?
•R1 = A1(X), R2 = A2(X), …, Rk = Ak(X)
•Compare F(R1), F(R2), …, F(Rk)

Evaluating clusters
•A function H computes the cohesiveness of a cluster (e.g., smaller values mean larger cohesiveness).
•Examples of cohesiveness?
•The goodness of a cluster c is H(c).
•c is better than c’ if H(c) < H(c’).

Evaluating clusterings using cluster cohesiveness
•For a clustering C consisting of k clusters c1, …, ck:
•H(C) = Φi H(ci)
•What is Φ? (e.g., sum, max, or average over the clusters)

Cluster separation
•A function S measures the separation between two clusters ci, cj.
•Ideas for S(ci, cj)?
•How can we measure the goodness of a clustering C = {c1, …, ck} using the separation function S?

Silhouette Coefficient
•The silhouette coefficient combines ideas of both cohesion and separation, for individual points as well as for clusters and clusterings.
•For an individual point i:
–a = average distance of i to the points in the same cluster
–b = min (average distance of i to the points in another cluster)
–silhouette coefficient of i: s = 1 – a/b if a < b
–Typically between 0 and 1; the closer to 1, the better.
•We can compute the average silhouette width for a cluster or for a whole clustering.

Final Comment on Cluster Validity
“The validation of clustering structures is the most difficult and frustrating part of cluster analysis.
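The silhouette definition above (a for cohesion, b for separation) can be turned into a few lines of code. This is a minimal sketch: the function name `silhouette` and the toy clusters are ours, not from the slides, and we use the general form s = (b − a)/max(a, b), which reduces to the slide's 1 − a/b when a < b.

```python
# Silhouette coefficient for a single point, following the slides' definition:
# a = average distance to points in the same cluster (cohesion),
# b = min over other clusters of the average distance to that cluster (separation),
# s = (b - a) / max(a, b)  (equals 1 - a/b when a < b, as on the slide).
from math import dist

def silhouette(point, own_cluster, other_clusters):
    # a: average distance to the *other* members of point's cluster
    others = [p for p in own_cluster if p != point]
    a = sum(dist(point, p) for p in others) / len(others)
    # b: smallest average distance to any other cluster
    b = min(sum(dist(point, q) for q in c) / len(c) for c in other_clusters)
    return (b - a) / max(a, b)

c1 = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]   # a tight cluster (toy data)
c2 = [(10.0, 10.0), (10.0, 11.0)]           # a distant cluster
s = silhouette(c1[0], c1, [c2])
print(round(s, 3))  # close to 1: the point sits well inside its own cluster
```

Averaging `s` over all points of a cluster (or of the whole clustering) gives the average silhouette width mentioned on the slide.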
Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage.”
— Jain and Dubes, Algorithms for Clustering Data

Assessing the significance of clustering (and other data mining) results
•Dataset X and algorithm A
•Beautiful result A(X)
•But what does it mean?
•How do we determine whether the result is really interesting or just due to chance?

Examples
•Pattern discovery: frequent itemsets or association rules.
•From data X we can find a collection of nice patterns.
•The significance of individual patterns is sometimes straightforward to test.
•What about the whole collection of patterns? Is it surprising to see such a collection?

Examples
•In clustering or mixture modeling we always get a result.
•How do we test whether the whole idea of components/clusters in the data is sound?
•Do clusters really exist in the data?

Classical methods – Hypothesis testing
•Example: two datasets of real numbers X and Y (|X| = |Y| = n).
•Question: are the means of X and Y (resp.
E(X), E(Y)) significantly different?
•Test statistic: t = (E(X) – E(Y))/s, where s is an estimate of the standard deviation.
•Under certain assumptions, the test statistic follows the t distribution with 2n – 2 degrees of freedom.

Classical methods – Hypothesis testing
•The result can be something like: “the difference in the means is significant at the level of 0.01.”
•That is, if we take two samples of size n, such a difference would occur by chance only in about 1 out of 100 trials.
•Problems:
–What if we are testing many hypotheses (multiple-hypothesis testing)?
–What if no closed form is available?

Classical methods: testing independence
•Are columns X and Y independent?
•Independence: Pr(X, Y) = Pr(X)·Pr(Y)
•Pr(X=1) = 8/11, Pr(X=0) = 3/11, Pr(Y=1) = 8/11, Pr(Y=0) = 3/11
•Observed joint probabilities: Pr(X=1, Y=1) = 6/11, Pr(X=1, Y=0) = 2/11, Pr(X=0, Y=1) = 2/11, Pr(X=0, Y=0) = 1/11
•Expected joint probabilities (under independence): Pr(X=1, Y=1) = 64/121, Pr(X=1, Y=0) = 24/121, Pr(X=0, Y=1) = 24/121, Pr(X=0, Y=0) = 9/121

Testing independence using χ2
•Are columns X and Y independent?

        Y=1  Y=0  ∑row
X=1      6    2    8
X=0      2    1    3
∑column  8    3   11

•χ2 = ∑_{x,y ∈ {0,1}} (O[X=x, Y=y] – E[X=x, Y=y])² / E[X=x, Y=y]
•So what?

Classical methods – Hypothesis testing
•The result can be something like: “the dependence between X and Y is significant at the level of 0.01.”
•That is, if we take two independent columns X and Y with the observed Pr(X=1) and Pr(Y=1) and n rows, such a deviation from independence would occur by chance only in about 1 out of 100 trials.

Problems with classical methods
•What if we are testing many hypotheses (multiple-hypothesis testing)?
•What if no closed form is available?

Randomization methods
•Goal: assess the significance of results.
–Could the result have occurred by chance?
•Methodology: create datasets that somehow reflect the characteristics of the true data.

Randomization methods
•Create randomized versions X1, X2, …, Xk of the data X.
•Run algorithm A on these, producing results A(X1), A(X2), …, A(Xk).
•Check whether the result A(X) on the real data is somehow
different from these.
•Empirical p-value: the fraction of randomized cases for which the result is (say) larger than A(X).
•If the empirical p-value is small, then there is something interesting in the data.

Randomization for testing independence
•Px = Pr(X=1) and Py = Pr(Y=1)
•Generate random instances of
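The preview cuts off here, but the idea on this last slide — compute χ2 on the observed 2×2 table, then compare it against randomized instances with the same marginals — can be sketched end to end. This is a minimal illustration, not the slides' code: the function name `chi2` and the use of column shuffling (a columnwise randomization that preserves Pr(X=1) and Pr(Y=1)) are our assumptions. The 11 rows reproduce the slide's table (counts 6, 2, 2, 1).

```python
# Empirical p-value for independence: shuffle column Y many times and count
# how often the randomized chi-squared statistic is at least as large as the
# observed one.  Shuffling preserves both column marginals.
import random

def chi2(xs, ys):
    """Chi-squared statistic of the 2x2 contingency table of two binary columns."""
    n = len(xs)
    stat = 0.0
    for xv in (0, 1):
        for yv in (0, 1):
            o = sum(1 for x, y in zip(xs, ys) if x == xv and y == yv)
            e = sum(1 for x in xs if x == xv) * sum(1 for y in ys if y == yv) / n
            stat += (o - e) ** 2 / e
    return stat

X = [1] * 8 + [0] * 3
Y = [1] * 6 + [0] * 2 + [1] * 2 + [0] * 1   # paired with X row by row: 6,2,2,1
observed = chi2(X, Y)                        # = 11*(6*1 - 2*2)^2/(8*3*8*3) ≈ 0.076

random.seed(0)
trials = 1000
hits = 0
for _ in range(trials):
    shuffled = random.sample(Y, len(Y))      # columnwise randomization of Y
    if chi2(X, shuffled) >= observed:
        hits += 1
p_empirical = hits / trials
print(round(observed, 3), p_empirical)
```

With this tiny table the observed counts are already as close to the independence expectation as the fixed margins allow, so every shuffle scores at least as high and the empirical p-value comes out as 1: nothing surprising, i.e., no evidence against independence.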


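Stepping back to the earlier “Classical methods” slide, the two-sample t statistic t = (E(X) − E(Y))/s with 2n − 2 degrees of freedom can also be sketched directly. The helper name `t_statistic`, the pooled-variance estimate of s, and the toy data are our assumptions for illustration; the slide does not specify how s is estimated.

```python
# Two-sample t statistic for equal-sized samples: t = (mean(x) - mean(y)) / s,
# where s is the standard error of the difference in means, estimated from the
# pooled sample variance.  Under normality, t follows a t distribution with
# 2n - 2 degrees of freedom.
from statistics import mean, variance
from math import sqrt

def t_statistic(x, y):
    n = len(x)                          # the slide assumes |X| = |Y| = n
    sp2 = (variance(x) + variance(y)) / 2   # pooled sample variance
    s = sqrt(sp2 * (2 / n))             # std. error of the difference in means
    return (mean(x) - mean(y)) / s      # compare against t with 2n-2 d.o.f.

x = [5.1, 4.8, 5.4, 5.0, 5.2]
y = [4.2, 4.5, 4.1, 4.4, 4.3]
print(round(t_statistic(x, y), 2))      # a large |t| suggests the means differ
```

A value this far in the tail of the t distribution with 8 degrees of freedom would be reported as significant at conventional levels; the multiple-testing and no-closed-form caveats from the slides still apply.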