CMSC828G Principles of Data Mining
Lecture #9

• Today's Reading:
– HMS, chapter 9
• Today's Lecture:
– Descriptive Modeling
– Clustering Algorithms
• Upcoming:
– hw2 available on web page this evening
– project proposals due 3/12

Descriptive Models
• a descriptive model presents the main features of the data, a global summary of the data
– cluster analysis
– density estimation

Cluster Analysis
• decomposing or partitioning a data set into groups so that
– the points in one group are similar to each other
– and are as different as possible from the points in other groups
[figure: an example partition; this isn't really clustering, it's just binning the objects…]

General Applications of Clustering
• Pattern Recognition
• Spatial Data Analysis
– create thematic maps in GIS by clustering feature spaces
– detect spatial clusters and explain them in spatial data mining
• Image Processing
• Economic Science (especially market research)
• WWW
– Document classification
– Cluster Weblog data to discover groups of similar access patterns

Examples of Clustering Applications
• Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
• Land use: identification of areas of similar land use in an earth observation database
• Insurance: identifying groups of motor insurance policy holders with a high average claim cost
• City planning: identifying groups of houses according to their house type, value, and geographical location
• Earthquake studies: observed earthquake epicenters should be clustered along continental faults

Example
• households: location, income, number of children, rent/own, crime rate, number of cars
• the appropriate clustering will depend on goals:
– minimize delivery time ⇒ cluster by location
– others?

Clustering
• decomposing or partitioning a data set into groups so that
– the points in one group are similar to each other
– and are as different as possible from the points in other groups
• a measure of distance is fundamental
• explicit representation:
– store D(x(i), x(j)) for each pair x(i), x(j)
– only feasible for small domains (see the sketch after this slide)
• measurement:
– distance computed from features
– we saw a number of different ways of doing this in ch. 2
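To make the explicit representation concrete, here is a minimal Python sketch (not from the lecture; the data values and names are illustrative) that stores every pairwise Euclidean distance D(x(i), x(j)) for a toy data set. The full n-by-n matrix is exactly why this is only feasible for small domains:

    import numpy as np

    # toy data: n = 4 points with 2 features each (illustrative values)
    X = np.array([[1.0, 2.0],
                  [1.5, 1.8],
                  [8.0, 8.0],
                  [9.0, 9.5]])

    # explicit representation: D[i, j] = Euclidean distance between x(i) and x(j)
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))
    print(D)  # symmetric n x n matrix with zeros on the diagonal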
Clustering
• huge body of work (aka unsupervised learning, segmentation, …)
• one of the major difficulties is in evaluating the success of a method
• validity depends on goals
• if the goal is to find 'interesting' clusters, this is rather difficult to quantify
• however, for our probabilistic methods, we will present some tools for validating our models

Choosing an Algorithm
• as we will see, different algorithms result in clusters of different 'shapes'
• the appropriate shape will depend on the application and should be considered when choosing an algorithm
• match method to objectives

Families of Clustering Algorithms
• partition-based methods
– e.g., K-means
• hierarchical clustering
– e.g., hierarchical agglomerative clustering
• probabilistic model-based clustering
– e.g., mixture models

Partition-based Clustering Algorithms
• given a set of n data points D = {x(1), …, x(n)}, partition the data into k clusters C = {C1, …, Ck} such that each x(i) is assigned to a unique Cj and Score(C, D) is minimized/maximized
• combinatorial optimization: searching for the allocation of n objects into k classes that maximizes the score function
• number of possible allocations ≈ k^n, since each of the n objects can be assigned to any of the k classes
• exhaustive search is intractable: typically, finding the optimal solution is infeasible
• resort to iterative improvement

Score Function
• score function:
– clusters compact ⇒ minimize within-cluster distance, wc(C)
– clusters should be far apart ⇒ maximize distance between clusters, bc(C)
• given a clustering C, assign cluster centers c_k
– if the points belong to a space where means make sense, we can use the centroid of the points in the cluster:
  c_k = \frac{1}{n_k} \sum_{x \in C_k} x
• wc(C) = sum-of-squares within-cluster distance:
  wc(C) = \sum_{k=1}^{K} wc(C_k) = \sum_{k=1}^{K} \sum_{x \in C_k} d(x, c_k)^2
• bc(C) = distance between clusters:
  bc(C) = \sum_{1 \le j < k \le K} d(c_j, c_k)^2
• Score(C, D) = f(wc(C), bc(C))

K-means
• idea (a minimal implementation sketch follows the Complexity slide):
– start with randomly chosen cluster centers
– assign points to give the greatest increase in score
– recompute cluster centers
– reassign points
– repeat until no changes

K-means example
[figure: eight points x(1)–x(8) in the plane; successive slides show initial centers c1, c2, c3, each point assigned to its nearest center, and the centers recomputed until the assignments stop changing]

K-means example #2
[figure: the same eight points x(1)–x(8) with a different random initialization of the centers c1, c2, c3]

Demos
• k-means applet
• image example
• another demo

Complexity
• does the algorithm terminate?
• does the algorithm converge to the optimal solution?
• time complexity of one iteration? O(nk): each of the n points is compared to each of the k centers
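As promised above, a minimal K-means sketch in Python (NumPy only; the function and variable names are mine, not the lecture's). It minimizes the within-cluster sum of squares wc(C) defined on the Score Function slide:

    import numpy as np

    def kmeans(X, k, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        # start with k randomly chosen data points as the initial centers
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            # assignment step: each point joins its nearest center
            sq_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
            labels = sq_dists.argmin(axis=1)
            # update step: recompute each center as its cluster's centroid
            # (an empty cluster keeps its old center)
            new_centers = np.array([X[labels == j].mean(axis=0)
                                    if np.any(labels == j) else centers[j]
                                    for j in range(k)])
            if np.allclose(new_centers, centers):
                break  # no center moved, so no point will be reassigned
            centers = new_centers
        # score the final clustering: wc(C), the within-cluster sum of squares
        sq_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = sq_dists.argmin(axis=1)
        wc = sq_dists[np.arange(len(X)), labels].sum()
        return labels, centers, wc

Note that this converges to a local optimum of wc(C), not necessarily the global one; different random initializations, as in the two figure examples, can end in different clusterings.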
Algorithm Variations
• recompute the centroid as soon as a point is reassigned
• allow merge and split of clusters
• methods for improving solution accuracy?
• in cases where means do not make sense
– k-medoids: use one of the data points as the center
– categorical data
• what if the data set is too large for the algorithm to be tractable?
– compress the data by replacing groups of objects with a 'condensed representation'

Binary Variables
• a contingency table for binary data:

                 Object j
                  1     0    sum
  Object i  1     a     b    a+b
            0     c     d    c+d
          sum    a+c   b+d    p

• simple matching coefficient (invariant, if the binary variable is symmetric):
  d(i, j) = \frac{b + c}{a + b + c + d}
• Jaccard coefficient (noninvariant, if the binary variable is asymmetric):
  d(i, j) = \frac{b + c}{a + b + c}

Dissimilarity between Binary Variables
• example (a sketch that reproduces these numbers appears at the end of the notes):
– attributes are asymmetric binary
– let the values Y and P be set to 1, and the value N be set to 0

  Name  Fever  Cough  Test-1  Test-2  Test-3  Test-4
  Jack    Y      N      P       N       N       N
  Mary    Y      N      P       N       P       N
  Jim     Y      P      N       N       N       N

– using the Jaccard coefficient:
  d(jack, mary) = \frac{0 + 1}{2 + 0 + 1} = 0.33
  d(jack, jim) = \frac{1 + 1}{1 + 1 + 1} = 0.67
  d(jim, mary) = \frac{1 + 2}{1 + 1 + 2} = 0.75

Nominal Variables
• a generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green
• method 1: simple matching
– m: # of matches, p: total # of variables
  d(i, j) = \frac{p - m}{p}
• method 2: use a large number of binary variables
– create a new binary variable for each of the M nominal states

Hierarchical Clustering
• rather than deciding the number of clusters K at the start, build a hierarchy of nested clusters
• either gradually
– merge points (agglomerative)
– divide clusters (divisive)
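A small Python sketch (mine, not the lecture's) that reproduces the Jaccard dissimilarities from the patient example, using the stated Y/P → 1, N → 0 encoding:

    import numpy as np

    def jaccard_dissimilarity(x, y):
        # a: 1-1 matches, b: x=1 & y=0, c: x=0 & y=1;
        # d, the count of 0-0 matches, is ignored for asymmetric binary variables
        a = np.sum((x == 1) & (y == 1))
        b = np.sum((x == 1) & (y == 0))
        c = np.sum((x == 0) & (y == 1))
        return (b + c) / (a + b + c)

    # Fever, Cough, Test-1, Test-2, Test-3, Test-4 with Y/P -> 1, N -> 0
    jack = np.array([1, 0, 1, 0, 0, 0])
    mary = np.array([1, 0, 1, 0, 1, 0])
    jim = np.array([1, 1, 0, 0, 0, 0])

    print(jaccard_dissimilarity(jack, mary))  # 0.333...
    print(jaccard_dissimilarity(jack, jim))   # 0.666...
    print(jaccard_dissimilarity(jim, mary))   # 0.75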
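Since the agglomerative approach merges clusters bottom-up, a short sketch may help fix the idea. This one (mine, not the lecture's) uses single-linkage distance, i.e., the minimum pairwise distance between two clusters, which is only one of several possible merge criteria:

    import numpy as np

    def single_linkage(X, k):
        # start with every point in its own cluster
        clusters = [[i] for i in range(len(X))]
        D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
        while len(clusters) > k:
            # find and merge the pair of clusters with the smallest
            # single-link (minimum pairwise) distance
            best, best_d = (0, 1), np.inf
            for a in range(len(clusters)):
                for b in range(a + 1, len(clusters)):
                    d = min(D[i, j] for i in clusters[a] for j in clusters[b])
                    if d < best_d:
                        best_d, best = d, (a, b)
            a, b = best
            clusters[a].extend(clusters[b])
            del clusters[b]
        return clusters

    # example: two tight pairs and one outlier collapse correctly for k = 3
    X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [9.0, 0.0]])
    print(single_linkage(X, 3))  # [[0, 1], [2, 3], [4]]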