Chapter 5: ClusteringSearching for groupsGroup similar points togetherAn IllustrationExamples of Clustering ApplicationsConcepts of ClusteringClusteringDissimilarity/Distance MeasureTypes of data in clustering analysisInterval-valued variablesSimilarity Between ObjectsSimilarity Between Objects (Cont.)Binary VariablesDissimilarity of Binary VariablesNominal VariablesOrdinal VariablesRatio-Scaled VariablesVariables of Mixed TypesMajor Clustering TechniquesPartitioning Algorithms: Basic ConceptThe K-Means ClusteringExampleComments on K-MeansVariations of the K-Means Methodk-Medoids clustering methodPartition Around MedoidsComments on PAMCLARA: Clustering Large ApplicationsHierarchical ClusteringAgglomerative ClusteringSlide 31Divisive ClusteringMore on Hierarchical MethodsSummaryOther Data Mining MethodsSequence analysisSequence analysis (cont …)Discovering holes in dataData and pattern visualizationScaling up data mining algorithmsUIC - CS 594 1Chapter 5: ClusteringUIC - CS 594 2Searching for groupsClustering is unsupervised or undirected.Unlike classification, in clustering, no pre-classified data.Search for groups or clusters of data points (records) that are similar to one another.Similar points may mean: similar customers, products, that will behave in similar ways.UIC - CS 594 3Group similar points togetherGroup points into classes using some distance measures.Within-cluster distance, and between cluster distanceApplications:As a stand-alone tool to get insight into data distribution As a preprocessing step for other algorithmsUIC - CS 594 4An IllustrationUIC - CS 594 5Examples of Clustering ApplicationsMarketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programsInsurance: Identifying groups of motor insurance policy holders with some interesting characteristics. City-planning: Identifying groups of houses according to their house type, value, and geographical locationUIC - CS 594 6Concepts of ClusteringClustersDifferent ways of representing clustersDivision with boundariesSpheresProbabilisticDendrograms…1 2 3I1I2…In0.5 0.2 0.3UIC - CS 594 7ClusteringClustering qualityInter-clusters distance maximizedIntra-clusters distance minimizedThe quality of a clustering result depends on both the similarity measure used by the method and its application.The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns Clustering vs. classificationWhich one is more difficult? Why?There are a huge number of clustering techniques.UIC - CS 594 8Dissimilarity/Distance MeasureDissimilarity/Similarity metric: Similarity is expressed in terms of a distance function, which is typically metric: d (i, j)The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal and ratio variables.Weights should be associated with different variables based on applications and data semantics.It is hard to define “similar enough” or “good enough”. The answer is typically highly subjective.UIC - CS 594 9Types of data in clustering analysisInterval-scaled variablesBinary variablesNominal, ordinal, and ratio variablesVariables of mixed typesUIC - CS 594 10Interval-valued variablesContinuous measurements in a roughly linear scale, e.g., weight, height, temperature, etcStandardize data (depending on applications)Calculate the mean absolute deviation: whereCalculate the standardized measurement (z-score).)...211nffffxx(xn m |)|...|||(|121 fnffffffmxmxmxns ffififsmx zUIC - CS 594 11Similarity Between ObjectsDistance: Measure the similarity or dissimilarity between two data objectsSome popular ones include: Minkowski distance:where (xi1, xi2, …, xip) and (xj1, xj2, …, xjp) are two p-dimensional data objects, and q is a positive integerIf q = 1, d is Manhattan distanceqqppqqjxixjxixjxixjid )||...|||(|),(2211||...||||),(2211 ppjxixjxixjxixjid UIC - CS 594 12Similarity Between Objects (Cont.)If q = 2, d is Euclidean distance:Propertiesd(i,j) 0d(i,i) = 0d(i,j) = d(j,i)d(i,j) d(i,k) + d(k,j)Also, one can use weighted distance, and many other similarity/distance measures. )||...|||(|),(2222211 ppjxixjxixjxixjid UIC - CS 594 13Binary VariablesA contingency table for binary dataSimple matching coefficient (invariant, if the binary variable is symmetric):Jaccard coefficient (noninvariant if the binary variable is asymmetric): dcbacb jid),(pdbcasumdcdcbabasum0101cbacb jid),(Object iObject jUIC - CS 594 14Dissimilarity of Binary VariablesExamplegender is a symmetric attribute (not used below)the remaining attributes are asymmetric attributeslet the values Y and P be set to 1, and the value N be set to 0Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4Jack M Y N P N N NMary F Y N P N P NJim M Y P N N N N75.021121),(67.011111),(33.010210),(maryjimdjimja ckdmaryjackdUIC - CS 594 15Nominal VariablesA generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green, etcMethod 1: Simple matchingm: # of matches, p: total # of variablesMethod 2: use a large number of binary variablescreating a new binary variable for each of the M nominal statespmpjid),(UIC - CS 594 16Ordinal VariablesAn ordinal variable can be discrete or continuousOrder is important, e.g., rankCan be treated like interval-scaled (f is a variable)replace xif by their ranks map the range of each variable onto [0, 1] by replacing i-th object in the f-th variable bycompute the dissimilarity using methods for interval-scaled variables11fififMrz},...,1{fifMr UIC - CS 594 17Ratio-Scaled VariablesRatio-scaled variable: a measurement on a nonlinear scale, approximately at exponential scale, such as AeBt or Ae-Bt, e.g., growth of a bacteria population. Methods:treat them like interval-scaled variables—not a good idea! (why?—the scale can be distorted)apply logarithmic transformationyif = log(xif)treat them as continuous ordinal data and then treat their ranks
View Full Document