Data Mining: Clustering

Cluster Analysis
- What is Cluster Analysis?
- Types of Data in Cluster Analysis
- A Categorization of Major Clustering Methods
- Partitioning Methods
- Hierarchical Methods
- Grid-Based Methods
- Model-Based Clustering Methods
- Outlier Analysis
- Summary

General Applications of Clustering
- Pattern recognition
- Spatial data analysis
  - create thematic maps in GIS by clustering feature spaces
  - detect spatial clusters and explain them in spatial data mining
- Image processing
- Economic science (especially market research)
- WWW
  - Document classification
  - Cluster Weblog data to discover groups of similar access patterns

Examples of Clustering Applications
- Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
- Land use: Identification of areas of similar land use in an earth observation database
- Insurance: Identifying groups of motor insurance policy holders with a high average claim cost
- City planning: Identifying groups of houses according to their house type, value, and geographical location
- Earthquake studies: Observed earthquake epicenters should be clustered along continental faults

What Is Good Clustering?
- A good clustering method will produce high-quality clusters with
  - high intra-class similarity
  - low inter-class similarity
- The quality of a clustering result depends on both the similarity measure used by the method and its implementation.
- The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.

Requirements of Clustering in Data Mining
- Scalability
- Ability to deal with different types of attributes
- Discovery of clusters with arbitrary shape
- Minimal requirements for domain knowledge to determine input parameters
- Able to deal with noise and outliers
- Insensitive to order of input records
- High dimensionality
- Incorporation of user-specified constraints
- Interpretability and usability

Data Structures
- Data matrix (two modes)

$$
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots & & \vdots & & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots & & \vdots & & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
$$

- Dissimilarity matrix (one mode)

$$
\begin{bmatrix}
0 & & & & \\
d(2,1) & 0 & & & \\
d(3,1) & d(3,2) & 0 & & \\
\vdots & \vdots & \vdots & & \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
$$

Measure the Quality of Clustering
- Dissimilarity/similarity metric: Similarity is expressed in terms of a distance function, which is typically metric: d(i, j)
- There is a separate “quality” function that measures the “goodness” of a cluster.
- The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal and ratio variables.
- Weights should be associated with different variables based on applications and data semantics.
- It is hard to define “similar enough” or “good enough”; the answer is typically highly subjective.

Type of data in clustering analysis
- Interval-scaled variables
- Binary variables
- Nominal, ordinal, and ratio variables
- Variables of mixed types

Interval-valued variables
- Standardize data
  - Calculate the mean absolute deviation:

$$s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)$$

    where

$$m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)$$

  - Calculate the standardized measurement (z-score):

$$z_{if} = \frac{x_{if} - m_f}{s_f}$$

- Using mean absolute deviation is more robust than using standard deviation.

Similarity and Dissimilarity Between Objects
- Distances are normally used to measure the similarity or dissimilarity between two data objects
- Some popular ones include the Minkowski distance:

$$d(i,j) = \sqrt[q]{|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q}$$

  where $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ are two p-dimensional data objects, and q is a positive integer
- If q = 1, d is Manhattan distance:

$$d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$$

Similarity and Dissimilarity Between Objects (Cont.)
- If q = 2, d is Euclidean distance:

$$d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$$

- Properties
  - $d(i,j) \ge 0$
  - $d(i,i) = 0$
  - $d(i,j) = d(j,i)$
  - $d(i,j) \le d(i,k) + d(k,j)$
- Also one can use weighted distance, parametric Pearson product moment correlation, or other dissimilarity measures.

Binary Variables
- A contingency table for binary data

                      Object j
                   1      0      sum
  Object i   1     a      b      a+b
             0     c      d      c+d
            sum    a+c    b+d    p

- Simple matching coefficient (invariant, if the binary variable is symmetric):

$$d(i,j) = \frac{b + c}{a + b + c + d}$$

- Jaccard coefficient (noninvariant if the binary variable is asymmetric):

$$d(i,j) = \frac{b + c}{a + b + c}$$

Dissimilarity between Binary Variables
- Example

  Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
  Jack   M        Y       N       P        N        N        N
  Mary   F        Y       N       P        N        P        N
  Jim    M        Y       P       N        N        N        N

  - gender is a symmetric attribute
  - the remaining attributes are asymmetric binary
  - let the values Y and P be set to 1, and the value N be set to 0

$$d(\text{jack}, \text{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33$$

$$d(\text{jack}, \text{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67$$

$$d(\text{jim}, \text{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75$$

Nominal Variables
- A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow,
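
The three dissimilarities in the Jack/Mary/Jim example can be checked with a short sketch. It assumes Python; the names `encode`, `asymmetric_binary_dissimilarity`, and `patients` are illustrative and do not come from the slides. Gender is left out because it is the symmetric attribute, and Y/P are mapped to 1 and N to 0 as the example specifies.

```python
def encode(values):
    """Map Y and P to 1 and N to 0, as in the slide example."""
    return [1 if v in ("Y", "P") else 0 for v in values]


def asymmetric_binary_dissimilarity(x, y):
    """Return (b + c) / (a + b + c) for two equal-length 0/1 vectors.

    a: both 1; b: x is 1 and y is 0; c: x is 0 and y is 1.
    Matches where both are 0 (count d) are ignored.
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    if a + b + c == 0:
        raise ValueError("undefined when neither object has a positive value")
    return (b + c) / (a + b + c)


# Attribute order: Fever, Cough, Test-1, Test-2, Test-3, Test-4
patients = {
    "Jack": encode("YNPNNN"),
    "Mary": encode("YNPNPN"),
    "Jim": encode("YPNNNN"),
}

for first, second in [("Jack", "Mary"), ("Jack", "Jim"), ("Jim", "Mary")]:
    d = asymmetric_binary_dissimilarity(patients[first], patients[second])
    print(f"d({first}, {second}) = {d:.2f}")
```

Run as written, this prints 0.33, 0.67, and 0.75, in agreement with the worked example.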
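
As a rough illustration of the interval-scaled standardization and the Minkowski distance described earlier, the following Python sketch implements both directly from the formulas. The function names `standardize` and `minkowski` and the sample values are assumptions made for illustration only; they are not taken from the slides.

```python
def standardize(column):
    """Standardize one variable f using the mean absolute deviation.

    m_f is the mean, s_f the mean absolute deviation,
    and each result is z_if = (x_if - m_f) / s_f.
    """
    n = len(column)
    if n == 0:
        raise ValueError("column must not be empty")
    m_f = sum(column) / n
    s_f = sum(abs(x - m_f) for x in column) / n
    if s_f == 0:
        raise ValueError("all values are identical, so s_f is zero")
    return [(x - m_f) / s_f for x in column]


def minkowski(i, j, q):
    """Minkowski distance between two p-dimensional objects i and j."""
    if len(i) != len(j):
        raise ValueError("objects must have the same number of variables")
    if q < 1:
        raise ValueError("q must be a positive integer")
    return sum(abs(a - b) ** q for a, b in zip(i, j)) ** (1.0 / q)


# Arbitrary sample values, chosen only to make the arithmetic easy to follow.
heights = [1.0, 2.0, 3.0, 4.0, 5.0]
z_scores = standardize(heights)  # m_f = 3.0, s_f = 1.2

origin, point = [0.0, 0.0], [3.0, 4.0]
manhattan = minkowski(origin, point, 1)  # q = 1: |3| + |4|
euclidean = minkowski(origin, point, 2)  # q = 2: sqrt(3^2 + 4^2)
print(z_scores, manhattan, euclidean)
```

Standardizing each variable before computing distances keeps a variable measured in large units from dominating the sum inside `minkowski`.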