Data Mining: Concepts and Techniques
Slides for Textbook
Chapter 8

©Jiawei Han and Micheline Kamber
Intelligent Database Systems Research Lab
School of Computing Science
Simon Fraser University, Canada
http://www.cs.sfu.ca
January 15, 2019

Chapter 8. Cluster Analysis
- What is Cluster Analysis?
- Types of Data in Cluster Analysis
- A Categorization of Major Clustering Methods
- Partitioning Methods
- Hierarchical Methods
- Density-Based Methods
- Grid-Based Methods
- Model-Based Clustering Methods
- Outlier Analysis
- Summary

General Applications of Clustering
- Pattern Recognition
- Spatial Data Analysis
  - create thematic maps in GIS by clustering feature spaces
  - detect spatial clusters and explain them in spatial data mining
- Image Processing
- Economic Science (especially market research)
- WWW
  - Document classification
  - Cluster Weblog data to discover groups of similar access patterns

Examples of Clustering Applications
- Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
- Land use: Identification of areas of similar land use in an earth observation database
- Insurance: Identifying groups of motor insurance policy holders
  with a high average claim cost
- City planning: Identifying groups of houses according to their house type, value, and geographical location
- Earthquake studies: Observed earthquake epicenters should be clustered along continental faults

What Is Good Clustering?
- A good clustering method will produce high-quality clusters with
  - high intra-class similarity
  - low inter-class similarity
- The quality of a clustering result depends on both the similarity measure used by the method and its implementation.
- The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.

Requirements of Clustering in Data Mining
- Scalability
- Ability to deal with different types of attributes
- Discovery of clusters with arbitrary shape
- Minimal requirements for domain knowledge to determine input parameters
- Able to deal with noise and outliers
- Insensitive to order of input records
- High dimensionality
- Incorporation of user-specified constraints
- Interpretability and usability
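
The "high intra-class similarity, low inter-class similarity" criterion from What Is Good Clustering? can be illustrated with a small sketch. It is not from the slides: it assumes Euclidean distance as the dissimilarity (so high similarity means small distance), and the names `euclidean`, `mean_intra_inter`, and the toy points are hypothetical.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length points (assumed measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_intra_inter(points, labels, dist=euclidean):
    """Return (mean within-cluster distance, mean between-cluster distance)."""
    intra, inter = [], []
    for i, j in combinations(range(len(points)), 2):
        d = dist(points[i], points[j])
        # A pair counts as "intra" when both objects share a cluster label.
        (intra if labels[i] == labels[j] else inter).append(d)

    def mean(values):
        return sum(values) / len(values) if values else float("nan")

    return mean(intra), mean(inter)


# Toy data: two tight groups that lie far apart.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good_labels = [0, 0, 1, 1]  # groups nearby points together
bad_labels = [0, 1, 0, 1]   # groups distant points together

good_intra, good_inter = mean_intra_inter(points, good_labels)
bad_intra, bad_inter = mean_intra_inter(points, bad_labels)
```

Under this assumed measure, the first labeling has a small within-cluster distance relative to its between-cluster distance, while the second labeling reverses that relationship.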
Data Structures
- Data matrix (two modes)

\[
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
\]

- Dissimilarity matrix (one mode)

\[
\begin{bmatrix}
0      &        &        &        &   \\
d(2,1) & 0      &        &        &   \\
d(3,1) & d(3,2) & 0      &        &   \\
\vdots & \vdots & \vdots &        &   \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
\]

Measure the Quality of Clustering
- Dissimilarity/Similarity metric: Similarity is expressed in terms of a distance function, which is typically metric: d(i, j)
- There is a separate “quality” function that measures the “goodness” of a cluster.
- The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal and ratio variables.
- Weights should be associated with different variables based on applications and data semantics.
- It is hard to define
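
As a rough illustration of the two structures on the Data Structures slide, the sketch below turns an n-by-p data matrix into a lower-triangular dissimilarity matrix. It assumes a weighted Euclidean distance, which the slides do not prescribe; `weighted_euclidean`, `dissimilarity_matrix`, and the sample data are hypothetical names and values. The code uses 0-based indices, whereas the slide writes d(2,1) with 1-based indices.

```python
import math


def weighted_euclidean(x, y, weights):
    """Weighted Euclidean distance d(i, j); the weights are an assumption."""
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y)))


def dissimilarity_matrix(data, weights=None):
    """Build the lower triangle: row i holds d(i, 0), ..., d(i, i) = 0."""
    p = len(data[0])  # number of variables (columns of the data matrix)
    if weights is None:
        weights = [1.0] * p  # unweighted case
    return [
        [weighted_euclidean(data[i], data[j], weights) for j in range(i + 1)]
        for i in range(len(data))
    ]


# Data matrix: n = 3 objects (rows), p = 2 variables (columns).
data = [
    [1.0, 2.0],
    [4.0, 6.0],
    [1.0, 10.0],
]

D = dissimilarity_matrix(data)
# Weighting the second variable to zero ignores it entirely.
D_first_only = dissimilarity_matrix(data, weights=[1.0, 0.0])
```

Only the lower triangle is stored because a metric d is symmetric and d(i, i) = 0, which matches the triangular layout shown on the slide. The `weights` argument reflects the slide's remark that variables may carry different weights depending on the application.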