Clustering and Object Similarity EvaluationExamples of Clustering ApplicationsRequirements of Clustering in Data MiningData Structures for ClusteringQuality Evaluation of ClustersChallenges in Obtaining Object Similarity MeasuresCase Study: Patient SimilarityGenerating a Global Similarity Measure from Single Variable Similarity MeasuresA Methodology to Obtain a Similarity MatrixInterval-scaled VariablesNormalization in [0,1]Other NormalizationsSimilarity Between ObjectsSimilarity Between Objects (Cont.)Similarity with respect to a Set of Binary VariablesSimilarity between Binary Variable SetsNominal VariablesOrdinal VariablesRatio-Scaled VariablesCase Study --- NormalizationCase Study --- Weight Selection and Similarity Measure SelectionMajor Clustering ApproachesPartitioning Algorithms: Basic ConceptThe K-Means Clustering MethodSlide 26Comments on the K-Means MethodHierarchical ClusteringPowerPoint PresentationSlide 30Slide 31Self-organizing feature maps (SOMs)Problems and ChallengesSummary Object Similarity & ClusteringReferences (1)References (2)1Han, Kamber, Eick: Object Similarity & ClusteringClustering andObject Similarity Evaluation ©Jiawei Han and Micheline Kamberwith Additions and Modifications by Ch. EickOrganization for COSC 6340:1. What is Clustering?2. Object Similarity Measurement3. K-Means Clustering Algorithm4. Hierarchical Clustering3Han, Kamber, Eick: Object Similarity & ClusteringExamples of Clustering ApplicationsMarketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programsLand use: Identification of areas of similar land use in an earth observation databaseInsurance: Identifying groups of motor insurance policy holders with a high average claim costCity-planning: Identifying groups of houses according to their house type, value, and geographical locationEarth-quake studies: Observed earth quake epicenters should be clustered along continent faults4Han, Kamber, Eick: Object Similarity & ClusteringRequirements of Clustering in Data Mining ScalabilityAbility to deal with different types of attributesDiscovery of clusters with arbitrary shapeMinimal requirements for domain knowledge to determine input parametersAble to deal with noise and outliersInsensitive to order of input recordsHigh dimensionalityIncorporation of user-specified constraintsInterpretability and usability5Han, Kamber, Eick: Object Similarity & ClusteringData Structures for ClusteringData matrix(n objects, p attributes)(Dis)Similarity matrix(nxn)npx...nfx...n1x...............ipx...ifx...i1x...............1px...1fx...11x0...)2,()1,(:::)2,3()...ndnd0dd(3,10d(2,1)06Han, Kamber, Eick: Object Similarity & ClusteringQuality Evaluation of ClustersDissimilarity/Similarity metric: Similarity is expressed in terms of a normalized distance function d, which is typically metric; typically: (oi, oj) = 1 - d (oi, oj) There is a separate “quality” function that measures the “goodness” of a cluster.The definitions of similarity functions are usually very different for interval-scaled, boolean, categorical, ordinal and ratio variables.Weights should be associated with different variables based on applications and data semantics.It is hard to define “similar enough” or “good enough” the answer is typically highly subjective.7Han, Kamber, Eick: Object Similarity & ClusteringChallenges in Obtaining Object Similarity MeasuresMany Types of VariablesInterval-scaled variablesBinary variables and nominal variablesOrdinal variablesRatio-scaled variablesObjects are characterized by variables belonging to different types (mixture of variables)8Han, Kamber, Eick: Object Similarity & ClusteringCase Study: Patient SimilarityThe following relation is given (with 10000 tuples):Patient(ssn, weight, height, cancer-sev, eye-color, age)Attribute Domainsssn: 9 digitsweight between 30 and 650; mweight=158 sweight=24.20height between 0.30 and 2.20 in meters; mheight=1.52 sheight=19.2cancer-sev: 4=serious 3=quite_serious 2=medium 1=minoreye-color: {brown, blue, green, grey }age: between 3 and 100; mage=45 sage=13.2Task: Define Patient Similarity9Han, Kamber, Eick: Object Similarity & ClusteringGenerating a Global Similarity Measure from Single Variable Similarity Measures Assumption: A database may contain up to six types of variables: symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio.1. Standardize variable and associate similarity measure i with the standardized i-th variable and determine weight wi of the i-th variable.2. Create the following global (dis)similarity measure :ffjifjiwpfwoopfoo1*)(1,),(10Han, Kamber, Eick: Object Similarity & ClusteringA Methodology to Obtain a Similarity Matrix1. Understand Variables 2. Remove (non-relevant and redundant) Variables3. (Standardize and) Normalize Variables (typically using z-scores or variable values are transformed to numbers in [0,1])4. Associate (Dis)Similarity Measure df/f with each Variable5. Associate a Weight (measuring its importance) with each Variable6. Compute the (Dis)Similarity Matrix7. Apply Similarity-based Data Mining Technique (e.g. Clustering, Nearest Neighbor, Multi-dimensional Scaling,…)11Han, Kamber, Eick: Object Similarity & ClusteringInterval-scaled VariablesStandardize data using z-scoresCalculate the mean absolute deviation:whereCalculate the standardized measurement (z-score)Using mean absolute deviation is more robust than using standard deviation .)...211nffffxx(xn m|)|...|||(|121 fnffffffmxmxmxns ffififsmx z12Han, Kamber, Eick: Object Similarity & ClusteringNormalization in [0,1]Problem: If non-normalized variables are used the maximum distance between two values can be greater than 1. Solution: Normalize interval-scaled variables usingwhere minf denotes the minimum value and maxf denotes the maximum value of the f-th attribute in the data set and is constant that is choses depending on the similarity measure (e.g. if Manhattan distance is used is chosen to be 1).)*)min/((max)min(fffififzx 13Han, Kamber, Eick: Object Similarity & ClusteringOther NormalizationsGoal:Limit the maximum distance to 1Start using a
View Full Document