KDD --- 2004 LecturesClustering and Similarity AssessmentMotivation: Why Clustering?Examples of Clustering ApplicationsRequirements of Clustering in Data MiningData Structures for ClusteringQuality Evaluation of ClustersChallenges in Obtaining Object Similarity MeasuresCase Study: Patient SimilarityGenerating a Global Similarity Measure from Single Variable Similarity MeasuresA Methodology to Obtain a Similarity MatrixInterval-scaled VariablesNormalization in [0,1]Other NormalizationsSimilarity Between ObjectsSimilarity Between Objects (Cont.)Similarity with respect to a Set of Binary VariablesSimilarity between Binary Variable SetsNominal VariablesOrdinal VariablesRatio-Scaled VariablesCase Study --- NormalizationCase Study --- Weight Selection and Similarity Measure SelectionMajor Clustering ApproachesPartitioning Algorithms: Basic ConceptThe K-Means Clustering MethodSlide 28Comments on the K-Means MethodPAM (Partitioning Around Medoids) (1987)PAM Clustering: Total swapping cost TCih=jCjihCLARANS (“Randomized” CLARA) (1994)Grid-Based Clustering MethodSTING: A Statistical Information Grid ApproachSTING: A Statistical Information Grid Approach (2)STING: A Statistical Information Grid Approach (3)CLIQUE (Clustering In QUEst)CLIQUE: The Major StepsPowerPoint PresentationStrength and Weakness of CLIQUEWork at UH related to Similarity Assessment and ClusteringSlide 42Slide 43Slide 44Research Goals Supervised ClusteringWhat is a good object distance function q for supervised similarity assessment?Idea: Coevolving Clusters and Similarity FunctionsIdea CR*-ApproachWeight Adjustment within a ClusterSummary: Problems and Challenges for ClusteringSummary Object Similarity & ClusteringReferences (1)References (2)1Han, Kamber, Eick: Object Similarity & Clustering for COSC 6340March 4+9: Introduction to KDDMarch 11: Association Rule MiningMarch 23: Similarity Assessment March 25: Clustering and UHDM2March 30: Data Warehouses and OLAPKDD --- 2004 Lectures2Han, Kamber, Eick: Object Similarity & Clustering for COSC 6340Clustering andSimilarity Assessment ©Jiawei Han and Micheline Kamberwith major Additions and Modifications by Ch. EickOrganization for COSC 6340:1. What is Clustering?2. Object Similarity Assessment3. K-means/medoid Clustering4. Grid-based Clustering5. Work at UH4Han, Kamber, Eick: Object Similarity & Clustering for COSC 6340Motivation: Why Clustering?Problem: Identify (a small number of) groups of similar objects in a given (large) set of object.Goals: Find representatives for homogeneous groups Data Compression Find “natural” clusters and describe their properties ”natural” Data TypesFind suitable and useful grouping ”useful” Data ClassesFind unusual data object Outlier Detection5Han, Kamber, Eick: Object Similarity & Clustering for COSC 6340Examples of Clustering ApplicationsPlant/Animal ClassificationBook Ordering Cloth SizesFraud Detection (Find outlier)6Han, Kamber, Eick: Object Similarity & Clustering for COSC 6340Requirements of Clustering in Data Mining ScalabilityAbility to deal with different types of attributesDiscovery of clusters with arbitrary shapeMinimal requirements for domain knowledge to determine input parametersAble to deal with noise and outliersInsensitive to order of input recordsHigh dimensionalityIncorporation of user-specified constraintsInterpretability and usability7Han, Kamber, Eick: Object Similarity & Clustering for COSC 6340Data Structures for ClusteringData matrix(n objects, p attributes)(Dis)Similarity matrix(nxn)npx...nfx...n1x...............ipx...ifx...i1x...............1px...1fx...11x0...)2,()1,(:::)2,3()...ndnd0dd(3,10d(2,1)08Han, Kamber, Eick: Object Similarity & Clustering for COSC 6340Quality Evaluation of ClustersDissimilarity/Similarity metric: Similarity is expressed in terms of a normalized distance function d, which is typically metric; typically: (oi, oj) = 1 - d (oi, oj) There is a separate “quality” function that measures the “goodness” of a cluster.The definitions of similarity functions are usually very different for interval-scaled, boolean, categorical, ordinal and ratio-scaled variables.Weights should be associated with different variables based on applications and data semantics.It is hard to define “similar enough” or “good enough” the answer is typically highly subjective.9Han, Kamber, Eick: Object Similarity & Clustering for COSC 6340Challenges in Obtaining Object Similarity MeasuresMany Types of VariablesInterval-scaled variablesBinary variables and nominal variablesOrdinal variablesRatio-scaled variablesObjects are characterized by variables belonging to different types (mixture of variables)10Han, Kamber, Eick: Object Similarity & Clustering for COSC 6340Case Study: Patient SimilarityThe following relation is given (with 10000 tuples):Patient(ssn, weight, height, cancer-sev, eye-color, age)Attribute Domainsssn: 9 digitsweight between 30 and 650; mweight=158 sweight=24.20height between 0.30 and 2.20 in meters; mheight=1.52 sheight=19.2cancer-sev: 4=serious 3=quite_serious 2=medium 1=minoreye-color: {brown, blue, green, grey }age: between 3 and 100; mage=45 sage=13.2Task: Define Patient Similarity11Han, Kamber, Eick: Object Similarity & Clustering for COSC 6340Generating a Global Similarity Measure from Single Variable Similarity Measures Assumption: A database may contain up to six types of variables: symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio.1. Standardize variable and associate similarity measure i with the standardized i-th variable and determine weight wi of the i-th variable.2. Create the following global (dis)similarity measure :ffjifjiwpfwoopfoo1*)(1,),(12Han, Kamber, Eick: Object Similarity & Clustering for COSC 6340A Methodology to Obtain a Similarity Matrix1. Understand Variables 2. Remove (non-relevant and redundant) Variables3. (Standardize and) Normalize Variables (typically using z-scores or variable values are transformed to numbers in [0,1])4. Associate (Dis)Similarity Measure df/f with each Variable5. Associate a Weight (measuring its importance) with each Variable6. Compute the (Dis)Similarity Matrix7. Apply Similarity-based Data Mining Technique
View Full Document