UH COSC 6340 - Clustering and Object Similarity Evaluation - D2225903

Home> Schools> University of Houston> Computer Science (COSC) > COSC 6340> Clustering and Object Similarity Evaluation

DOC PREVIEW

UH COSC 6340 - Clustering and Object Similarity Evaluation

School name University of Houston

Course Cosc 6340- Database Systems

Pages 35

This preview shows page 1-2-16-17-18-34-35 out of 35 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 35 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 35 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 35 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 35 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 35 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 35 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 35 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 35 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

Clustering and Object Similarity EvaluationExamples of Clustering ApplicationsRequirements of Clustering in Data MiningData Structures for ClusteringQuality Evaluation of ClustersChallenges in Obtaining Object Similarity MeasuresCase Study: Patient SimilarityGenerating a Global Similarity Measure from Single Variable Similarity MeasuresA Methodology to Obtain a Similarity MatrixInterval-scaled VariablesNormalization in [0,1]Other NormalizationsSimilarity Between ObjectsSimilarity Between Objects (Cont.)Similarity with respect to a Set of Binary VariablesSimilarity between Binary Variable SetsNominal VariablesOrdinal VariablesRatio-Scaled VariablesCase Study --- NormalizationCase Study --- Weight Selection and Similarity Measure SelectionMajor Clustering ApproachesPartitioning Algorithms: Basic ConceptThe K-Means Clustering MethodSlide 26Comments on the K-Means MethodHierarchical ClusteringPowerPoint PresentationSlide 30Slide 31Self-organizing feature maps (SOMs)Problems and ChallengesSummary Object Similarity & ClusteringReferences (1)References (2)1Han, Kamber, Eick: Object Similarity & ClusteringClustering andObject Similarity Evaluation ©Jiawei Han and Micheline Kamberwith Additions and Modifications by Ch. EickOrganization for COSC 6340:1. What is Clustering?2. Object Similarity Measurement3. K-Means Clustering Algorithm4. Hierarchical Clustering3Han, Kamber, Eick: Object Similarity & ClusteringExamples of Clustering ApplicationsMarketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programsLand use: Identification of areas of similar land use in an earth observation databaseInsurance: Identifying groups of motor insurance policy holders with a high average claim costCity-planning: Identifying groups of houses according to their house type, value, and geographical locationEarth-quake studies: Observed earth quake epicenters should be clustered along continent faults4Han, Kamber, Eick: Object Similarity & ClusteringRequirements of Clustering in Data Mining ScalabilityAbility to deal with different types of attributesDiscovery of clusters with arbitrary shapeMinimal requirements for domain knowledge to determine input parametersAble to deal with noise and outliersInsensitive to order of input recordsHigh dimensionalityIncorporation of user-specified constraintsInterpretability and usability5Han, Kamber, Eick: Object Similarity & ClusteringData Structures for ClusteringData matrix(n objects, p attributes)(Dis)Similarity matrix(nxn)npx...nfx...n1x...............ipx...ifx...i1x...............1px...1fx...11x0...)2,()1,(:::)2,3()...ndnd0dd(3,10d(2,1)06Han, Kamber, Eick: Object Similarity & ClusteringQuality Evaluation of ClustersDissimilarity/Similarity metric: Similarity is expressed in terms of a normalized distance function d, which is typically metric; typically:  (oi, oj) = 1 - d (oi, oj) There is a separate “quality” function that measures the “goodness” of a cluster.The definitions of similarity functions are usually very different for interval-scaled, boolean, categorical, ordinal and ratio variables.Weights should be associated with different variables based on applications and data semantics.It is hard to define “similar enough” or “good enough”  the answer is typically highly subjective.7Han, Kamber, Eick: Object Similarity & ClusteringChallenges in Obtaining Object Similarity MeasuresMany Types of VariablesInterval-scaled variablesBinary variables and nominal variablesOrdinal variablesRatio-scaled variablesObjects are characterized by variables belonging to different types (mixture of variables)8Han, Kamber, Eick: Object Similarity & ClusteringCase Study: Patient SimilarityThe following relation is given (with 10000 tuples):Patient(ssn, weight, height, cancer-sev, eye-color, age)Attribute Domainsssn: 9 digitsweight between 30 and 650; mweight=158 sweight=24.20height between 0.30 and 2.20 in meters; mheight=1.52 sheight=19.2cancer-sev: 4=serious 3=quite_serious 2=medium 1=minoreye-color: {brown, blue, green, grey }age: between 3 and 100; mage=45 sage=13.2Task: Define Patient Similarity9Han, Kamber, Eick: Object Similarity & ClusteringGenerating a Global Similarity Measure from Single Variable Similarity Measures Assumption: A database may contain up to six types of variables: symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio.1. Standardize variable and associate similarity measure i with the standardized i-th variable and determine weight wi of the i-th variable.2. Create the following global (dis)similarity measure :ffjifjiwpfwoopfoo1*)(1,),(10Han, Kamber, Eick: Object Similarity & ClusteringA Methodology to Obtain a Similarity Matrix1. Understand Variables 2. Remove (non-relevant and redundant) Variables3. (Standardize and) Normalize Variables (typically using z-scores or variable values are transformed to numbers in [0,1])4. Associate (Dis)Similarity Measure df/f with each Variable5. Associate a Weight (measuring its importance) with each Variable6. Compute the (Dis)Similarity Matrix7. Apply Similarity-based Data Mining Technique (e.g. Clustering, Nearest Neighbor, Multi-dimensional Scaling,…)11Han, Kamber, Eick: Object Similarity & ClusteringInterval-scaled VariablesStandardize data using z-scoresCalculate the mean absolute deviation:whereCalculate the standardized measurement (z-score)Using mean absolute deviation is more robust than using standard deviation .)...211nffffxx(xn m|)|...|||(|121 fnffffffmxmxmxns ffififsmx z12Han, Kamber, Eick: Object Similarity & ClusteringNormalization in [0,1]Problem: If non-normalized variables are used the maximum distance between two values can be greater than 1. Solution: Normalize interval-scaled variables usingwhere minf denotes the minimum value and maxf denotes the maximum value of the f-th attribute in the data set and is constant that is choses depending on the similarity measure (e.g. if Manhattan distance is used  is chosen to be 1).)*)min/((max)min(fffififzx 13Han, Kamber, Eick: Object Similarity & ClusteringOther NormalizationsGoal:Limit the maximum distance to 1Start using a

View Full Document