UH COSC 6340 - Clustering and Object Similarity Evaluation



Clustering and Object Similarity Evaluation
Examples of Clustering Applications
Requirements of Clustering in Data Mining
Data Structures for Clustering
Quality Evaluation of Clusters
Challenges in Obtaining Object Similarity Measures
Case Study: Patient Similarity
Generating a Global Similarity Measure from Single Variable Similarity Measures
A Methodology to Obtain a Similarity Matrix
Interval-scaled Variables
Normalization in [0,1]
Other Normalizations
Similarity Between Objects
Similarity Between Objects (Cont.)
Similarity with respect to a Set of Binary Variables
Similarity between Binary Variable Sets
Nominal Variables
Ordinal Variables
Ratio-Scaled Variables
Case Study: Normalization
Case Study: Weight Selection and Similarity Measure Selection
Major Clustering Approaches
Partitioning Algorithms: Basic Concept
The K-Means Clustering Method
Slide 26
Comments on the K-Means Method
Hierarchical Clustering
PowerPoint Presentation
Slide 30
Slide 31
Self-organizing feature maps (SOMs)
Problems and Challenges
Summary Object Similarity & Clustering
References (1)
References (2)

Han, Kamber, Eick: Object Similarity & Clustering

Clustering and Object Similarity Evaluation
© Jiawei Han and Micheline Kamber, with Additions and Modifications by Ch. Eick

Organization for COSC 6340:
1. What is Clustering?
2. Object Similarity Measurement
3. K-Means Clustering Algorithm
4. Hierarchical Clustering

Examples of Clustering Applications
- Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
- Land use: Identification of areas of similar land use in an earth observation database
- Insurance: Identifying groups of motor insurance policy holders with a high average claim cost
- City planning: Identifying groups of houses according to their house type, value, and geographical location
- Earthquake studies: Observed earthquake epicenters should be clustered along continental faults

Requirements of Clustering in Data Mining
- Scalability
- Ability to deal with different types of attributes
- Discovery of clusters with arbitrary shape
- Minimal requirements for domain knowledge to determine input parameters
- Ability to deal with noise and outliers
- Insensitivity to the order of input records
- High dimensionality
- Incorporation of user-specified constraints
- Interpretability and usability

Data Structures for Clustering
- Data matrix (n objects, p attributes):

$$
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
$$

- (Dis)Similarity matrix (n x n):

$$
\begin{bmatrix}
0      &        &        &        &   \\
d(2,1) & 0      &        &        &   \\
d(3,1) & d(3,2) & 0      &        &   \\
\vdots & \vdots & \vdots &        &   \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
$$

Quality Evaluation of Clusters
- Dissimilarity/Similarity metric: Similarity is expressed in terms of a normalized distance function d, which is typically a metric; typically:

$$
\delta(o_i, o_j) = 1 - d(o_i, o_j)
$$

- There is a separate “quality” function that measures the “goodness” of a cluster.
- The definitions of similarity functions are usually very different for interval-scaled, boolean, categorical, ordinal and ratio variables.
- Weights should be associated with different variables based on applications and data semantics.
- It is hard to define “similar enough” or “good enough”; the answer is typically highly subjective.

Challenges in Obtaining Object Similarity Measures
- Many types of variables:
  - Interval-scaled variables
  - Binary variables and nominal variables
  - Ordinal variables
  - Ratio-scaled variables
- Objects are characterized by variables belonging to different types (mixture of variables)

Case Study: Patient Similarity
The following relation is given (with 10000 tuples):
Patient(ssn, weight, height, cancer-sev, eye-color, age)
Attribute Domains:
- ssn: 9 digits
- weight: between 30 and 650; m_weight = 158, s_weight = 24.20
- height: between 0.30 and 2.20 in meters; m_height = 1.52, s_height = 19.2
- cancer-sev: 4 = serious, 3 = quite_serious, 2 = medium, 1 = minor
- eye-color: {brown, blue, green, grey}
- age: between 3 and 100; m_age = 45, s_age = 13.2
Task: Define Patient Similarity

Generating a Global Similarity Measure from Single Variable Similarity Measures
Assumption: A database may contain up to six types of variables: symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio.
1. Standardize each variable, associate a similarity measure $\delta_i$ with the standardized i-th variable, and determine the weight $w_i$ of the i-th variable.
2. Create the following global (dis)similarity measure $\delta$:

$$
\delta(o_i, o_j) = \frac{\sum_{f=1}^{p} w_f \cdot \delta_f(o_i, o_j)}{\sum_{f=1}^{p} w_f}
$$

A Methodology to Obtain a Similarity Matrix
1. Understand Variables
2. Remove (non-relevant and redundant) Variables
3. (Standardize and) Normalize Variables (typically using z-scores, or by transforming variable values to numbers in [0,1])
4. Associate a (Dis)Similarity Measure $d_f$ / $\delta_f$ with each Variable
5. Associate a Weight (measuring its importance) with each Variable
6. Compute the (Dis)Similarity Matrix
7. Apply a Similarity-based Data Mining Technique (e.g. Clustering, Nearest Neighbor, Multi-dimensional Scaling, …)

Interval-scaled Variables
Standardize data using z-scores:
- Calculate the mean absolute deviation:

$$
s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)
$$

  where

$$
m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)
$$

- Calculate the standardized measurement (z-score):

$$
z_{if} = \frac{x_{if} - m_f}{s_f}
$$

- Using the mean absolute deviation is more robust than using the standard deviation.

Normalization in [0,1]
Problem: If non-normalized variables are used, the maximum distance between two values can be greater than 1.
Solution: Normalize interval-scaled variables using

$$
z_{if} = \frac{x_{if} - \min_f}{(\max_f - \min_f) \cdot \sigma}
$$

where $\min_f$ denotes the minimum value and $\max_f$ denotes the maximum value of the f-th attribute in the data set, and $\sigma$ is a constant that is chosen depending on the similarity measure (e.g. if Manhattan distance is used, $\sigma$ is chosen to be 1).

Other Normalizations
Goal: Limit the maximum distance to 1
Start using a
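
The z-score, [0,1] normalization, and weighted global similarity formulas from the slides above can be sketched in Python as follows. This is a minimal illustration, not code from the course: the function names, the per-variable measures used in the example (1 minus the absolute difference for a normalized numeric value, exact match for a nominal value), and the sample data are all assumptions made for this sketch.

```python
from typing import Callable, Sequence


def mean_abs_dev_zscores(values: Sequence[float]) -> list[float]:
    """z-scores using the mean absolute deviation s_f instead of the standard deviation."""
    n = len(values)
    m_f = sum(values) / n                          # mean m_f
    s_f = sum(abs(x - m_f) for x in values) / n    # mean absolute deviation s_f
    if s_f == 0:
        raise ValueError("all values are identical; z-scores are undefined")
    return [(x - m_f) / s_f for x in values]


def normalize_01(values: Sequence[float], sigma: float = 1.0) -> list[float]:
    """Min-max normalization; sigma is the measure-dependent constant (1 for Manhattan distance)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; normalization is undefined")
    return [(x - lo) / ((hi - lo) * sigma) for x in values]


def global_similarity(
    o_i: Sequence[object],
    o_j: Sequence[object],
    measures: Sequence[Callable[[object, object], float]],
    weights: Sequence[float],
) -> float:
    """Weighted average of the per-variable similarities delta_f(o_i, o_j)."""
    weighted = sum(w * sim(a, b) for w, sim, a, b in zip(weights, measures, o_i, o_j))
    return weighted / sum(weights)


# Hypothetical per-variable measures (assumptions, not defined on the slides shown here).
def numeric_sim(a, b):
    return 1.0 - abs(a - b)        # expects values already normalized to [0,1]


def nominal_sim(a, b):
    return 1.0 if a == b else 0.0  # exact match on a nominal value


if __name__ == "__main__":
    print(mean_abs_dev_zscores([10, 20, 30, 40]))  # [-1.5, -0.5, 0.5, 1.5]
    # Two made-up objects: (normalized numeric value, eye color), weights 3 and 1.
    print(global_similarity((0.25, "brown"), (0.75, "blue"),
                            [numeric_sim, nominal_sim], [3, 1]))  # 0.375
```

In the second call the numeric variable contributes a similarity of 0.5 with weight 3 and the nominal variable contributes 0 with weight 1, so the global similarity is (3 * 0.5 + 1 * 0) / 4 = 0.375.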

