UH COSC 6340 - COSC 6340 Clustering and Similarity Assessment - D388384

Home> Schools> University of Houston> Computer Science (COSC) > COSC 6340> COSC 6340 Clustering and Similarity Assessment

DOC PREVIEW

UH COSC 6340 - COSC 6340 Clustering and Similarity Assessment

School name University of Houston

Course Cosc 6340- Database Systems

Pages 52

This preview shows page 1-2-3-25-26-27-28-50-51-52 out of 52 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 52 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 52 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 52 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 52 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 52 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 52 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 52 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 52 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 52 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 52 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 52 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

KDD --- 2004 LecturesClustering and Similarity AssessmentMotivation: Why Clustering?Examples of Clustering ApplicationsRequirements of Clustering in Data MiningData Structures for ClusteringQuality Evaluation of ClustersChallenges in Obtaining Object Similarity MeasuresCase Study: Patient SimilarityGenerating a Global Similarity Measure from Single Variable Similarity MeasuresA Methodology to Obtain a Similarity MatrixInterval-scaled VariablesNormalization in [0,1]Other NormalizationsSimilarity Between ObjectsSimilarity Between Objects (Cont.)Similarity with respect to a Set of Binary VariablesSimilarity between Binary Variable SetsNominal VariablesOrdinal VariablesRatio-Scaled VariablesCase Study --- NormalizationCase Study --- Weight Selection and Similarity Measure SelectionMajor Clustering ApproachesPartitioning Algorithms: Basic ConceptThe K-Means Clustering MethodSlide 28Comments on the K-Means MethodPAM (Partitioning Around Medoids) (1987)PAM Clustering: Total swapping cost TCih=jCjihCLARANS (“Randomized” CLARA) (1994)Grid-Based Clustering MethodSTING: A Statistical Information Grid ApproachSTING: A Statistical Information Grid Approach (2)STING: A Statistical Information Grid Approach (3)CLIQUE (Clustering In QUEst)CLIQUE: The Major StepsPowerPoint PresentationStrength and Weakness of CLIQUEWork at UH related to Similarity Assessment and ClusteringSlide 42Slide 43Slide 44Research Goals Supervised ClusteringWhat is a good object distance function q for supervised similarity assessment?Idea: Coevolving Clusters and Similarity FunctionsIdea CR*-ApproachWeight Adjustment within a ClusterSummary: Problems and Challenges for ClusteringSummary Object Similarity & ClusteringReferences (1)References (2)1Han, Kamber, Eick: Object Similarity & Clustering for COSC 6340March 4+9: Introduction to KDDMarch 11: Association Rule MiningMarch 23: Similarity Assessment March 25: Clustering and UHDM2March 30: Data Warehouses and OLAPKDD --- 2004 Lectures2Han, Kamber, Eick: Object Similarity & Clustering for COSC 6340Clustering andSimilarity Assessment ©Jiawei Han and Micheline Kamberwith major Additions and Modifications by Ch. EickOrganization for COSC 6340:1. What is Clustering?2. Object Similarity Assessment3. K-means/medoid Clustering4. Grid-based Clustering5. Work at UH4Han, Kamber, Eick: Object Similarity & Clustering for COSC 6340Motivation: Why Clustering?Problem: Identify (a small number of) groups of similar objects in a given (large) set of object.Goals: Find representatives for homogeneous groups Data Compression Find “natural” clusters and describe their properties ”natural” Data TypesFind suitable and useful grouping ”useful” Data ClassesFind unusual data object Outlier Detection5Han, Kamber, Eick: Object Similarity & Clustering for COSC 6340Examples of Clustering ApplicationsPlant/Animal ClassificationBook Ordering Cloth SizesFraud Detection (Find outlier)6Han, Kamber, Eick: Object Similarity & Clustering for COSC 6340Requirements of Clustering in Data Mining ScalabilityAbility to deal with different types of attributesDiscovery of clusters with arbitrary shapeMinimal requirements for domain knowledge to determine input parametersAble to deal with noise and outliersInsensitive to order of input recordsHigh dimensionalityIncorporation of user-specified constraintsInterpretability and usability7Han, Kamber, Eick: Object Similarity & Clustering for COSC 6340Data Structures for ClusteringData matrix(n objects, p attributes)(Dis)Similarity matrix(nxn)npx...nfx...n1x...............ipx...ifx...i1x...............1px...1fx...11x0...)2,()1,(:::)2,3()...ndnd0dd(3,10d(2,1)08Han, Kamber, Eick: Object Similarity & Clustering for COSC 6340Quality Evaluation of ClustersDissimilarity/Similarity metric: Similarity is expressed in terms of a normalized distance function d, which is typically metric; typically:  (oi, oj) = 1 - d (oi, oj) There is a separate “quality” function that measures the “goodness” of a cluster.The definitions of similarity functions are usually very different for interval-scaled, boolean, categorical, ordinal and ratio-scaled variables.Weights should be associated with different variables based on applications and data semantics.It is hard to define “similar enough” or “good enough”  the answer is typically highly subjective.9Han, Kamber, Eick: Object Similarity & Clustering for COSC 6340Challenges in Obtaining Object Similarity MeasuresMany Types of VariablesInterval-scaled variablesBinary variables and nominal variablesOrdinal variablesRatio-scaled variablesObjects are characterized by variables belonging to different types (mixture of variables)10Han, Kamber, Eick: Object Similarity & Clustering for COSC 6340Case Study: Patient SimilarityThe following relation is given (with 10000 tuples):Patient(ssn, weight, height, cancer-sev, eye-color, age)Attribute Domainsssn: 9 digitsweight between 30 and 650; mweight=158 sweight=24.20height between 0.30 and 2.20 in meters; mheight=1.52 sheight=19.2cancer-sev: 4=serious 3=quite_serious 2=medium 1=minoreye-color: {brown, blue, green, grey }age: between 3 and 100; mage=45 sage=13.2Task: Define Patient Similarity11Han, Kamber, Eick: Object Similarity & Clustering for COSC 6340Generating a Global Similarity Measure from Single Variable Similarity Measures Assumption: A database may contain up to six types of variables: symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio.1. Standardize variable and associate similarity measure i with the standardized i-th variable and determine weight wi of the i-th variable.2. Create the following global (dis)similarity measure :ffjifjiwpfwoopfoo1*)(1,),(12Han, Kamber, Eick: Object Similarity & Clustering for COSC 6340A Methodology to Obtain a Similarity Matrix1. Understand Variables 2. Remove (non-relevant and redundant) Variables3. (Standardize and) Normalize Variables (typically using z-scores or variable values are transformed to numbers in [0,1])4. Associate (Dis)Similarity Measure df/f with each Variable5. Associate a Weight (measuring its importance) with each Variable6. Compute the (Dis)Similarity Matrix7. Apply Similarity-based Data Mining Technique

View Full Document