Data Mining: Characterization

Slide outline:
- Concept Description: Characterization and Comparison
- What is Concept Description?
- Data Generalization and Summarization-based Characterization
- Attribute-Oriented Induction
- Basic Principles of Attribute-Oriented Induction
- Example
- Class Characterization: An Example
- Attribute Relevance Analysis
- Attribute Relevance Analysis (cont’d)
- Relevance Measures
- Information-Theoretic Approach
- Top-Down Induction of Decision Tree
- Example: Analytical Characterization
- Example: Analytical Characterization (cont’d)
- Example: Analytical Characterization (2)
- Example: Analytical Characterization (3)
- Example: Analytical Characterization (4)
- Example: Analytical Characterization (5)
- Mining Class Comparisons
- Example: Analytical Comparison
- Example: Analytical Comparison (2)
- Example: Analytical Comparison (3)
- Example: Analytical Comparison (4)
- Example: Analytical Comparison (5)
- Mining Data Dispersion Characteristics
- Measuring the Central Tendency
- Measuring the Dispersion of Data
- Boxplot Analysis
- A Boxplot
- Summary
- References
- References (cont.)

Concept Description: Characterization and Comparison
- What is concept description?
- Data generalization and summarization-based characterization
- Analytical characterization: Analysis of attribute relevance
- Mining class comparisons: Discriminating between different classes
- Mining descriptive statistical measures in large databases
- Summary

What is Concept Description?
- Descriptive vs. predictive data mining
  - Descriptive mining: describes concepts or task-relevant data sets in concise, summarized, informative, and discriminative forms
  - Predictive mining: constructs models for the database based on data and analysis, and predicts the trend and properties of unknown data
- Concept description
  - Characterization: provides a concise summarization of the given collection of data
  - Comparison: provides descriptions comparing two or more collections of data

Data Generalization and Summarization-based Characterization
- Data generalization: a process that abstracts a large set of task-relevant data in a database from low conceptual levels to higher ones.
  [Figure: conceptual levels 1 to 5]

Attribute-Oriented Induction
- Proposed in 1989 (KDD ’89 workshop)
- Not confined to categorical data or to particular measures
- How is it done?
  - Collect the task-relevant data (the initial relation) using a relational database query
  - Perform generalization by attribute removal or attribute generalization
  - Apply aggregation by merging identical generalized tuples and accumulating their respective counts
  - Present the results interactively to users

Basic Principles of Attribute-Oriented Induction
- Data focusing: select the task-relevant data, including dimensions; the result is the initial relation.
- Attribute removal: remove attribute A if there is a large set of distinct values for A but (1) there is no generalization operator on A, or (2) A’s higher-level concepts are expressed in terms of other attributes.
- Attribute generalization: if there is a large set of distinct values for A, and there exists a set of generalization operators on A, then select an operator and generalize A.
- Attribute-threshold control: the threshold is typically 2 to 8, either user-specified or a default.
- Generalized relation threshold control: controls the size of the final relation or rule.

Example
Describe general characteristics of graduate students in the Big-University database:

    use Big_University_DB
    mine characteristics as "Science_Students"
    in relevance to name, gender, major, birth_place, birth_date, residence, phone#, gpa
    from student
    where status in "graduate"

Corresponding SQL statement:

    select name, gender, major, birth_place, birth_date, residence, phone#, gpa
    from student
    where status in ('Msc', 'MBA', 'PhD')

Class Characterization: An Example

Initial Relation

Name            Gender  Major    Birth-Place            Birth_date  Residence                 Phone #   GPA
Jim Woodman     M       CS       Vancouver, BC, Canada  8-12-76     3511 Main St., Richmond   687-4598  3.67
Scott Lachance  M       CS       Montreal, Que, Canada  28-7-75     345 1st Ave., Richmond    253-9106  3.70
Laura Lee       F       Physics  Seattle, WA, USA       25-8-70     125 Austin Ave., Burnaby  420-5232  3.83
…               …       …        …                      …           …                         …         …

Generalization applied to each attribute:
Name: Removed; Gender: Retained; Major: Sci, Eng, Bus; Birth-Place: Country; Birth_date: Age range; Residence: City; Phone #: Removed; GPA: Excl, VG, ..

Prime Generalized Relation

Gender  Major    Birth_region  Age_range  Residence  GPA        Count
M       Science  Canada        20-25      Richmond   Very-good  16
F       Science  Foreign       25-30      Burnaby    Excellent  22
…       …        …             …          …          …          …

Counts by Gender and Birth_Region

Gender  Canada  Foreign  Total
M       16      14       30
F       10      22       32
Total   26      36       62

Attribute Relevance Analysis
- Why?
  - Which dimensions should be included?
  - How high a level of generalization?
  - Automatic vs. interactive
  - Reduce the number of attributes; produce patterns that are easy to understand
- What?
  - A statistical method for preprocessing data
    - filter out irrelevant or weakly relevant attributes
    - retain or rank the relevant attributes
  - Relevance is related to dimensions and levels
  - Analytical characterization, analytical comparison

Attribute Relevance Analysis (cont’d)
- How?
  - Data collection
  - Analytical generalization: use information gain analysis (e.g., entropy or other measures) to identify highly relevant dimensions and levels.
  - Relevance analysis: sort and select the most relevant dimensions and levels.
  - Attribute-oriented induction for class description: on the selected dimensions/levels
  - OLAP operations (e.g., drilling, slicing) on relevance rules

Relevance Measures
- A quantitative relevance measure determines the classifying power of an attribute within a set of data.
- Methods
  - information gain (ID3)
  - gain ratio (C4.5)
  - χ² contingency table statistics
  - uncertainty coefficient

Information-Theoretic Approach
Decision
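
As an illustration of the information gain measure listed under Relevance Measures, the following is a minimal Python sketch, not the slides' own procedure. The function names (entropy, information_gain) and the four-row toy data set are assumptions made for this example. It ranks attributes by how much they reduce the entropy of a class label.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of `target` minus its expected entropy after splitting on `attribute`."""
    base = entropy([row[target] for row in rows])
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attribute], []).append(row[target])
    remainder = sum(
        len(part) / len(rows) * entropy(part) for part in partitions.values()
    )
    return base - remainder


# Hypothetical toy data, invented for illustration (not taken from the slides).
students = [
    {"major": "Science", "gender": "M", "status": "graduate"},
    {"major": "Science", "gender": "F", "status": "graduate"},
    {"major": "Business", "gender": "M", "status": "undergraduate"},
    {"major": "Business", "gender": "F", "status": "undergraduate"},
]

# Sort attributes from most to least relevant for predicting `status`.
ranked = sorted(
    ["major", "gender"],
    key=lambda a: information_gain(students, a, "status"),
    reverse=True,
)
```

In this toy data, major separates the two classes completely while gender carries no information about status, so major ranks first. A relevance-analysis step would keep the top-ranked attributes and drop the rest.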
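
Returning to attribute-oriented induction, the generalize-then-merge steps described under Basic Principles can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the concept hierarchy MAJOR_HIERARCHY, the helper birth_region, and the choice of which attributes to keep are hypothetical, inferred loosely from the example tables. Only the three named rows of the initial relation are used.

```python
from collections import Counter

# Hypothetical concept hierarchy for `major` (assumed, not given in the slides).
MAJOR_HIERARCHY = {"CS": "Science", "Physics": "Science"}


def birth_region(place):
    """Generalize a birth place to 'Canada' or 'Foreign' (assumed hierarchy)."""
    return "Canada" if place.endswith("Canada") else "Foreign"


def generalize(row):
    """Attribute removal drops `name`; the other attributes are generalized."""
    return (
        row["gender"],                      # retained as-is
        MAJOR_HIERARCHY[row["major"]],      # generalized to a broader field
        birth_region(row["birth_place"]),   # generalized to a region
    )


def attribute_oriented_induction(rows):
    """Merge identical generalized tuples and accumulate their counts."""
    return Counter(generalize(row) for row in rows)


# The three named rows of the initial relation in the example.
initial_relation = [
    {"name": "Jim Woodman", "gender": "M", "major": "CS",
     "birth_place": "Vancouver, BC, Canada"},
    {"name": "Scott Lachance", "gender": "M", "major": "CS",
     "birth_place": "Montreal, Que, Canada"},
    {"name": "Laura Lee", "gender": "F", "major": "Physics",
     "birth_place": "Seattle, WA, USA"},
]

prime_relation = attribute_oriented_induction(initial_relation)
```

With only three input rows, the two M/Science/Canada tuples merge into one generalized tuple with count 2. The counts of 16 and 22 in the prime generalized relation come from the full student table, which is elided in the slides.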