NYU CSCI-GA 3033 - Data Mining

Data Mining: Characterization

Concept Description: Characterization and Comparison
- What is concept description?
- Data generalization and summarization-based characterization
- Analytical characterization: analysis of attribute relevance
- Mining class comparisons: discriminating between different classes
- Mining descriptive statistical measures in large databases
- Summary

What is Concept Description?
- Descriptive vs. predictive data mining
  - Descriptive mining: describes concepts or task-relevant data sets in concise, summarized, informative, discriminative forms.
  - Predictive mining: constructs models for the database based on data and analysis, and predicts the trend and properties of unknown data.
- Concept description
  - Characterization: provides a concise and succinct summarization of the given collection of data.
  - Comparison: provides descriptions comparing two or more collections of data.

Data Generalization and Summarization-based Characterization
- Data generalization: a process that abstracts a large set of task-relevant data in a database from a low conceptual level to higher ones.
  [Figure: conceptual levels 1 through 5]

Attribute-Oriented Induction
- Proposed in 1989 (KDD '89 workshop).
- Not confined to categorical data nor to particular measures.
- How is it done?
  1. Collect the task-relevant data (the initial relation) using a relational database query.
  2. Perform generalization by attribute removal or attribute generalization.
  3. Apply aggregation by merging identical generalized tuples and accumulating their respective counts.
  4. Present the results interactively to users.

Basic Principles of Attribute-Oriented Induction
- Data focusing: select the task-relevant data, including dimensions; the result is the initial relation.
- Attribute removal: remove attribute A if there is a large set of distinct values for A but (1) there is no generalization operator on A, or (2) A's higher-level concepts are expressed in terms of other attributes.
- Attribute generalization: if there is a large set of distinct values for A, and there exists a set of generalization operators on A, then select an operator and generalize A.
- Attribute-threshold control: typically 2-8, specified or default.
- Generalized relation threshold control: controls the final relation/rule size.

Example
Describe general characteristics of graduate students in the Big-University database:

    use Big_University_DB
    mine characteristics as "Science_Students"
    in relevance to name, gender, major, birth_place, birth_date, residence, phone#, gpa
    from student
    where status in "graduate"

Corresponding SQL statement:

    SELECT name, gender, major, birth_place, birth_date, residence, phone#, gpa
    FROM student
    WHERE status IN ('Msc', 'MBA', 'PhD')

Class Characterization: An Example

Initial relation:

    Name            Gender  Major    Birth-Place             Birth_date  Residence                 Phone #   GPA
    Jim Woodman     M       CS       Vancouver, BC, Canada   8-12-76     3511 Main St., Richmond   687-4598  3.67
    Scott Lachance  M       CS       Montreal, Que, Canada   28-7-75     345 1st Ave., Richmond    253-9106  3.70
    Laura Lee       F       Physics  Seattle, WA, USA        25-8-70     125 Austin Ave., Burnaby  420-5232  3.83
    …               …       …        …                       …           …                         …         …

Generalization applied to each attribute:

    Name         Removed
    Gender       Retained
    Major        Sci, Eng, Bus
    Birth-Place  Country
    Birth_date   Age range
    Residence    City
    Phone #      Removed
    GPA          Excl, VG, ..

Prime generalized relation:

    Gender  Major    Birth_region  Age_range  Residence  GPA        Count
    M       Science  Canada        20-25      Richmond   Very-good  16
    F       Science  Foreign       25-30      Burnaby    Excellent  22
    …       …        …             …          …          …          …

Crosstab of Gender by Birth_Region:

    Gender  Canada  Foreign  Total
    M       16      14       30
    F       10      22       32
    Total   26      36       62

Attribute Relevance Analysis
- Why?
  - Which dimensions should be included?
  - How high a level of generalization?
  - Automatic vs. interactive
  - Reduce the number of attributes; easy to understand patterns
- What?
  - A statistical method for preprocessing data
    - filter out irrelevant or weakly relevant attributes
    - retain or rank the relevant attributes
  - Relevance related to dimensions and levels
  - Analytical characterization, analytical comparison

Attribute Relevance Analysis (cont'd)
- How?
  1. Data collection
  2. Analytical generalization: use information gain analysis (e.g., entropy or other measures) to identify highly relevant dimensions and levels.
  3. Relevance analysis: sort and select the most relevant dimensions and levels.
  4. Attribute-oriented induction for class description: on the selected dimensions/levels.
  5. OLAP operations (e.g., drilling, slicing) on relevance rules

Relevance Measures
- A quantitative relevance measure determines the classifying power of an attribute within a set of data.
- Methods
  - information gain (ID3)
  - gain ratio (C4.5)
  - χ² contingency table statistics
  - uncertainty coefficient

Information-Theoretic Approach
Decision

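The attribute-oriented induction steps described earlier (attribute removal, attribute generalization, then merging identical tuples while accumulating counts) can be sketched as follows. This is a rough illustration under stated assumptions: the three tuples come from the initial relation in the example, but the concept hierarchy `MAJOR_HIERARCHY`, the GPA cutoffs in `generalize_gpa`, and the omission of the birth date are hypothetical simplifications, not part of the source.

```python
from collections import Counter

# Three tuples from the initial relation in the example (birth date omitted for brevity).
initial_relation = [
    {"name": "Jim Woodman", "gender": "M", "major": "CS",
     "birth_place": "Vancouver, BC, Canada",
     "residence": "3511 Main St., Richmond", "phone": "687-4598", "gpa": 3.67},
    {"name": "Scott Lachance", "gender": "M", "major": "CS",
     "birth_place": "Montreal, Que, Canada",
     "residence": "345 1st Ave., Richmond", "phone": "253-9106", "gpa": 3.70},
    {"name": "Laura Lee", "gender": "F", "major": "Physics",
     "birth_place": "Seattle, WA, USA",
     "residence": "125 Austin Ave., Burnaby", "phone": "420-5232", "gpa": 3.83},
]

# Hypothetical concept hierarchy for the major attribute.
MAJOR_HIERARCHY = {"CS": "Science", "Physics": "Science"}


def generalize_gpa(gpa):
    """Map a numeric GPA to a label; the cutoffs are assumptions."""
    if gpa >= 3.75:
        return "Excellent"
    if gpa >= 3.5:
        return "Very-good"
    return "Good"


def generalize(row):
    """Drop name and phone (attribute removal) and lift the rest one level up."""
    country = row["birth_place"].split(",")[-1].strip()
    city = row["residence"].split(",")[-1].strip()
    return (
        row["gender"],                                   # retained as-is
        MAJOR_HIERARCHY.get(row["major"], "Other"),      # major -> field
        "Canada" if country == "Canada" else "Foreign",  # birth place -> region
        city,                                            # street address -> city
        generalize_gpa(row["gpa"]),                      # number -> label
    )


# Merge identical generalized tuples and accumulate their counts.
generalized_relation = Counter(generalize(row) for row in initial_relation)
for generalized_tuple, count in generalized_relation.items():
    print(generalized_tuple, count)
```

With only three input tuples the counts differ from the 16 and 22 shown in the prime generalized relation, which summarizes the full student table.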
