NYU CSCI-GA 3033 - Characterization

Data Mining
Session 5 - Main Theme: Characterization

Dr. Jean-Claude Franchitti
New York University, Computer Science Department
Courant Institute of Mathematical Sciences

Adapted from course textbook resources:
Data Mining: Concepts and Techniques (2nd Edition), Jiawei Han and Micheline Kamber


Agenda

1. Session Overview
2. Characterization
3. Summary and Conclusion


Characterization in Brief

- What is concept description?
- Data generalization and summarization-based characterization
- Analytical characterization: analysis of attribute relevance
- Mining class comparisons: discriminating between different classes
- Mining descriptive statistical measures in large databases


Icons / Metaphors

Information, Common Realization, Knowledge/Competency Pattern, Governance, Alignment, Solution Approach


What is Concept Description?

- Descriptive vs. predictive data mining
  - Descriptive mining: describes concepts or task-relevant data sets in concise, summarative, informative, discriminative forms
  - Predictive mining: based on data and analysis, constructs models for the database and predicts the trend and properties of unknown data
- Concept description
  - Characterization: provides a concise and succinct summarization of the given collection of data
  - Comparison: provides descriptions comparing two or more collections of data


Data Generalization and Summarization-based Characterization

- Data generalization: a process which abstracts a large set of task-relevant data in a database from a low conceptual level to higher ones
  [Figure: conceptual levels 1 to 5]
- Approaches
  - Data cube approach (OLAP approach)
  - Attribute-oriented induction approach


Characterization: Data Cube Approach

- Perform computations and store results in data cubes
- Strengths
  - An efficient implementation of data generalization
  - Computation of various kinds of measures, e.g., count, sum, average, max
  - Generalization and specialization can be performed on a data cube by roll-up and drill-down
- Limitations
  - Handles only dimensions of simple nonnumeric data and measures of simple aggregated numeric values
  - Lack of intelligent analysis: cannot tell which dimensions should be used and what levels the generalization should reach


Attribute-Oriented Induction

- Proposed in 1989 (KDD '89 workshop)
- Not confined to categorical data nor particular measures
- How it is done
  - Collect the task-relevant data (initial relation) using a relational database query
  - Perform generalization by attribute removal or attribute generalization
  - Apply aggregation by merging identical generalized tuples and accumulating their respective counts
  - Interactive presentation with users


Basic Principles of Attribute-Oriented Induction

- Data focusing: task-relevant data, including dimensions; the result is the initial relation
- Attribute removal: remove attribute A if there is a large set of distinct values for A but (1) there is no generalization operator on A, or (2) A's higher-level concepts are expressed in terms of other attributes
- Attribute generalization: if there is a large set of distinct values for A, and there exists a set of generalization operators on A, then select an operator and generalize A
- Attribute-threshold control: typical 2-8, specified/default
- Generalized relation threshold control: control the final relation/rule size


Example

Describe general characteristics of graduate students in the Big University database:

    use Big_University_DB
    mine characteristics as "Science_Students"
    in relevance to name, gender, major, birth_place, birth_date, residence, phone, gpa
    from student
    where status in "graduate"

Corresponding SQL statement:

    Select name, gender, major, birth_place, birth_date, residence, phone, gpa
    from student
    where status in {"Msc", "MBA", "PhD"}


Class Characterization: An Example

Initial relation:

    Name           | Gender | Major   | Birth place           | Birth date | Residence                | Phone    | GPA
    Jim Woodman    | M      | CS      | Vancouver, BC, Canada | 8-12-76    | 3511 Main St., Richmond  | 687-4598 | 3.67
    Scott Lachance | M      | CS      | Montreal, Que, Canada | 28-7-75    | 345 1st Ave., Richmond   | 253-9106 | 3.70
    Laura Lee      | F      | Physics | Seattle, WA, USA      | 25-8-70    | 125 Austin Ave., Burnaby | 420-5232 | 3.83

Generalization applied to each attribute:

    Name: removed
    Gender: retained
    Major: Sci, Eng, Bus
    Birth place: country
    Birth date: age range
    Residence: city
    Phone: removed
    GPA: Excl, VG

Prime generalized relation:

    Gender | Major   | Birth region | Age range | Residence | GPA       | Count
    M      | Science | Canada       | 20-25     | Richmond  | Very good | 16
    F      | Science | Foreign      | 25-30     | Burnaby   | Excellent | 22

Crosstab of gender by birth region:

    Gender | Canada | Foreign | Total
    M      | 16     | 14      | 30
    F      | 10     | 22      | 32
    Total  | 26     | 36      | 62


Characterization vs. OLAP

- Similarity
  - Presentation of data summarization at multiple levels of abstraction
  - Interactive drilling, pivoting, slicing and dicing
- Differences
  - Automated desired level allocation
  - Dimension relevance analysis and ranking when there are many relevant dimensions
  - Sophisticated typing on dimensions and measures
  - Analytical characterization: data dispersion analysis


Attribute Relevance Analysis

- Why?
  - Which dimensions should be included?
  - How high a level of generalization?
  - Automatic vs. interactive
  - Reduce the number of attributes; easy-to-understand patterns
- What?
  - Statistical method for preprocessing data
    - Filter out irrelevant or weakly relevant attributes
    - Retain or rank the relevant attributes
  - Relevance related to dimensions and levels
  - Analytical characterization, analytical comparison


Attribute Relevance Analysis (continued)

- How?
  - Data collection
  - Analytical generalization: use information gain analysis (e.g., entropy or other measures) to identify highly relevant dimensions and levels
  - Relevance analysis: Sort
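The attribute-oriented induction steps described earlier (attribute removal, attribute generalization, then merging identical generalized tuples while accumulating counts) can be sketched roughly as follows. This is a minimal illustration, not the textbook's algorithm: the concept hierarchies, the GPA cut-off of 3.75, and the function names are all assumptions made for the example, and only three sample rows are used.

```python
from collections import Counter

# Hypothetical concept hierarchy for "major" (an assumption for this sketch).
MAJOR_TO_FIELD = {"CS": "Science", "Physics": "Science"}


def generalize_birth_place(place):
    """Generalize 'city, region, country' to the birth region Canada/Foreign."""
    country = place.split(",")[-1].strip()
    return "Canada" if country == "Canada" else "Foreign"


def generalize_gpa(gpa):
    """Map a numeric GPA to a label; the 3.75 cut-off is an assumed value."""
    return "Excellent" if gpa >= 3.75 else "Very good"


def attribute_oriented_induction(rows):
    """Return generalized tuples with their accumulated counts."""
    counts = Counter()
    for row in rows:
        # Attribute removal: "name" has many distinct values and no
        # generalization operator, so it is simply left out of the tuple.
        generalized = (
            row["gender"],                               # retained as-is
            MAJOR_TO_FIELD.get(row["major"], "Other"),   # attribute generalization
            generalize_birth_place(row["birth_place"]),  # attribute generalization
            generalize_gpa(row["gpa"]),                  # attribute generalization
        )
        # Aggregation: identical generalized tuples are merged by counting.
        counts[generalized] += 1
    return counts


initial_relation = [
    {"name": "Jim Woodman", "gender": "M", "major": "CS",
     "birth_place": "Vancouver, BC, Canada", "gpa": 3.67},
    {"name": "Scott Lachance", "gender": "M", "major": "CS",
     "birth_place": "Montreal, Que, Canada", "gpa": 3.70},
    {"name": "Laura Lee", "gender": "F", "major": "Physics",
     "birth_place": "Seattle, WA, USA", "gpa": 3.83},
]

result = attribute_oriented_induction(initial_relation)
for generalized_tuple, count in result.items():
    print(generalized_tuple, count)
```

With these three rows, the two male CS students collapse into a single generalized tuple with count 2, which is the merging step in miniature. A fuller version would also apply the attribute threshold and the generalized relation threshold to decide how far to climb each hierarchy.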

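The information gain analysis mentioned under attribute relevance analysis can be sketched as below. This is one common entropy-based formulation, offered as an assumption about what the slides intend: the toy rows, the attribute names, and the `rank_attributes` helper are hypothetical and exist only to show how attributes might be scored and sorted by relevance to a class label.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting."""
    base = entropy([row[target] for row in rows])
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attribute], []).append(row[target])
    remainder = sum(len(part) / len(rows) * entropy(part)
                    for part in partitions.values())
    return base - remainder


def rank_attributes(rows, attributes, target):
    """Sort attributes from most to least relevant by information gain."""
    return sorted(attributes,
                  key=lambda a: information_gain(rows, a, target),
                  reverse=True)


# Hypothetical toy data: "major" separates the classes, "gender" does not.
rows = [
    {"major": "Science", "gender": "M", "status": "graduate"},
    {"major": "Science", "gender": "F", "status": "graduate"},
    {"major": "Business", "gender": "M", "status": "undergraduate"},
    {"major": "Business", "gender": "F", "status": "undergraduate"},
]

ranking = rank_attributes(rows, ["gender", "major"], "status")
print(ranking)
```

Attributes whose gain falls below a chosen threshold would be candidates for filtering out as irrelevant or weakly relevant; the threshold itself is a design choice not specified in the text above.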
