FIU CAP 4770 - Chapter 2: Data Preprocessing (21 pages)

Preview of pages 1, 2, 20, and 21 of a 21-page document.

Chapter 2: Data Preprocessing




Lecture Notes


Pages: 21
School: Florida International University
Course: CAP 4770 - Introduction to Data Mining

Unformatted text preview:

Chapter 2: Data Preprocessing

  - Why preprocess the data?
  - Descriptive data summarization
  - Data cleaning
  - Data integration and transformation
  - Data reduction
  - Discretization and concept hierarchy generation
  - Summary

01/14/19  Data Mining: Concepts and Techniques

Data Cleaning

  Importance:
  - "Data cleaning is one of the three biggest problems in data warehousing" (Ralph Kimball)
  - "Data cleaning is the number one problem in data warehousing" (DCI survey)

  Data cleaning tasks:
  - Fill in missing values
  - Identify outliers and smooth out noisy data
  - Correct inconsistent data
  - Resolve redundancy caused by data integration

Missing Data

  - Data is not always available
      - E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
  - Missing data may be due to:
      - equipment malfunction
      - values that were inconsistent with other recorded data and thus deleted
      - data not entered due to misunderstanding
      - certain data not being considered important at the time of entry
      - history or changes of the data not being registered
  - Missing data may need to be inferred

How to Handle Missing Data

  - Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably
  - Fill in the missing value manually: tedious and often infeasible
  - Fill it in automatically with:
      - a global constant, e.g., "unknown" (which in effect creates a new class)
      - the attribute mean
      - the attribute mean for all samples belonging to the same class (smarter)
      - the most probable value: inference-based, such as a Bayesian formula or a decision tree

Noisy Data

  - Noise: random error or variance in a measured variable
  - Incorrect attribute values may be due to:
      - faulty data collection instruments
      - data entry problems
      - data transmission problems
      - technology limitations
      - inconsistency in naming conventions
  - Other data problems that require data cleaning:
      - duplicate records
      - incomplete
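
The automatic strategies listed under "How to Handle Missing Data" can be illustrated with a small sketch. Everything below is an assumption made for illustration and is not taken from the lecture notes: the toy `records` list, the field names `income` and `label`, and the helper names `drop_unlabeled`, `fill_constant`, `fill_mean`, and `fill_class_mean` are all hypothetical. The sketch uses only the Python standard library and does not cover the inference-based option (Bayesian formula or decision tree).

```python
from statistics import mean

# Hypothetical toy sales data; None marks a missing value.
records = [
    {"customer": "a", "income": 40000.0, "label": "buys"},
    {"customer": "b", "income": None,    "label": "buys"},
    {"customer": "c", "income": 60000.0, "label": "buys"},
    {"customer": "d", "income": 20000.0, "label": "skips"},
    {"customer": "e", "income": None,    "label": "skips"},
    {"customer": "f", "income": 30000.0, "label": None},
]


def drop_unlabeled(rows, label_key="label"):
    """Ignore the tuple: drop rows whose class label is missing."""
    return [dict(r) for r in rows if r[label_key] is not None]


def fill_constant(rows, key, constant="unknown"):
    """Fill missing values of `key` with a global constant."""
    return [{**r, key: constant if r[key] is None else r[key]} for r in rows]


def fill_mean(rows, key):
    """Fill missing values of `key` with the mean of the observed values.

    Assumes at least one row has an observed value for `key`.
    """
    overall = mean(r[key] for r in rows if r[key] is not None)
    return [{**r, key: overall if r[key] is None else r[key]} for r in rows]


def fill_class_mean(rows, key, label_key="label"):
    """Fill missing values of `key` with the mean over rows of the same class.

    Falls back to the overall mean for a class with no observed values.
    """
    overall = mean(r[key] for r in rows if r[key] is not None)
    by_class = {}
    for r in rows:
        if r[key] is not None:
            by_class.setdefault(r[label_key], []).append(r[key])
    class_means = {c: mean(values) for c, values in by_class.items()}
    return [
        {**r, key: class_means.get(r[label_key], overall) if r[key] is None else r[key]}
        for r in rows
    ]


labeled = drop_unlabeled(records)             # customer "f" is dropped
by_mean = fill_mean(labeled, "income")        # gaps filled with 40000.0
by_class = fill_class_mean(labeled, "income") # "buys" gap: 50000.0, "skips" gap: 20000.0
```

On this toy data the class-conditional mean gives different fills for the two classes (50000.0 and 20000.0), whereas the plain attribute mean assigns 40000.0 to both, which is the sense in which the notes call the per-class variant "smarter".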


