TEMPLE CIS 664 - Data Preprocessing - D1728363


Course Cis 664- Knowledge Discovery and Data Mining

Pages 52


CIS664-Knowledge Discovery and Data Mining
Agenda
Why Data Preprocessing?
Major Tasks in Data Preprocessing
Forms of data preprocessing
Slide 6
Data Cleaning
Missing Data
How to Handle Missing Data?
Noisy Data
How to Handle Noisy Data?
Simple Discretization Methods: Binning
Binning Methods for Data Smoothing
Cluster Analysis
Regression
How to Handle Inconsistent Data?
Slide 17
Data Integration
Handling Redundant Data in Data Integration
Data Transformation
Data Transformation: Normalization
Slide 22
Data Reduction
Slide 24
Data Cube Aggregation
Dimensionality Reduction
Slide 27
Data Compression
Slide 30
Wavelet Transforms
Principal Component Analysis (PCA) Karhunen-Loeve (K-L) method
Slide 33
Numerosity Reduction
Regression and Log-Linear Models
Regression Analysis and Log-Linear Models
Histograms
Clustering
Sampling
Slide 40
Slide 41
Hierarchical Reduction
Slide 43
Discretization/Quantization
Discretization and Concept Hierarchy
Discretization and concept hierarchy generation for numeric data
Entropy-Based Discretization
Segmentation by natural partitioning
Example of 3-4-5 rule
Concept hierarchy generation for categorical data
Concept hierarchy generation w/o data semantics - Specification of a set of attributes
Slide 52
Summary

CIS664-Knowledge Discovery and Data Mining
Vasileios Megalooikonomou
Dept. of Computer and Information Sciences
Temple University
Data Preprocessing
(based on notes by Jiawei Han and Micheline Kamber)

Agenda
• Why data preprocessing?
• Data cleaning
• Data integration and transformation
• Data reduction
• Discretization and concept hierarchy generation
• Summary

Why Data Preprocessing?
• Data in the real world is dirty
  – incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  – noisy: containing errors or outliers
  – inconsistent: containing discrepancies in codes or names
• No quality data, no quality mining results!
  – Quality decisions must be based on quality data
  – Data warehouse needs consistent integration of quality data
• A multi-dimensional measure of data quality:
  – A well-accepted multi-dimensional view: accuracy, completeness, consistency, timeliness, believability, value added, interpretability, accessibility
  – Broad categories: intrinsic, contextual, representational, and accessibility

Major Tasks in Data Preprocessing
• Data cleaning
  – Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration
  – Integration of multiple databases, data cubes, files, or notes
• Data transformation
  – Normalization (scaling to a specific range)
  – Aggregation
• Data reduction
  – Obtains a representation that is reduced in volume but produces the same or similar analytical results
  – Data discretization: of particular importance, especially for numerical data
  – Data aggregation, dimensionality reduction, data compression, generalization

Forms of data preprocessing

Agenda
• Why preprocess the data?
• Data cleaning
• Data integration and transformation
• Data reduction
• Discretization and concept hierarchy generation
• Summary

Data Cleaning
• Data cleaning tasks
  – Fill in missing values
  – Identify outliers and smooth out noisy data
  – Correct inconsistent data

Missing Data
• Data is not always available
  – E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
• Missing data may be due to
  – equipment malfunction
  – data that was inconsistent with other recorded data and thus deleted
  – data not entered due to misunderstanding
  – certain data not being considered important at the time of entry
  – history or changes of the data not being registered
• Missing data may need to be inferred

How to Handle Missing Data?
• Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective in certain cases
• Fill in the missing value manually: tedious + infeasible?
• Use a global constant to fill in the missing value: e.g., “unknown”, a new class?!
• Use the attribute mean to fill in the missing value
• Use the attribute mean for all samples of the same class to fill in the missing value: smarter
• Use the most probable value to fill in the missing value: inference-based, such as regression, Bayesian formula, decision tree

Noisy Data
• Q: What is noise?
• A: Random error in a measured variable.
• Incorrect attribute values may be due to
  – faulty data collection instruments
  – data entry problems
  – data transmission problems
  – technology limitation
  – inconsistency in naming convention
• Other data problems which require data cleaning
  – duplicate records
  – incomplete data
  – inconsistent data

How to Handle Noisy Data?
• Binning method:
  – first sort data and partition into (equi-depth) bins
  – then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.
  – used also for discretization (discussed later)
• Clustering
  – detect and remove outliers
• Semi-automated method: combined computer and human inspection
  – detect suspicious values and check manually
• Regression
  – smooth by fitting the data into regression functions

Simple Discretization Methods: Binning
• Equal-width (distance) partitioning:
  – It divides the range into N intervals of equal size: uniform grid
  – If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B-A)/N.
  – The most straightforward
  – But outliers may dominate presentation
  – Skewed data is not handled well.
• Equal-depth (frequency) partitioning:
  – It divides the range into N intervals, each containing approximately the same number of samples
  – Good data scaling
  – Managing categorical attributes can be tricky.

Binning Methods for Data Smoothing
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34

Cluster Analysis

Regression
[Figure: axes x and y; line y = x + 1; labels X1, Y1, Y1']
• Linear regression (best line to fit two variables)
• Multiple linear regression (more than two variables, fit to a multidimensional surface)

How to Handle Inconsistent Data?
• Manual correction using external references
• Semi-automatic using various tools
  – To detect violation of known functional dependencies and data constraints
  – To
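
Worked example (binning for data smoothing): the price example on the "Binning Methods for Data Smoothing" slide can be reproduced with a few lines of Python. This is only a sketch, not code from the course. The function names are hypothetical, bin means are rounded to the nearest integer to match the slide, and a value equidistant from both bin boundaries is assumed to go to the lower one (the slide does not say how ties are handled).

```python
def equi_depth_bins(values, n_bins):
    """Sort values and split them into n_bins bins of (nearly) equal size."""
    if not 1 <= n_bins <= len(values):
        raise ValueError("n_bins must be between 1 and len(values)")
    data = sorted(values)
    size, extra = divmod(len(data), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        # The first `extra` bins take one additional value when sizes are uneven.
        end = start + size + (1 if i < extra else 0)
        bins.append(data[start:end])
        start = end
    return bins


def smooth_by_means(bins):
    """Replace every value with its bin mean, rounded to the nearest integer."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the closer of its bin's lowest and highest value."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins are sorted, so these are the boundaries
        # Assumption: a value equidistant from both boundaries goes to the lower one.
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equi_depth_bins(prices, 3)
print(bins)
print(smooth_by_means(bins))
print(smooth_by_boundaries(bins))
```

The exact bin means are 9, 22.75, and 29.25; the slide reports them rounded, which is why the sketch rounds as well.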
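
Worked example (equal-width partitioning): the width formula W = (B-A)/N from the "Simple Discretization Methods: Binning" slide can be sketched as follows. The function name is hypothetical, the highest value is assumed to belong to the last interval, and the second data set (with the value 500) is made up purely to illustrate how a single outlier can dominate the partition.

```python
def equal_width_bins(values, n_bins):
    """Partition values into n_bins intervals of width W = (B - A) / N."""
    if n_bins < 1 or not values:
        raise ValueError("need at least one bin and one value")
    a, b = min(values), max(values)
    width = (b - a) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in sorted(values):
        if width == 0:
            idx = 0  # all values are identical, so everything goes in one bin
        else:
            # Assumption: the maximum value B is placed in the last interval.
            idx = min(int((v - a) / width), n_bins - 1)
        bins[idx].append(v)
    return width, bins


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
width, bins = equal_width_bins(prices, 3)
print(width, bins)

# Hypothetical data with one outlier: almost everything lands in the first bin.
_, skewed = equal_width_bins([4, 8, 9, 15, 21, 500], 3)
print(skewed)
```

For the price data, A = 4, B = 34, and N = 3 give W = 10, so the intervals start at 4, 14, and 24. Unlike equal-depth partitioning, the bins end up with different numbers of values.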
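
Worked example (filling in missing values): two of the options on the "How to Handle Missing Data?" slide, the attribute mean and the attribute mean per class, can be sketched as below. This is an illustration under stated assumptions, not the course's implementation: missing values are represented as None, the function names and the income data are hypothetical, and a class with no known values is assumed to fall back to the overall attribute mean.

```python
from collections import defaultdict


def fill_with_mean(values):
    """Replace None with the mean of the known values of the attribute."""
    known = [v for v in values if v is not None]
    if not known:
        raise ValueError("no known values to average")
    attribute_mean = sum(known) / len(known)
    return [attribute_mean if v is None else v for v in values]


def fill_with_class_mean(values, labels):
    """Replace None with the mean of the known values in the same class."""
    known = [v for v in values if v is not None]
    if not known:
        raise ValueError("no known values to average")
    overall_mean = sum(known) / len(known)
    sums, counts = defaultdict(float), defaultdict(int)
    for v, c in zip(values, labels):
        if v is not None:
            sums[c] += v
            counts[c] += 1
    filled = []
    for v, c in zip(values, labels):
        if v is not None:
            filled.append(v)
        elif counts[c] > 0:
            filled.append(sums[c] / counts[c])
        else:
            # Assumption: fall back to the overall mean for an all-missing class.
            filled.append(overall_mean)
    return filled


# Hypothetical customer incomes (in thousands) with two missing entries.
incomes = [30.0, None, 50.0, 90.0, None, 110.0]
classes = ["low", "low", "low", "high", "high", "high"]
print(fill_with_mean(incomes))
print(fill_with_class_mean(incomes, classes))
```

With this data the overall mean is 70, while the per-class means are 40 and 100, so the class-based fill stays closer to the other values in each class, which is the sense in which the slide calls it "smarter".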
