Chapter 2: Data Preprocessing

* Data Cleaning
* Missing Data
* How to Handle Missing Data?
* Noisy Data
* How to Handle Noisy Data?
* Simple Discretization Methods: Binning
* Binning Methods for Data Smoothing
* Regression
* Cluster Analysis
* Data Cleaning as a Process
* Slide 12
* Data Integration
* Handling Redundancy in Data Integration
* Correlation Analysis (Numerical Data)
* Correlation Analysis (Categorical Data)
* Chi-Square Calculation: An Example
* Data Transformation
* Data Normalization
* Data Transformation: Normalization
* Z-Score (Example)

01/14/19 Data Mining: Concepts and Techniques

Chapter 2: Data Preprocessing

* Why preprocess the data?
* Descriptive data summarization
* Data cleaning
* Data integration and transformation
* Data reduction
* Discretization and concept hierarchy generation
* Summary

Data Cleaning

* Importance
  - “Data cleaning is one of the three biggest problems in data warehousing” (Ralph Kimball)
  - “Data cleaning is the number one problem in data warehousing” (DCI survey)
* Data cleaning tasks
  - Fill in missing values
  - Identify outliers and smooth out noisy data
  - Correct inconsistent data
  - Resolve redundancy caused by data integration

Missing Data

* Data is not always available
  - E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
* Missing data may be due to
  - equipment malfunction
  - data that was inconsistent with other recorded data and was therefore deleted
  - data not entered due to misunderstanding
  - certain data not being considered important at the time of entry
  - history or changes of the data not being recorded
* Missing data may need to be inferred.

How to Handle Missing Data?

* Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably.
* Fill in the missing value manually: tedious and infeasible?
* Fill it in automatically with
  - a global constant, e.g., “unknown” (a new class?!)
  - the attribute mean
  - the attribute mean for all samples belonging to the same class: smarter
  - the most probable value: inference-based, such as a Bayesian formula or decision tree

Noisy Data

* Noise: random error or variance in a measured variable
* Incorrect attribute values may be due to
  - faulty data collection instruments
  - data entry problems
  - data transmission problems
  - technology limitations
  - inconsistency in naming conventions
* Other data problems that require data cleaning
  - duplicate records
  - incomplete data
  - inconsistent data

How to Handle Noisy Data?

* Binning
  - first sort the data and partition it into (equal-frequency) bins
  - then smooth by bin means, bin medians, bin boundaries, etc.
* Regression
  - smooth by fitting the data to regression functions
* Clustering
  - detect and remove outliers
* Combined computer and human inspection
  - detect suspicious values and have a human check them (e.g., deal with possible outliers)

Simple Discretization Methods: Binning

* Equal-width (distance) partitioning
  - Divides the range into N intervals of equal size: uniform grid
  - If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B - A)/N
  - The most straightforward, but outliers may dominate presentation
  - Skewed data is not handled well
* Equal-depth (frequency) partitioning
  - Divides the range into N intervals, each containing approximately the same number of samples
  - Good data scaling
  - Managing categorical attributes can be tricky

Binning Methods for Data Smoothing

Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34

Regression

[Figure: plot of y against x with the regression line y = x + 1; labeled points X1, Y1, Y1’.]

Cluster Analysis

[Figure]

Data Cleaning as a Process

* Data discrepancy detection
  - Use metadata (e.g., domain, range, dependency, distribution)
  - Check field overloading
  - Check the uniqueness rule, consecutive rule, and null rule
  - Use commercial tools
    - Data scrubbing: use simple domain knowledge (e.g., postal code, spell-check) to detect errors and make corrections
    - Data auditing: analyze the data to discover rules and relationships and to detect violators (e.g., correlation and clustering to find outliers)
* Data migration and integration
  - Data migration tools: allow transformations to be specified
  - ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface
* Integration of the two processes
  - Iterative and interactive (e.g., Potter’s Wheel)

Chapter 2: Data Preprocessing

* Why preprocess the data?
* Data cleaning
* Data integration and transformation
* Data reduction
* Discretization and concept hierarchy generation
* Summary

Data Integration

* Data integration: combines data from multiple sources into a coherent store
* Schema integration: e.g., A.cust-id ≡ B.cust-#
  - Integrate metadata from different sources
* Entity identification problem
  - Identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton
* Detecting and resolving data value conflicts
  - For the same real-world entity, attribute values from different sources are different
  - Possible reasons: different representations, different scales, e.g., metric vs. British units

Handling Redundancy in Data Integration

* Redundant data often occur when multiple databases are integrated
  - Object identification: the same attribute or object may have different names in different databases
  - Derivable data: one attribute may be a “derived” attribute in another table, e.g., annual revenue
* Redundant attributes may be detected by correlation analysis
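
As a minimal sketch of how correlation can flag a redundant, derivable attribute, the Python below computes the Pearson correlation coefficient between two numeric attributes. The function name `pearson_correlation` and the revenue figures are illustrative assumptions, not taken from the slides; the annual figure is deliberately constructed as 12 times the monthly one, so the coefficient comes out at (numerically) 1.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (numerator) and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant attribute")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: annual revenue is derived from monthly revenue,
# so one of the two attributes carries no extra information.
monthly_revenue = [10.0, 12.5, 9.0, 14.0, 11.0]
annual_revenue = [12 * m for m in monthly_revenue]

r_derived = pearson_correlation(monthly_revenue, annual_revenue)
print(f"correlation(monthly, annual) = {r_derived:.3f}")
```

A coefficient close to +1 or -1 suggests that one attribute may be redundant given the other; what counts as "close" is a judgment call that depends on the data.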
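
The price example from the slide "Binning Methods for Data Smoothing" can be reproduced with a short Python sketch. The function names are illustrative assumptions. The sketch also assumes that the values divide evenly into the bins, that bin means are rounded to the nearest integer (as the slide's values suggest), and that a value equidistant from both boundaries goes to the lower one (the slide does not say how ties are resolved).

```python
def equal_frequency_bins(sorted_values, n_bins):
    """Split already-sorted values into n_bins bins of equal size."""
    if len(sorted_values) % n_bins != 0:
        raise ValueError("this sketch assumes the values divide evenly into bins")
    size = len(sorted_values) // n_bins
    return [sorted_values[i:i + size] for i in range(0, len(sorted_values), size)]


def smooth_by_means(bins):
    """Replace every value with its bin mean, rounded to the nearest integer."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the closer of its bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # bins are sorted, so these are the boundaries
        # Assumption: ties go to the lower boundary.
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_frequency_bins(prices, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)

print(bins)           # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(by_means)       # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_boundaries)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```

The printed results match the bins and smoothed values listed on the slide.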