Data Mining: Data Preparation

Data Preprocessing
* Why preprocess the data?
* Data cleaning
* Data integration and transformation
* Data reduction
* Discretization and concept hierarchy generation
* Summary

Why Data Preprocessing?
* Data in the real world is dirty:
  - incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  - noisy: containing errors or outliers
  - inconsistent: containing discrepancies in codes or names
* No quality data, no quality mining results!
  - Quality decisions must be based on quality data
  - A data warehouse needs consistent integration of quality data

Multi-Dimensional Measure of Data Quality
* A well-accepted multidimensional view:
  - Accuracy
  - Completeness
  - Consistency
  - Timeliness
  - Believability
  - Value added
  - Interpretability
  - Accessibility

Major Tasks in Data Preprocessing
* Data cleaning
  - Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
* Data integration
  - Integration of multiple databases, data cubes, or files
* Data transformation
  - Normalization and aggregation
* Data reduction
  - Obtains a representation that is reduced in volume but produces the same or similar analytical results
* Data discretization
  - Part of data reduction but of particular importance, especially for numerical data

Forms of data preprocessing

Data Cleaning
* Data cleaning tasks
  - Fill in missing values
  - Identify outliers and smooth out noisy data
  - Correct inconsistent data

Missing Data
* Data is not always available
  - E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
* Missing data may be due to
  - equipment malfunction
  - values that were inconsistent with other recorded data and thus deleted
  - data not entered due to misunderstanding
  - certain data not being considered important at the time of entry
  - history or changes of the data not being registered
* Missing data may need to be inferred.

How to Handle Missing Data?
* Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably
* Fill in the missing value manually: tedious and possibly infeasible
* Use a global constant to fill in the missing value: e.g., “unknown”, a new class?!
* Use the attribute mean to fill in the missing value
* Use the most probable value to fill in the missing value: inference-based, such as a Bayesian formula or a decision tree

Noisy Data
* Noise: random error or variance in a measured variable
* Incorrect attribute values may be due to
  - faulty data collection instruments
  - data entry problems
  - data transmission problems
  - technology limitations
  - inconsistency in naming conventions
* Other data problems that require data cleaning
  - duplicate records
  - incomplete data
  - inconsistent data

How to Handle Noisy Data?
* Binning method:
  - first sort the data and partition it into (equi-depth) bins
  - then smooth by bin means, bin medians, bin boundaries, etc.
* Clustering
  - detect and remove outliers
* Combined computer and human inspection
  - detect suspicious values and have a human check them
* Regression
  - smooth by fitting the data to regression functions

Simple Discretization Methods: Binning
* Equal-width (distance) partitioning:
  - Divides the range into N intervals of equal size: a uniform grid
  - If A and B are the lowest and highest values of the attribute, the width of the intervals is W = (B-A)/N
  - The most straightforward method
  - But outliers may dominate the presentation
  - Skewed data is not handled well
* Equal-depth (frequency) partitioning:
  - Divides the range into N intervals, each containing approximately the same number of samples
  - Good data scaling
  - Managing categorical attributes can be tricky

Binning Methods for Data Smoothing
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
* Smoothing by bin means (rounded to the nearest dollar):
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34

Data Integration
* Data integration: combines data from multiple sources into a coherent store
* Schema integration
  - integrate metadata from different sources
  - Entity identification problem: identify real-world entities from multiple data sources, e.g., matching A.cust-id with B.cust-#
* Detecting and resolving data value conflicts
  - for the same real-world entity, attribute values from different sources are different
  - possible reasons: different representations, different scales, e.g., metric vs. British units

Handling Redundant Data
* Redundant data often occur when multiple databases are integrated
  - The same attribute may have different names in different databases
* Careful integration of data from multiple sources may help reduce or avoid redundancies and inconsistencies and improve mining speed and quality

Data Transformation
* Smoothing: remove noise from data
* Aggregation: summarization, data cube construction
* Generalization: concept hierarchy climbing
* Normalization: scaled to fall within a small, specified range
  - min-max normalization
  - z-score normalization
  - normalization by decimal scaling

Data Transformation: Normalization
* min-max normalization
* z-score
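
The three normalization methods named above can be sketched in a few lines of Python. This excerpt does not include the formulas, so the code assumes the usual textbook definitions: min-max rescales linearly into a target range, z-score subtracts the mean and divides by the standard deviation, and decimal scaling divides by the smallest power of 10 that brings every absolute value below 1. The function names, the default target range [0, 1], and the use of the population standard deviation are illustrative assumptions, not taken from the slides.

```python
import statistics


def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # No spread to rescale: map every value to new_min (an arbitrary choice).
        return [new_min for _ in values]
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


def z_score_normalize(values):
    """Subtract the mean and divide by the population standard deviation (assumed)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def decimal_scaling_normalize(values):
    """Divide by 10**j, where j is the smallest integer giving max(|v|) / 10**j < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / (10 ** j) >= 1:
        j += 1
    return [v / (10 ** j) for v in values]


# Price data reused from the binning slide.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
scaled = min_max_normalize(prices)          # 4 maps to 0.0, 34 maps to 1.0
standardized = z_score_normalize(prices)    # mean 0, standard deviation 1
shifted = decimal_scaling_normalize(prices) # j = 2, so 34 becomes 0.34
```

Min-max keeps the relative spacing of the values but is sensitive to the extreme values, whereas z-score is the more natural choice when the true minimum and maximum are unknown.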
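
Returning to the price example under "Binning Methods for Data Smoothing", the following minimal Python sketch reproduces its bins and smoothed values, and also computes equal-width bin edges from W = (B-A)/N. The function names are hypothetical. Two choices are assumptions rather than anything the slides state: bin means are rounded to the nearest integer (which matches the slide's 23 and 29), and a value equidistant from both bin boundaries goes to the lower one. The sketch also assumes there are at least as many values as bins.

```python
def equal_width_edges(values, n_bins):
    """Return the n_bins + 1 edges of a uniform grid with width W = (B - A) / N."""
    a, b = min(values), max(values)
    width = (b - a) / n_bins
    return [a + i * width for i in range(n_bins + 1)]


def equal_depth_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of (nearly) equal size."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        # The first `extra` bins take one additional value when the split is uneven.
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins


def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean, rounded to the nearest integer."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the closer of its bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # each bin is already sorted
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_depth_bins(prices, 3)
# bins                       -> [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
# smooth_by_means(bins)      -> [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
# smooth_by_boundaries(bins) -> [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
# equal_width_edges(prices, 3) -> [4.0, 14.0, 24.0, 34.0]
```

On the same data, the equal-width grid puts only three values (4, 8, 9) in its first interval, while equal-depth bins always hold four each, which illustrates why skewed data suits frequency-based partitioning better.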