UIC - CS 594

Chapter 2: Data Preprocessing

Data Types and Forms
  - Attribute-value data
  - Data types
      - numeric, categorical (see the hierarchy for their relationship)
      - static, dynamic (temporal)
  - Other kinds of data
      - distributed data
      - text, Web, meta data
      - images, audio/video

Chapter 2: Data Preprocessing
  - Why preprocess the data?
  - Data cleaning
  - Data integration and transformation
  - Data reduction
  - Discretization
  - Summary

Why Data Preprocessing?
  - Data in the real world is dirty
      - incomplete: missing attribute values, lacking certain attributes of interest, or containing only aggregate data
          - e.g., occupation=“”
      - noisy: containing errors or outliers
          - e.g., Salary=“-10”
      - inconsistent: containing discrepancies in codes or names
          - e.g., Age=“42” Birthday=“03/07/1997”
          - e.g., Was rating “1,2,3”, now rating “A, B, C”
          - e.g., discrepancy between duplicate records

Why Is Data Preprocessing Important?
  - No quality data, no quality mining results!
  - Quality decisions must be based on quality data
      - e.g., duplicate or missing data may cause incorrect or even misleading statistics.
  - Data preparation, cleaning, and transformation make up the majority of the work in a data mining application (90%).

Multi-Dimensional Measure of Data Quality
  - A well-accepted multi-dimensional view:
      - Accuracy
      - Completeness
      - Consistency
      - Timeliness
      - Believability
      - Value added
      - Interpretability
      - Accessibility

Major Tasks in Data Preprocessing
  - Data cleaning
      - Fill in missing values, smooth noisy data, identify or remove outliers and noisy data, and resolve inconsistencies
  - Data integration
      - Integration of multiple databases or files
  - Data transformation
      - Normalization and aggregation
  - Data reduction
      - Obtains a representation that is reduced in volume but produces the same or similar analytical results
  - Data discretization (for numerical data)

Data Cleaning
  - Importance
      - “Data cleaning is the number one problem in data warehousing”
  - Data cleaning tasks
      - Fill in missing values
      - Identify outliers and smooth out noisy data
      - Correct inconsistent data
      - Resolve redundancy caused by data integration

Missing Data
  - Data is not always available
      - e.g., many tuples have no recorded values for several attributes, such as customer income in sales data
  - Missing data may be due to
      - equipment malfunction
      - data that was inconsistent with other recorded data and was therefore deleted
      - data not entered due to misunderstanding
      - certain data not being considered important at the time of entry
      - history or changes of the data not being registered

How to Handle Missing Data?
  - Ignore the tuple
  - Fill in missing values manually: tedious + infeasible?
  - Fill them in automatically with
      - a global constant, e.g., “unknown” (a new class?!)
      - the attribute mean
      - the most probable value: inference-based, such as Bayesian formula, decision tree, or EM algorithm

Noisy Data
  - Noise: random error or variance in a measured variable.
  - Incorrect attribute values may be due to
      - faulty data collection instruments
      - data entry problems
      - data transmission problems
      - etc.
  - Other data problems that require data cleaning
      - duplicate records, incomplete data, inconsistent data

How to Handle Noisy Data?
  - Binning method:
      - first sort the data and partition it into (equi-depth) bins
      - then smooth by bin means, by bin medians, by bin boundaries, etc.
  - Clustering
      - detect and remove outliers
  - Combined computer and human inspection
      - detect suspicious values automatically and have a human check them (e.g., deal with possible outliers)

Binning Methods for Data Smoothing
  - Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
  - Partition into (equi-depth) bins:
      - Bin 1: 4, 8, 9, 15
      - Bin 2: 21, 21, 24, 25
      - Bin 3: 26, 28, 29, 34
  - Smoothing by bin means:
      - Bin 1: 9, 9, 9, 9
      - Bin 2: 23, 23, 23, 23
      - Bin 3: 29, 29, 29, 29
  - Smoothing by bin boundaries:
      - Bin 1: 4, 4, 4, 15
      - Bin 2: 21, 21, 25, 25
      - Bin 3: 26, 26, 26, 34

Outlier Removal
  - Outliers: data points inconsistent with the majority of the data
  - Different kinds of outliers
      - Valid: a CEO’s salary
      - Noisy: a person’s age = 200, widely deviating points
  - Removal methods
      - Clustering
      - Curve-fitting
      - Hypothesis-testing with a given model

Data Integration
  - Data integration: combines data from multiple sources
  - Schema integration
      - integrate metadata from different sources
      - Entity identification problem: identify real-world entities across multiple data sources, e.g., A.cust-id ≡ B.cust-#
  - Detecting and resolving data value conflicts
      - for the same real-world entity, attribute values from different sources differ, e.g., different scales, metric vs. British units
  - Removing duplicates and redundant data

Data Transformation
  - Smoothing: remove noise from data
  - Normalization: values are scaled to fall within a small, specified range
  - Attribute/feature construction
      - New attributes constructed from the given ones
  - Aggregation: summarization
  - Generalization: concept hierarchy climbing

Data Transformation: Normalization
  - min-max normalization
      $$v' = \frac{v - \min_A}{\max_A - \min_A}\,(\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A$$
  - z-score normalization
      $$v' = \frac{v - \mathit{mean}_A}{\mathit{stand\_dev}_A}$$
  - normalization by decimal scaling
      $$v' = \frac{v}{10^{j}}$$
      where $j$ is the smallest integer such that $\max(|v'|) < 1$
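
The three normalization formulas on the Normalization slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not code from the course: the function names are hypothetical, the z-score uses the population standard deviation (the slide does not say which variant is meant), and the decimal-scaling search is restricted to j >= 0.

```python
from statistics import mean, pstdev


def min_max(values, new_min=0.0, new_max=1.0):
    """Min-max normalization: map [min_A, max_A] onto [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("min-max normalization is undefined when all values are equal")
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


def z_score(values):
    """Z-score normalization: subtract the mean, divide by the standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        raise ValueError("z-score normalization is undefined when all values are equal")
    return [(v - mu) / sigma for v in values]


def decimal_scaling(values):
    """Decimal scaling: divide by 10**j so that every |v'| is below 1.

    Assumption: j is searched upward from 0, so very small inputs are
    left unscaled rather than scaled up with a negative j.
    """
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j


if __name__ == "__main__":
    prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
    print(min_max(prices))
    print(z_score(prices))
    print(decimal_scaling(prices))
```

Min-max depends only on the two extreme values, so a single outlier compresses everything else into a narrow band; z-score is the usual alternative when the true minimum and maximum are unknown or unreliable.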
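
The price example from the Binning Methods for Data Smoothing slide can be reproduced with a short Python sketch. The function names are hypothetical, and three details are assumptions rather than anything the slides specify: the data must divide evenly into the bins, bin means are rounded to the nearest integer (as the slide's figures appear to be), and a value equally distant from both bin boundaries goes to the lower one.

```python
def equi_depth_bins(values, n_bins):
    """Sort the data and split it into n_bins bins of equal size."""
    data = sorted(values)
    if n_bins <= 0 or len(data) % n_bins != 0:
        # Assumption: only handle the evenly divisible case, as in the slide.
        raise ValueError("number of values must be a positive multiple of n_bins")
    size = len(data) // n_bins
    return [data[i:i + size] for i in range(0, len(data), size)]


def smooth_by_means(bins):
    """Replace every value in a bin by the (rounded) mean of that bin."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the closer of its bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins are sorted, so these are the boundaries
        # Assumption: ties go to the lower boundary.
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


if __name__ == "__main__":
    prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
    bins = equi_depth_bins(prices, 3)
    print(bins)
    print(smooth_by_means(bins))
    print(smooth_by_boundaries(bins))
```

Smoothing by means pulls every value in a bin to one representative number, while smoothing by boundaries keeps the observed extremes of each bin and snaps the interior values to whichever extreme is nearer.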