Data Preprocessing
Dr. Bernard Chen Ph.D.
University of Central Arkansas
Fall 2010

Outline
- Introduction
- Descriptive Data Summarization
- Data Cleaning
  - Missing values
  - Noisy data
- Data Integration
  - Redundancy
- Data Transformation

Knowledge Discovery (KDD) Process
- Data mining: the core of the knowledge discovery process
- [Figure labels: Databases, Data Cleaning, Data Integration, Data Warehouse, Selection, Task-relevant Data, Data Mining, Pattern Evaluation]

Knowledge Process
1. Data cleaning: to remove noise and inconsistent data
2. Data integration: to combine multiple sources
3. Data selection: to retrieve relevant data for analysis
4. Data transformation: to transform data into an appropriate form for data mining
5. Data mining
6. Evaluation
7. Knowledge presentation

Why Preprocess the Data
- Imagine that you are a manager at AllElectronics and have been charged with analyzing the company’s data
- Then you realize:
  - Several of the attributes for various tuples have no recorded value
  - Some information you want is not recorded
  - Some values are reported as incomplete, noisy, and inconsistent
- Welcome to the real world!

Why Data Preprocessing?
- Data in the real world is dirty
  - incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
    - e.g., occupation=“ ”
  - noisy: containing errors or outliers
    - e.g., Salary=“-10”
  - inconsistent: containing discrepancies in codes or names
    - e.g., Age=“42” Birthday=“03/07/1997”
    - e.g., Was rating “1,2,3”, now rating “A, B, C”
    - e.g., discrepancy between duplicate records

Why Is Data Dirty?
- Incomplete data may come from
  - “Not applicable” data value when collected
  - Different considerations between the time when the data was collected and when it is analyzed
  - Human/hardware/software problems
- Noisy data (incorrect values) may come from
  - Faulty data collection instruments
  - Human or computer error at data entry
  - Errors in data transmission
- Inconsistent data may come from
  - Different data sources
  - Functional dependency violation (e.g., modifying some linked data)
- Duplicate records also need data cleaning

Why Is Data Preprocessing Important?
- No quality data, no quality mining results!
  - Quality decisions must be based on quality data
    - e.g., duplicate or missing data may cause incorrect or even misleading statistics
- Data extraction, cleaning, and transformation make up the majority of the work of building a data warehouse

Major Tasks in Data Preprocessing
- Data cleaning
  - Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
- Data integration
  - Integration of multiple databases, data cubes, or files
- Data transformation
  - Normalization and aggregation
- Data reduction
  - Obtains a reduced representation in volume but produces the same or similar analytical results

Forms of Data Preprocessing
- [Figure]

Descriptive Data Summarization
- Motivation
  - To better understand the data: central tendency, variation and spread
- Data dispersion characteristics
  - Median, max, min, quantiles, outliers, variance, etc.
- Numerical dimensions correspond to sorted intervals
  - Data dispersion: analyzed with multiple granularities of precision
  - Boxplot or quantile analysis on sorted intervals

Measuring the Central Tendency
- Mean
- Median
- Mode
  - Value that occurs most frequently in the data
  - Datasets with one, two, or three modes are respectively called unimodal, bimodal, and trimodal

Symmetric vs. Skewed Data
- [Figure]

Measuring the Dispersion of Data
- Quartiles, outliers and boxplots
  - The median is the 50th percentile
  - Quartiles: Q1 (25th percentile), Q3 (75th percentile)
  - Inter-quartile range (IQR): IQR = Q3 - Q1
  - Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1

Boxplot Analysis
- Five-number summary of a distribution:
  - Minimum, Q1, M, Q3, Maximum
- Boxplot
  - Data is represented with a box
  - The ends of the box are at the first and third quartiles, i.e., the height of the box is the IQR
  - The median is marked by a line within the box
  - Whiskers: two lines outside the box extend to Minimum and Maximum
- [Figure]

Histogram Analysis
- Graph displays of basic statistical class descriptions
  - Frequency histograms
    - A univariate graphical method
    - Consists of a set of rectangles that reflect the counts or frequencies of the classes present in the given data
- [Figure]

Quantile Plot
- Displays all of the data (allowing the user to assess both the overall behavior and unusual occurrences)
- Plots quantile information
  - For data x_i sorted in increasing order, f_i indicates that approximately 100 f_i% of the data are below or equal to the value x_i
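The measures of central tendency and dispersion listed on the slides above can be tried out with a few lines of Python. This is a minimal sketch, not part of the original slides: the function names and the salary values are hypothetical, and the quartiles use the "inclusive" convention of Python's statistics module (other quartile conventions give slightly different Q1 and Q3).

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    ordered = sorted(values)
    # Assumption: method="inclusive" is one of several quartile conventions;
    # others can give slightly different Q1 and Q3.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]


def iqr_outliers(values):
    """Return values more than 1.5 * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical salaries (in thousands), used only for illustration.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print("mean:", statistics.mean(salaries))
print("median:", statistics.median(salaries))
print("modes:", statistics.multimode(salaries))
print("five-number summary:", five_number_summary(salaries))
print("outliers:", iqr_outliers(salaries))
```

With this sample the two modes (52 and 70) make the data bimodal, and the 1.5 × IQR rule flags only the value 110 as an outlier.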
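The f_i values used in a quantile plot can be computed directly from the sorted data. The slides do not say how f_i is obtained, so the sketch below assumes the common plotting-position formula f_i = (i - 0.5) / n; the function name and the price values are hypothetical.

```python
def quantile_plot_points(values):
    """Pair each sorted value x_i with its fraction f_i, for i = 1..n."""
    ordered = sorted(values)
    n = len(ordered)
    # Assumption: f_i = (i - 0.5) / n, one common plotting-position formula;
    # the slides only say that about 100 * f_i % of the data are <= x_i.
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]


# Hypothetical unit prices, used only for illustration.
prices = [40, 43, 47, 74, 75, 78, 115, 117, 120]

for f, x in quantile_plot_points(prices):
    print(f"f = {f:.3f}  x = {x}")
```

Plotting x against f for these pairs gives the quantile plot: every data point appears, so both the overall shape and unusual values are visible.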