NYU CSCI-GA 3033 - Data Mining - Data Preparation

Slide titles: Data Mining: Data Preparation; Data Preprocessing; Why Data Preprocessing?; Multi-Dimensional Measure of Data Quality; Major Tasks in Data Preprocessing; Forms of data preprocessing; Data Cleaning; Missing Data; How to Handle Missing Data?; Noisy Data; How to Handle Noisy Data?; Simple Discretization Methods: Binning; Binning Methods for Data Smoothing; Data Integration; Handling Redundant Data; Data Transformation; Data Transformation: Normalization; Data Reduction Strategies; Data Cube Aggregation; Dimensionality Reduction; Regression and Log-Linear Models; Regression Analysis and Log-Linear Models; Histograms; Clustering; Sampling; Discretization; Discretization and Concept Hierarchy; Discretization for numeric data; Summary; References

Data Mining: Data Preparation

Data Preprocessing
* Why preprocess the data?
* Data cleaning
* Data integration and transformation
* Data reduction
* Discretization and concept hierarchy generation
* Summary

Why Data Preprocessing?
* Data in the real world is dirty
  - incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  - noisy: containing errors or outliers
  - inconsistent: containing discrepancies in codes or names
* No quality data, no quality mining results!
  - Quality decisions must be based on quality data
  - Data warehouse needs consistent integration of quality data

Multi-Dimensional Measure of Data Quality
* A well-accepted multidimensional view:
  - Accuracy
  - Completeness
  - Consistency
  - Timeliness
  - Believability
  - Value added
  - Interpretability
  - Accessibility

Major Tasks in Data Preprocessing
* Data cleaning
  - Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
* Data integration
  - Integration of multiple databases, data cubes, or files
* Data transformation
  - Normalization and aggregation
* Data reduction
  - Obtains a representation that is reduced in volume but produces the same or similar analytical results
* Data discretization
  - Part of data reduction but with particular importance, especially for numerical data

Forms of data preprocessing

Data Cleaning
* Data cleaning tasks
  - Fill in missing values
  - Identify outliers and smooth out noisy data
  - Correct inconsistent data

Missing Data
* Data is not always available
  - E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
* Missing data may be due to
  - equipment malfunction
  - data that was inconsistent with other recorded data and thus deleted
  - data not entered due to misunderstanding
  - certain data not being considered important at the time of entry
  - history or changes of the data not being registered
* Missing data may need to be inferred.

How to Handle Missing Data?
* Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably
* Fill in the missing value manually: tedious + infeasible?
* Use a global constant to fill in the missing value: e.g., "unknown", a new class?!
* Use the attribute mean to fill in the missing value
* Use the most probable value to fill in the missing value: inference-based, such as a Bayesian formula or a decision tree

Noisy Data
* Noise: random error or variance in a measured variable
* Incorrect attribute values may be due to
  - faulty data collection instruments
  - data entry problems
  - data transmission problems
  - technology limitation
  - inconsistency in naming convention
* Other data problems that require data cleaning
  - duplicate records
  - incomplete data
  - inconsistent data

How to Handle Noisy Data?
* Binning method:
  - first sort data and partition into (equi-depth) bins
  - then smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.
* Clustering
  - detect and remove outliers
* Combined computer and human inspection
  - detect suspicious values and have a human check them
* Regression
  - smooth by fitting the data to regression functions

Simple Discretization Methods: Binning
* Equal-width (distance) partitioning:
  - It divides the range into N intervals of equal size: uniform grid
  - If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B-A)/N.
  - The most straightforward
  - But outliers may dominate presentation
  - Skewed data is not handled well.
* Equal-depth (frequency) partitioning:
  - It divides the range into N intervals, each containing approximately the same number of samples
  - Good data scaling
  - Managing categorical attributes can be tricky.

Binning Methods for Data Smoothing
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
* Smoothing by bin means (rounded to the nearest dollar):
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34

Data Integration
* Data integration: combines data from multiple sources into a coherent store
* Schema integration
  - integrate metadata from different sources
  - Entity identification problem: identify real-world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#
* Detecting and resolving data value conflicts
  - for the same real-world entity, attribute values from different sources are different
  - possible reasons: different representations, different scales, e.g., metric vs. British units

Handling Redundant Data
* Redundant data occur often when integrating multiple databases
  - The same attribute may have different names in different databases
* Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality

Data Transformation
* Smoothing: remove noise from data
* Aggregation: summarization, data cube construction
* Generalization: concept hierarchy climbing
* Normalization: scaled to fall within a small, specified range
  - min-max normalization
  - z-score normalization
  - normalization by decimal scaling

Data Transformation: Normalization
* min-max normalization
* z-score

