UC STAT 2037 - 3. Data Pre-Processing - D3619728


Course Stat 2037- Prob & Stats I

Pages 38


Data Preprocessing
Laura Portell

What is data preprocessing?

Data preprocessing involves transforming raw data into well-formed data sets. Raw data is often incomplete and inconsistently formatted. Good preprocessing is directly tied to the success of any project that involves data analytics.

Major tasks in data preprocessing

1. Data cleaning
2. Data integration
3. Data reduction
4. Data transformation

1. Data cleaning

Data cleaning is the process of removing incorrect, incomplete, and inaccurate data from the datasets and replacing missing values. Two topics in data cleaning are covered here:

1.1 Missing data
1.2 Outliers

1.1 Data cleaning: Missing data

Missing data occur when no data value is stored for a variable in an observation. Missing data are a common occurrence and can have a significant effect on the conclusions that can be drawn from the data.

Do I have missing data?

COUNTA(range of data): counts non-blank cells, or valid data points, within a range.
COUNTBLANK(range of data): counts blank cells, or missing data points, within a range.

Missing data can be handled in various ways. The examples below all start from the same table:

        v1     v2     v3
id1     18      2     17
id2    NAN      3      4
id3     12    NAN     11
id4      8     12      7
id5      9    NAN    NAN

Ignore the tuples

This is usually done when some labels are missing. This approach is suitable only when the dataset is quite large and multiple values are missing within a tuple.

        v1     v2     v3
id1     18      2     17
id4      8     12      7

Fill the missing values

You can choose to fill the missing values manually, by the attribute mean, or by the most probable value. Here each missing value is replaced by the rounded mean of its column.

        v1     v2     v3
id1     18      2     17
id2     12      3      4
id3     12      6     11
id4      8     12      7
id5      9      6     10

Use a global constant to fill in the missing values

Replace all missing values with the same constant, here 0.

        v1     v2     v3
id1     18      2     17
id2      0      3      4
id3     12      0     11
id4      8     12      7
id5      9      0      0
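The three strategies above (ignoring tuples, filling with the attribute mean, filling with a global constant) can be sketched in plain Python. This is a minimal illustration, not the method used in the course: the function names and the `MISSING` marker are hypothetical, and the positions of the missing cells are inferred from the filled-in tables on the slides.

```python
from statistics import mean

MISSING = None  # hypothetical stand-in for NAN / a blank cell

# Example table from the slides, one row per id. The positions of the
# missing cells are inferred from the filled-in tables (an assumption).
rows = {
    "id1": {"v1": 18, "v2": 2, "v3": 17},
    "id2": {"v1": MISSING, "v2": 3, "v3": 4},
    "id3": {"v1": 12, "v2": MISSING, "v3": 11},
    "id4": {"v1": 8, "v2": 12, "v3": 7},
    "id5": {"v1": 9, "v2": MISSING, "v3": MISSING},
}
columns = ["v1", "v2", "v3"]


def count_missing(rows, col):
    """Number of missing cells in one column (similar to COUNTBLANK)."""
    return sum(1 for r in rows.values() if r[col] is MISSING)


def drop_incomplete(rows):
    """Ignore the tuples: keep only rows with no missing value."""
    return {k: r for k, r in rows.items() if MISSING not in r.values()}


def fill_constant(rows, constant=0):
    """Replace every missing cell with the same global constant."""
    return {
        k: {c: (constant if v is MISSING else v) for c, v in r.items()}
        for k, r in rows.items()
    }


def fill_mean(rows, columns):
    """Replace missing cells with the rounded mean of their column."""
    means = {
        c: round(mean(r[c] for r in rows.values() if r[c] is not MISSING))
        for c in columns
    }
    return {
        k: {c: (means[c] if r[c] is MISSING else r[c]) for c in columns}
        for k, r in rows.items()
    }
```

Calling `drop_incomplete(rows)` keeps only id1 and id4, while `fill_mean(rows, columns)` and `fill_constant(rows)` keep all five rows. Dropping rows discards three of the five observations here, which is why that approach is only suitable for large datasets.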
1.1 Data cleaning: Missing data

Use k-nearest neighbors (k-NN) to fill in the missing values

k-NN uses feature similarity to predict the new values.

Before:

        v1     v2     v3
id1     18      2     17
id2    NAN      2     17
id3     10    NAN     11
id4     10     12     10
id5     10    NAN    NAN

After:

        v1     v2     v3
id1     18      2     17
id2     18      2     17
id3     10     12     11
id4     10     12     10
id5     10     12     11

1.2 Data cleaning: Outliers

An outlier is a data point that differs significantly from other observations. An outlier may be due to variability in the measurement, may indicate novel data, or may be the result of experimental error.

Outliers can be detected using the IQR or using the standard deviation.

Using the IQR

The IQR (interquartile range, the 75th percentile minus the 25th percentile) is a robust estimate, in the sense that extreme observations tend not to affect it much, especially when there is a large number of observations.

Calculation steps:

1. Create a lower bound; any value that falls below it is tagged as an outlier.
   Lower bound = 25th percentile - 1.5 × IQR
2. Create an upper bound; any value that falls above it is tagged as an outlier.
   Upper bound = 75th percentile + 1.5 × IQR

Boxplot
https://towardsdatascience.com/understanding-boxplots-5e2df7bcbd51
https://seaborn.pydata.org/generated/seaborn.boxplot.html

Using the standard deviation

1. Create a lower bound; any value that falls more than three standard deviations below the mean is tagged as an outlier.
   Lower bound = mean - 3 × standard deviation
2. Create an upper bound; any value that falls more than three standard deviations above the mean is tagged as an outlier.
   Upper bound = mean + 3 × standard deviation

This approach relates to the 68-95-99.7 rule, which says that for approximately normal data we expect 99.7 percent of observations to fall within three standard deviations of the mean. In this case, outliers are the values beyond that threshold.

1.2 Data cleaning: Impact of outliers

Robust vs. non-robust statistics

Robust summary statistics are measures that are less influenced by the presence of extreme values or outliers: median, 25th percentile or 1st quartile, 75th percentile or 3rd
quartile, and interquartile range (IQR).

Non-robust summary statistics are measures that are more influenced by the presence of extreme values or outliers: minimum, maximum, range, mean, and standard deviation.

Impact of outliers: example (figure not reproduced in the text preview)

1.2 Data cleaning: What to do with outliers

Noisy data is meaningless data that cannot be interpreted by machines. Noise is a random error or variance in a measured variable. It can be generated by faulty data collection, data entry errors, and so on. It can be handled in the following ways: binning, regression, and clustering.

Binning

Binning methods smooth a sorted data value by consulting its neighborhood, that is, the values around it. The whole data set is divided into segments of equal size, and then various methods are applied to each segment to complete the task.

Binning example

Attribute values: 0, 3, 5, 11, 15, 15, 17, 19, 21

Equi-width binning            Equi-frequency binning
Bin 1: < 10                   Bin 1: < 10
Bin 2: 10 to 20               Bin 2: 10 to 16
Bin 3: > 20                   Bin 3: > 16
0 3 5 11 12 15 15 19 21       0 3 5 11 15 15 17 19 21

Regression

Data can be smoothed by fitting it to a regression function. The regression used may be linear (one predictor) or multiple (several predictors). We saw linear regression in Statistics II.

Clustering

This approach groups similar data into clusters. Outliers may go undetected, or they may fall outside the clusters.

2. Data integration

Data integration is the process of combining multiple sources into a single dataset. It is one of the main components of data management.

Example: source tables that share an id column (name; age and city; profession) are combined into one table.

id    name      age    city         profession
1     James     30     Argentina    actor
2     Susan     22     Japan        policewoman
3     Robert    24     Australia    pilot
4     Karen     35     Italy        teacher

3. Data reduction

When the volume of data is huge, databases can become slower, costly to
access, and challenging to store properly. Data reduction aims to present a reduced representation of the data in a data warehouse. We look at two approaches: dimensionality reduction and numerosity reduction.

3.1 Data reduction: Dimensionality reduction

Dimensionality reduction eliminates the data …
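As a supplement to section 1.2, the two outlier-detection rules (IQR bounds and three-standard-deviation bounds) can be sketched in Python. This is only an illustration under stated assumptions: the function names and the sample data are hypothetical and not from the slides, the quartiles use the "inclusive" method of `statistics.quantiles` (other tools may compute percentiles slightly differently), and the standard deviation is the sample standard deviation.

```python
from statistics import mean, quantiles, stdev


def iqr_bounds(values):
    """Return (lower, upper) = (Q1 - 1.5 * IQR, Q3 + 1.5 * IQR)."""
    # method="inclusive" is one of several percentile conventions;
    # spreadsheets and other tools may give slightly different quartiles.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def sd_bounds(values, k=3):
    """Return (lower, upper) = (mean - k * sd, mean + k * sd)."""
    m, s = mean(values), stdev(values)  # sample standard deviation
    return m - k * s, m + k * s


def outliers(values, bounds):
    """Values that fall below the lower bound or above the upper bound."""
    lower, upper = bounds
    return [v for v in values if v < lower or v > upper]


# Hypothetical measurements (not from the slides) with one extreme value.
data = [10, 12, 11, 13, 12, 11, 14, 12, 13, 11, 12, 95]

print("IQR bounds:", iqr_bounds(data), "->", outliers(data, iqr_bounds(data)))
print("SD bounds: ", sd_bounds(data), "->", outliers(data, sd_bounds(data)))
```

With this sample both rules flag only the value 95, but the bounds differ a lot. The IQR bounds stay close to the bulk of the data, while the standard-deviation bounds are stretched by the extreme value itself, because the mean and standard deviation are non-robust statistics.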
