UNL PSYC 942 - Transformations & Data Cleaning - D2695551


School name University of Nebraska-Lincoln

Course Psyc 942- Fundamentals of Research Design and Data Analysis 2

Pages 6


Transformations & Data Cleaning

• Linear & non-linear transformations
• 2 kinds of Z-scores
• Identifying Outliers & Influential Cases
• Univariate Outlier Analyses -- trimming vs. Winsorizing
• Outlier Analyses for r & ANOVA
• Outlier Analyses for Multiple Regression

Transformations

Linear Transformations
• transformations that involve only +, -, * and /
• used to “re-express” data to enhance communication, e.g., using % correct instead of # correct
• do not cause changes in NHST, CI or Effect Size results
• r, t & F results will be the same before and after transformation
• but if doing t/F, be sure to transform all scores around the overall mean, not to transform each group’s scores around their own mean

Nonlinear Transformations
• transformations involving other operations (e.g., ^2, √ & log)
• used to “symmetrize” data distributions to improve their fit to assumptions of statistical analysis (i.e., Normal Distribution assumption of r, t & F)
• may change r, t & F results -- the hope is that transformed results will be more accurate, because the data will better fit the assumptions

Effect of Linear Transformations on the Mean and Std

We can anticipate the effect on the mean and std of adding, subtracting, multiplying or dividing each value in a data set by a constant.

operation     effect on mean     effect on std
+ ?           Mean + ?           No change
- ?           Mean - ?           No change
* ?           Mean * ?           Std * ?
/ ?           Mean / ?           Std / ?

Commonly Used Linear Transformations & one to watch for ...

Z-score          Z = (x - mean) / std              (m = 0, s = 1)              linear
T-score          T = (Z * 10) + 50                 (m = 50, s = 10)            linear
Standard Test    S = (Z * 100) + 500               (m = 500, s = 100)          linear
y’               y’ = (b * x) + a                  (m = m of y, s ≈ s of y)    linear
%                p_y = (y / max) * 100                                         linear
change score     Δ = “post” score - “pre” score                                nonlinear

A quick word about Z-scores…

There are “two kinds” based on ...
• mean and std of that sample (M, s)
• mean and std of the represented population (μ, σ)

Z_sample = (X - M) / s
• mean of Z-scores always 0
• std of Z-scores always 1
• translates relative scores into easily interpreted values

Z_pop = (X - μ) / σ
• mean > 0 if sample “better” than pop
• mean < 0 if sample “poorer” than pop
• std < 1 if sample s < σ
• std > 1 if sample s > σ
• provides ready comparison of sample mean and std to “population values”

Similarly, you can compose T and Standard scores using μ and σ to get sample-population comparisons using these score-types.

Non-linear transformations -- to “symmetrize” data distributions

The “transformation needed” is related to the extent & direction of skewing.

Most transformations are directly applied to positively skewed distributions. Transformation of negatively skewed distributions first requires “reflection”, which involves subtracting all values from the largest value + 1.

Skewness             Suggested Transformation
< +/- .80            unlikely to disrupt common statistical analyses
+.80 to +1.5         square root transformation (0 & positive values only)
+1.5 to +3.0         log10 transformation (positive values only)
+3.0 or greater      inverse transformation 1 / # (positive values only)
-.80 to -1.5         reflect & square root transformation
-1.5 to -3.0         reflect & log10 transformation (positive values only)
-3.0 or beyond       reflect & inverse transformation 1 / # (positive values only)

How “symmetrizing” works…

[Figure: distribution of the original scores]

Applying √ transformation… √4 = 2, √9 = 3, √16 = 4, √25 = 5, √36 = 6 gives ...

[Figure: distribution of the √-transformed scores]

Influential Cases & Outlier Analysis

The purpose of a sample is to represent a specific population
• the better the sample represents the population, the more accurate will be the inferential statistics and the results of any inferential statistical tests
• sample members are useful only to the extent that they aid the representation
• influential cases are sample members with “extreme” or “non-representative” scores that influence the inferential stats
• outliers are cases with values that are “too extreme” to have come from the same population as the rest of the cases

e.g., the ages of the college sample were 21, 62, 22, 19 & 20
the “62” is an influential case -- will radically increase the mean and std of this sample
-- is “too large” to have come from the same pop as the rest

How outliers influence univariate statistics
• outliers can lead to too-high, too-low or nearly correct estimates of the population mean, depending upon the number and location of the outliers (asymmetrical vs. symmetrical patterns)
• outliers always lead to overestimates of the population std

[Figure panels:]
Mean estimate is “too high” & std is overestimated
Mean estimate is “too low” & std is overestimated
Mean estimate is “right” & std is overestimated

Identifying Outliers for removal

The preferred technique is to be able to identify participants who are not likely to be members of the population of interest.
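The ages example from the influential-cases discussion above (21, 62, 22, 19 & 20) can be checked numerically. The sketch below is only an illustration: the helper name describe is hypothetical, and it assumes the sample std is computed with the n - 1 denominator, which the notes do not specify.

```python
import statistics

# Ages from the text; the "62" is the suspected influential case.
ages = [21, 62, 22, 19, 20]
ages_without_62 = [a for a in ages if a != 62]

def describe(values):
    """Return (mean, sample std); std uses the n - 1 denominator (an assumption)."""
    return statistics.mean(values), statistics.stdev(values)

mean_all, std_all = describe(ages)
mean_rest, std_rest = describe(ages_without_62)

print(f"with 62:    mean = {mean_all:.2f}, std = {std_all:.2f}")
print(f"without 62: mean = {mean_rest:.2f}, std = {std_rest:.2f}")
```

Under these assumptions the single case moves the mean from 20.5 to 28.8 and the std from about 1.3 to about 18.6, which is the "radical increase" the example describes.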
However, often the only indication we have that a participant doesn’t come from the population of interest is that they have an “outlying score”.

So, we will operate under the assumption that “outlying scores” mean that:
1) a participant is probably not a member of the target population (or that something “bad” happened to their score)
2) if included, the data as is would produce a biased estimate of the population value (e.g., mean, std, r, F, etc.) and so should not be included in the data analysis.

Key -- application of the approach must be “hypothesis blind”

Statistical Identification of Outliers

One common procedure was to identify as outliers any data point with a Z-value greater than 2.0 (2.5 & 3.0 were also common suggestions)
• this means that some values that really do belong to the distribution were incorrectly identified as outliers, but simulation and real-data research suggested that statistical estimates were better and more replicable when these procedures were used.
• one problem with this approach is that outliers
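As a rough illustration of the Z-value screening procedure described under "Statistical Identification of Outliers", the sketch below flags scores beyond a chosen cutoff. Several things here are assumptions rather than part of the notes: the function name flag_outliers is hypothetical, the scores are invented for illustration, Z is computed from the sample mean and the n - 1 std, and the absolute value of Z is used so that both tails are screened.

```python
import statistics

def flag_outliers(scores, cutoff=2.0):
    """Return the scores whose |Z| exceeds cutoff.

    Z uses the sample mean and the n - 1 std (an assumption);
    abs() screens both tails (also an assumption).
    """
    m = statistics.mean(scores)
    s = statistics.stdev(scores)
    return [x for x in scores if abs((x - m) / s) > cutoff]

# Hypothetical scores, invented for illustration only.
scores = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 30]

# Try the three cutoffs mentioned in the text.
for cutoff in (2.0, 2.5, 3.0):
    print(cutoff, flag_outliers(scores, cutoff))
```

The cutoff is a parameter so the 2.0, 2.5 and 3.0 suggestions can be compared on the same data; the rule never looks at group membership or the outcome of any test, in keeping with the "hypothesis blind" requirement.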
