The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
STAT 155 Introductory Statistics
Lecture 4: Displaying Distributions with Numbers (II)
9/7/06

Numerical Summary for Distributions
- Center
  - Mean
  - Median
  - Mode
- Spread
  - Quartiles, IQR
  - Five-number summary and boxplot
  - Standard deviation

Examples: 2004 Two-Seater Cars
The highway mileages of the 21 two-seater cars:
13 15 16 16 17 19 20 22 23 23 23 24 25 25 26 28 28 28 29 32 66
- Q1 = 18, Q3 = 28
- IQR = Q3 - Q1 = 10
- 1.5 × IQR = 15
- Q3 + 1.5 × IQR = 43
- Q1 - 1.5 × IQR = 3
- 66 is a suspected outlier.

The five-number summary
To get a quick summary of both center and spread, use the following five-number summary:
Minimum, Q1, M, Q3, Maximum

Example: HWY Gas Mileage of 2004 Two-seater and Minicompact Cars
- Two-seater five-number summary: 13, 18, 23, 27, 32
- Minicompact five-number summary: 19, 23, 26, 29, 32

Boxplots: a visual representation of the five-number summary
A boxplot consists of:
- A central box that spans the quartiles Q1 and Q3.
- A line inside the box that marks the median M.
- Lines that extend from the box out to the smallest and largest observations.

[Figure: boxplots of highway/city gas mileages for two-seaters and minicompacts]

Pros and cons of Boxplots
- The location of the median line in the box indicates symmetry or asymmetry.
- Best used for side-by-side comparison of more than one distribution at a glance.
- Less detailed than histograms or stem plots.
- The box focuses attention on the central half of the data.

[Figure: income for different education levels]

Modified Boxplot
- The current boxplot cannot reveal possible outliers.
- To modify it, the two lines extend out from the central box only to the smallest and largest observations that are not suspected outliers.
- Observations more than 1.5 × IQR outside the box are plotted as individual points.

[Figures: call length (seconds)]

Sample Variance \(s^2\)
- Deviation from mean: the difference between an observation and the sample mean, \(x_i - \bar{x}\).
- Sample variance \(s^2\): the average of the squares of the deviations of the observations from their mean.

\[ s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1} = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]

Sample Standard Deviation \(s\)
Sample standard deviation \(s\): the square root of the sample variance.

\[ s = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2} \]

Toy Examples
- Data: 2 1 0 1 2. What are the sample variance and the standard deviation?
- How about this: 40 40 40 40 40?

Remarks on the definition of Standard Deviation (S.D.)
- The sum of the deviations of the observations from their mean is always 0.
- Why square the deviations rather than use absolute deviations?
  - The mean is a natural center under the squaring.
  - S.D. is a natural measure of spread for the normal distributions.

Remarks on S.D.
- Why S.D. rather than variance?
  - S.D. is natural for measuring spread for normal distributions.
  - S.D. is in the original scale.
- Why n - 1 rather than n?
  - Intuitively speaking, S.D. is not defined for n = 1.
  - The sum of the deviations is always 0, which means that if we know n - 1 of them, we know the last one.
  - Only n - 1 deviations can change freely: n - 1 degrees of freedom.

Properties of the standard deviation (S.D.) s
- s measures the spread about the mean.
- s should be used only when the mean is chosen to measure the center.
- s = 0 if and only if there is no spread. When? Otherwise s > 0, and s increases as the spread increases.
- s, like the mean, is not resistant. Even less resistant. Why?

Examples: 2004 Two-seater Cars
The highway mileages of the 21 two-seater cars:
13 15 16 16 17 19 20 22 23 23 23 24 25 25 26 28 28 28 29 32 66
- Gasoline-powered cars: Mean = 22.6, S.D. = 5.3
- All cars: Mean = 24.7, S.D. = 10.8

Three measures of spread
- The range is the spread of all the observations.
- The interquartile range is the spread of roughly the middle 50% of the observations.
- S.D. is a measure of the distance from the sample mean.
- S.D. can be regarded as a typical distance of the observations from their
  mean.

The five-number summary vs. Mean and S.D.
- The five-number summary is preferred for a skewed distribution or a distribution with strong outliers.
- \(\bar{x}\) and \(s\) are preferred for reasonably symmetric distributions that are free of outliers.
- Always plot your data first! Use boxplots.

Changing the unit of measurement
The same variable can be recorded in different units of measurement.
- Distance: miles (US) vs. kilometers (elsewhere)
  - 1 mile ≈ 1.6 km
  - 1 km ≈ 0.62 mile
- Temperature: Fahrenheit (US) vs. Celsius (elsewhere)
  - 0 °F ≈ -17.8 °C
  - 100 °F ≈ 37.8 °C
  - 212 °F = 100 °C

Boiled Billy
An Australian student, Billy, has recently been on a trip to the States. Soon after he arrived there, he caught a cold and had a fever. He went to see Doctor Z. Doctor Z measured his body temperature and told Billy, "Just relax. No big deal. It's only a little above 100 degrees."
"100?!" Billy yelled. "How can you say it's not a big deal? I am boiled!"

Linear Transformation
A linear transformation changes the original variable \(x\) into a new variable \(x_{new}\) according to the following equation:

\[ x_{new} = a + bx \]

Temperature, Celsius vs. Fahrenheit, with \(x\) in Celsius and \(x_{new}\) in Fahrenheit:

\[ x_{new} = 32 + \frac{9}{5} x \]

How about the reverse transformation?

Effects of Linear Transformation
- The shape of a distribution remains unchanged, except that the direction of the skewness might change. When?
- Measures of center and spread change:
  - Multiplying each observation by a positive number b multiplies both measures of center and measures of spread by b.
  - Adding the same number a to each observation adds a to measures of center and to percentiles, but does not change measures of spread.

Example: Salary Raise
A sample was taken of the salaries of 20 employees of a large company. Suppose everyone will receive a $3000 increase. How will the standard deviation of the salaries change? How about the mean? How about the median? How about Q1 and Q3?

…
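
The effects of a linear transformation described above can be checked numerically. The following is a minimal Python sketch, not part of the original slides. The helper names `linear_transform` and `summarize` are made up for illustration, the salaries are hypothetical values (the slides give no salary data), and the quartiles use the default convention of Python's `statistics.quantiles`, which may differ from the hand method used in the course.

```python
from statistics import mean, median, quantiles, stdev


def linear_transform(data, a, b):
    """Apply x_new = a + b * x to every observation."""
    return [a + b * x for x in data]


def summarize(data):
    """Return measures of center, position, and spread for one sample."""
    # Assumption: Python's default quartile convention, not the course's hand rule.
    q1, _, q3 = quantiles(data, n=4)
    return {
        "mean": mean(data),
        "median": median(data),
        "Q1": q1,
        "Q3": q3,
        "IQR": q3 - q1,
        "s": stdev(data),  # sample S.D., divides by n - 1
    }


# Celsius temperatures taken from the unit-conversion slide.
celsius = [-17.8, 37.8, 100.0]
fahrenheit = linear_transform(celsius, a=32, b=9 / 5)

# Hypothetical salaries, invented only to illustrate the salary-raise question.
salaries = [32000, 41000, 45000, 52000, 78000]
raised = linear_transform(salaries, a=3000, b=1)

for label, data in [("Celsius", celsius), ("Fahrenheit", fahrenheit),
                    ("salaries", salaries), ("after raise", raised)]:
    print(label, summarize(data))
```

In this sketch, multiplying by b = 9/5 scales both the center and the spread, while adding a (32 degrees, or the 3000 raise) shifts the mean, median, and quartiles and leaves the S.D. and IQR as they were.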
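
The numbers quoted for the two-seater highway mileages (five-number summary, the 1.5 × IQR rule, and the mean and S.D. with and without the value 66) can be reproduced with a short Python sketch. It is not part of the original slides. The function names are hypothetical, and the quartile rule used here (Q1 and Q3 as the medians of the observations below and above the position of the overall median) is an assumption, chosen because it agrees with the values shown in the lecture.

```python
from math import sqrt
from statistics import median

# Highway mileages of the 21 two-seater cars, copied from the lecture.
mileages = [13, 15, 16, 16, 17, 19, 20, 22, 23, 23, 23,
            24, 25, 25, 26, 28, 28, 28, 29, 32, 66]


def five_number_summary(data):
    """Return (minimum, Q1, M, Q3, maximum).

    Assumed convention: Q1 and Q3 are the medians of the observations
    below and above the position of the overall median.
    """
    xs = sorted(data)
    half = len(xs) // 2
    lower = xs[:half]             # observations below the median position
    upper = xs[len(xs) - half:]   # observations above the median position
    return xs[0], median(lower), median(xs), median(upper), xs[-1]


def suspected_outliers(data):
    """Observations more than 1.5 * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(data)
    fence = 1.5 * (q3 - q1)
    return [x for x in data if x < q1 - fence or x > q3 + fence]


def sample_mean(data):
    return sum(data) / len(data)


def sample_sd(data):
    """Square root of the sample variance, which divides by n - 1."""
    x_bar = sample_mean(data)
    return sqrt(sum((x - x_bar) ** 2 for x in data) / (len(data) - 1))


outliers = suspected_outliers(mileages)
trimmed = [x for x in mileages if x not in outliers]

print("five-number summary:", five_number_summary(mileages))
print("suspected outliers:", outliers)
print("all cars: mean %.1f, s %.1f" % (sample_mean(mileages), sample_sd(mileages)))
print("without outliers: mean %.1f, s %.1f" % (sample_mean(trimmed), sample_sd(trimmed)))
```

Dropping the single suspected outlier moves the S.D. far more than it moves the median or the quartiles, which illustrates why s is described as not resistant.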