1/25/11 Lecture 4

Page 1. STOR 155 Introductory Statistics
Lecture 4: Displaying Distributions with Numbers (II)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL

Page 2. Numerical Summary for Distributions
• Center
  – Mean
  – Median
  – Mode
• Spread
  – Quartiles, IQR, five-number summary and boxplot
  – Standard deviation (starting from page 14)

Page 3. Examples: 2004 Two-Seater Cars
• Highway mileages of the 21 two-seater cars:
  13 15 16 16 17 19 20 22 23 23 23 24 25 25 26 28 28 28 29 32 66
• Q1 = 18
• Q3 = 28
• IQR = Q3 - Q1 = 10
• 1.5*IQR = 15
• Q3 + 1.5*IQR = 43
• Q1 - 1.5*IQR = 3
• 66 is a suspected outlier.

Page 4. The five-number summary
• To get a quick summary of both center and spread, use the following five-number summary:
  Minimum  Q1  M  Q3  Maximum

Page 5. Example: HWY Gas Mileage of 2004 Two-seater/Mini Cars
• Two-seater
  – Five-number summary (with the suspected outlier 66 left out):
    • 13, 18, 23, 27, 32
• Mini-compact
  – Five-number summary:
    • 19, 23, 26, 29, 32

Page 6. Boxplots
• A boxplot is a visual representation of the five-number summary.
• A boxplot consists of
  – A central box that spans the quartiles Q1 and Q3.
  – A line inside the box that marks the median M.
  – Lines that extend from the box out to the smallest and largest observations.

Page 7. Boxplots of highway/city gas mileages (Two-seaters/minicompacts) [figure]

Page 8. Pros and cons of Boxplots
• Location of the median line in the box indicates symmetry/asymmetry.
• Best used for side-by-side comparison of more than one distribution at a glance.
• Less detailed than histograms or stem plots.
• The box focuses attention on the central half of the data.

Page 9. Income for different Education Level [figure]

Page 10. Modified Boxplot
• The boxplot as defined so far cannot reveal possible outliers.
• To modify it,
  – the two lines extend out from the central box only to the smallest and largest observations that are not suspected outliers.
  – Observations more than 1.5*IQR outside the box are plotted as individual points.

Page 11. Call length (seconds) [figure]

Page 12. HG for count in a given time interval [figure]

Page 14. Sample Variance $s^2$
• Deviation from the mean: the difference between an observation and the sample mean, $x_i - \bar{x}$.
• Sample variance $s^2$: the average of the squares of the deviations of the observations from their mean:
  $s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1} = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2$

Page 15. Sample Standard Deviation $s$
• Sample standard deviation $s$: the square root of the sample variance:
  $s = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$

Page 16. Toy Examples
• Data: -2, -1, 0, 1, 2
• What are the sample variance and the standard deviation?
• How about this? 40, 40, 40, 40, 40

Page 17. Remarks on the definition of Standard Deviation (S.D.)
• The sum of the deviations of the obs from their mean is always 0.
• Why “square the deviations” rather than “absolute deviations”?
  – Mean is a natural center under the “squaring”.
  – S.D. is a natural measure of spread for the normal distributions.

Page 18. Remarks on S.D.
• Why “S.D.” rather than “variance”?
  – S.D. is natural for measuring spread for normal dist.
  – S.D. is in the original scale.
• Why “n-1” rather than “n”?
  – Intuitively speaking, S.D. is not defined for n=1.
  – Sum of deviations is always 0, which means “if we know (n-1) of them, we know the last one”.
  – Only (n-1) deviations can change freely.
  – n-1: degrees of freedom.

Page 19. Properties of the standard deviation (S.D.) $s$
• s measures the spread about the mean;
• s should be used only when the mean is chosen to measure the center;
• s=0 if and only if there is no spread;
  – When?
• s>0 almost always, increases with more spread;
• s, like the mean, is not resistant, i.e.
sensitive to outliers.

Page 20. Examples: 2004 Two-seater Cars
Highway mileages of the 21 two-seater cars:
  13 15 16 16 17 19 20 22 23 23 23 24 25 25 26 28 28 28 29 32 66
• Gasoline-powered cars
  – Mean: 22.6
  – S.D. = 5.3
• All cars
  – Mean: 24.7
  – S.D. = 10.8

Page 21. Three measures of spread
• The range is the spread of all the observations;
• The interquartile range is the spread of (roughly) the middle 50% of the observations;
• S.D. measures how far the observations lie from the sample mean. It can be regarded as a “typical” distance of the observations from their mean.

Page 22. The five-number summary vs Mean and S.D.
• The five-number summary is preferred for a skewed distribution or a distribution with strong outliers.
• $\bar{x}$ and $s$ are preferred for reasonably symmetric distributions that are free of outliers.
• Always plot your data first.
• Use boxplots.

Page 23. Changing the unit of measurement
• The same variable can be recorded in different units of measurement.
• Distance:
  – Miles (US) vs Kilometers (Elsewhere)
  – 1 mile ≈ 1.6 km
  – 1 km = ? mile
• Temperature
  – Fahrenheit (US) vs Celsius (Elsewhere)
  – 0 °F = -17.8 °C
  – 100 °F = 37.8 °C
  – 212 °F = 100 °C

Page 24. Boiled Billy
• An Australian student, Billy, has recently been on a trip to the States. Soon after he arrived there, he caught a cold and had a fever.
• He went to see Doctor Z. Doctor Z measured his body temperature and told Billy, “Just relax! No big deal! It’s only a little above 100 degrees!”
• “100!!!”, Billy yelled, “How can you say it’s not a big deal? I am boiled…”

Page 25. Linear Transformation
• A linear transformation changes the original variable $x$ into a new variable $x_{new}$ according to the following equation:
  $x_{new} = a + bx$
• Temperature: Celsius vs Fahrenheit
  – $x$ in Celsius, $x_{new}$ in Fahrenheit: $x_{new} = 32 + \frac{9}{5}x$
  – How about the inverse transformation?

Page 26. Effects of Linear Transformation
• The shape of a distribution remains unchanged, except that the direction of the skewness might change.
  – When?
• Measures of center and spread change.
  – Multiplying each obs by a positive number b multiplies both measures of center and spread by b;
  – Adding the same number a to each obs adds a to measures of center and to percentiles, but does not change measures of spread.

Page 27. Example:
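The effects of a linear transformation $x_{new} = a + bx$ on center and spread can be checked numerically. The sketch below is only an illustration, not the lecture's own example: it applies the Celsius-to-Fahrenheit transformation (a = 32, b = 9/5) to the three Celsius values quoted in the unit-conversion list, and the helper name `linear_transform` is made up for this purpose.

```python
from statistics import mean, median, stdev

def linear_transform(data, a, b):
    """Apply x_new = a + b*x to every observation (hypothetical helper)."""
    return [a + b * x for x in data]

# Celsius values quoted in the lecture's unit-conversion list.
celsius = [-17.8, 37.8, 100.0]

a, b = 32, 9 / 5  # Celsius -> Fahrenheit, with b > 0
fahrenheit = linear_transform(celsius, a, b)

# Measures of center are multiplied by b and then shifted by a.
print("mean:  ", mean(fahrenheit), "vs", a + b * mean(celsius))
print("median:", median(fahrenheit), "vs", a + b * median(celsius))

# Measures of spread are multiplied by b; adding a does not change them.
print("S.D.:  ", stdev(fahrenheit), "vs", b * stdev(celsius))

# The inverse is again linear: x = (x_new - a) / b = (-a/b) + (1/b) * x_new.
back_to_celsius = linear_transform(fahrenheit, -a / b, 1 / b)
print("inverse:", back_to_celsius)
```

Each printed pair should agree up to floating-point rounding, and the last line recovers the original Celsius values, which is one way to approach the question about the inverse transformation.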
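Returning to the two-seater highway mileages used throughout the lecture, the numerical summaries can be reproduced with a short Python sketch. It rests on two assumptions that the slides do not state outright: quartiles are computed as the medians of the lower and upper halves of the sorted data (leaving the overall median out when n is odd), and the "gasoline-powered" figures are those obtained after dropping the suspected outlier 66. The function names are made up for illustration.

```python
from statistics import mean, median, stdev

# Highway mileages of the 21 two-seater cars, copied from the lecture.
mpg = [13, 15, 16, 16, 17, 19, 20, 22, 23, 23, 23,
       24, 25, 25, 26, 28, 28, 28, 29, 32, 66]

def five_number_summary(data):
    """Return (min, Q1, M, Q3, max).

    Assumption: Q1 and Q3 are the medians of the lower and upper halves
    of the sorted data, with the overall median left out when n is odd.
    Other software may use a different quartile rule.
    """
    if len(data) < 2:
        raise ValueError("need at least two observations")
    xs = sorted(data)
    half = len(xs) // 2
    return xs[0], median(xs[:half]), median(xs), median(xs[-half:]), xs[-1]

def suspected_outliers(data):
    """Observations more than 1.5*IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(data)
    reach = 1.5 * (q3 - q1)
    return [x for x in data if x < q1 - reach or x > q3 + reach]

outliers = suspected_outliers(mpg)
# Assumption: the gasoline-powered cars are the ones left after
# removing the suspected outlier.
gasoline = [x for x in mpg if x not in outliers]

print("all cars:", five_number_summary(mpg), "suspected outliers:", outliers)
print("gasoline:", five_number_summary(gasoline))
print("all cars mean, S.D.:", round(mean(mpg), 1), round(stdev(mpg), 1))
print("gasoline mean, S.D.:", round(mean(gasoline), 1), round(stdev(gasoline), 1))
```

Under these assumptions the output matches the lecture: Q1 = 18 and Q3 = 28 with 66 flagged for all 21 cars, the five-number summary 13, 18, 23, 27, 32 once 66 is removed, and mean and S.D. of 24.7 and 10.8 (all cars) versus 22.6 and 5.3 (gasoline-powered). `stdev` divides by n - 1, as in the definition of s.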