5/12/11 Lecture 3
STOR 155 Introductory Statistics
Lecture 3: Displaying Distributions with Numbers
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL

Exploratory Data Analysis (EDA)
• Graphical Visualization: Shape
  – Bar Graph
  – Pie Chart
  – Stem plot
  – Histogram
  – Time plot (not for distribution, but for changing pattern over time)
• Numerical Summary: Center and Spread
  – Center:
    • Mean, Median and Mode
  – Spread:
    • Quartiles, Five-number summary and Boxplot
    • Standard Deviation

• What is the average highway (city) mileage?
• What is the "middle value" of highway (city) mileage?

Measuring center: the mean
• Mean = average value
• The sample mean $\bar{x}$: If the $n$ observations in a sample are $x_1, x_2, \ldots, x_n$, then their mean is
  $\bar{x} = \dfrac{x_1 + x_2 + \cdots + x_n}{n} = \dfrac{1}{n}\sum_{i=1}^{n} x_i$

Measuring center: the median

Example: Fuel economy (miles per gallon) for 2004 two-seater cars
• Look at the Highway mileage (without Honda Insight):
  – Mean
  – Median
• How about with Honda Insight?
  – Mean
  – Median
• What can you say?

Example: Salary Survey of UNC Graduates
• Survey a certain number of graduates from UNC.
• A lot of departments are surveyed.
• Question:
  – Which department produces students that earn the most on average ten years after they got their degrees?
• Answer:
  – Geography!!!!??????
  – Michael Jordan

Mean vs Median
• Mean:
  – easy to calculate
  – easy to work with algebraically
  – highly affected by outliers
  – Not a resistant measure
• Median:
  – can be time consuming to calculate
  – more resistant to a few extreme observations (sometimes outliers)
  – robust

Mode
• The most frequent value in the data
• Important for categorical data
• Possible to have more than one mode

Mean, Median and Mode
• If the distribution is exactly symmetric and unimodal, the mean, the median and the mode are exactly the same.
• If the distribution is skewed, the three measures differ.
[Figure: two panels, each labeled Mean, Median, Mode]

Which one to use?
• Different by definition
  – Mean and median are unique, and are defined only for quantitative variables.
  – Mode may not be unique.
  – Mode is defined for categorical variables also.
• The choice depends on the shape of the distribution, the type of data and the purpose of your study
  – Skewed: median
  – Categorical: mode
  – Total quantity: mean
  – …

Numerical Summary for Distributions
• Center
  – Mean
  – Median
  – Mode
• Spread
  – Quartiles, Five-number summary and Boxplot
  – Standard Deviation

Why do we need "Spread"?
• Knowing the center of a distribution alone is not a good enough description of the data.
  – Two basketball players with the same shooting percentage may be very different in terms of consistency.
  – Two companies may have the same average salary, but very different distributions.
• We need to know the spread, or the variability of the values.

A raw measure: Range
• Range = maximum - minimum
• Depends only on two values
• Tends to increase with larger samples
• Affected by outliers
  – Not robust

Percentiles
• Percentiles are derived from the ordered data values.
• The pth percentile is the value such that p percent of the observations fall at or below it.
• The median = the 50th percentile.

Quartiles
• The sample quartiles are the values that divide the sorted sample into quarters, just as the median divides it in half.
• The most commonly used quantiles are
  – The median M = 50th percentile
  – The 1st (lower) quartile Q1 = 25th percentile
  – The 3rd (upper) quartile Q3 = 75th percentile

Calculations of Quartiles

Examples: 2004 Gasoline-powered Two-seater Cars
• Highway mileages of the 20 gasoline-powered two-seater cars:
  13 15 16 16 17 19 20 22 23 23 | 23 24 25 25 26 28 28 28 29 32
• Q1 = median of {13 15 16 16 17 19 20 22 23 23} =
• Q3 = median of {23 24 25 25 26 28 28 28 29 32} =

Interquartile Range: IQR
• IQR = Q3 - Q1
  – The range of the center half of the data
  – A resistant measure for spread
• IQR can be used to identify suspected outliers.
• Rule-of-thumb:
  – An observation is called a suspected outlier if it falls more than 1.5*IQR above Q3 or below Q1.

Examples: 2004 Gasoline-powered Two-seater Cars
• Highway mileages of the 20 gasoline-powered two-seater cars:
  13 15 16 16 17 19 20 22 23 23 | 23 24 25 25 26 28 28 28 29 32
• IQR = Q3 - Q1 =
• 1.5*IQR =
• Q3 + 1.5*IQR =
• Q1 - 1.5*IQR =
• Any suspected outliers?

Examples: 2004 Two-Seater Cars
• Highway mileages of the 21 two-seater cars:
  13 15 16 16 17 19 20 22 23 23 23 24 25 25 26 28 28 28 29 32 66
• Q1 =
• Q3 =
• IQR = Q3 - Q1 =
• 1.5*IQR =
• Q3 + 1.5*IQR =
• Q1 - 1.5*IQR =
• Any suspected outliers?

The five-number summary
• To get a quick summary of both center and spread, use the following five-number summary:
  Minimum  Q1  M  Q3  Maximum

Example: HWY Gas Mileage of 2004 Two-seater/Mini Cars
• Two-seater
  – Five-number summary:
    • 13, 18, 23, 27, 32
• Mini-compact (the other half of Fig. 1.10)
  – Five-number summary:
    • 19, 23, 26, 29, 32

Boxplots
• A visual representation of the five-number summary.
• A boxplot consists of
  – A central box spans the quartiles Q1 and Q3.
  – A line inside the box marks the median M.
  – Lines extend from the box out to the smallest and largest observations.

Boxplots of highway/city gas mileages (Two-seaters/minicompacts)

Pros and cons of Boxplots
• Location of the median line in the box indicates symmetry/asymmetry.
• Best used for side-by-side comparison of more than one distribution at a glance.
• Less detailed than histograms or stem plots.
• The box focuses attention on the central half of the data.

Income for different Education Level

Modified Boxplot
• The current boxplot cannot reveal possible outliers.
• To modify it,
  – the two lines extend out from the
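
The quartile, IQR and 1.5*IQR calculations from the two-seater examples can be sketched in Python as follows. This is only an illustrative sketch, not the lecture's own code: the function names `quartiles` and `suspected_outliers` are hypothetical, and for an odd number of observations it assumes the middle value is left out of both halves, a convention the slides do not state explicitly.

```python
from statistics import mean, median

def quartiles(values):
    """Return (Q1, M, Q3), taking Q1 and Q3 as medians of the lower and upper halves."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]            # observations below the median position
    upper = data[half + n % 2:]    # assumption: skip the middle value when n is odd
    return median(lower), median(data), median(upper)

def suspected_outliers(values):
    """Apply the 1.5*IQR rule of thumb from the slides."""
    q1, _, q3 = quartiles(values)
    fence = 1.5 * (q3 - q1)
    return [x for x in values if x < q1 - fence or x > q3 + fence]

# Highway mileages from the slides: 20 gasoline-powered two-seaters,
# then the same list with the 66 mpg car added (21 two-seaters).
gas_only = [13, 15, 16, 16, 17, 19, 20, 22, 23, 23,
            23, 24, 25, 25, 26, 28, 28, 28, 29, 32]
all_cars = gas_only + [66]

for label, cars in (("20 cars", gas_only), ("21 cars", all_cars)):
    q1, m, q3 = quartiles(cars)
    print(label, "mean:", round(mean(cars), 2), "five-number summary:",
          (min(cars), q1, m, q3, max(cars)),
          "suspected outliers:", suspected_outliers(cars))
```

For the 20 gasoline-powered cars this reproduces the five-number summary 13, 18, 23, 27, 32 given on the slides and flags no observations. Adding the 66 mpg car leaves the median at 23 but pulls the mean upward, and 66 is the only value flagged by the rule of thumb.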