The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
STAT 155 Introductory Statistics
Lecture 3: Displaying Distributions with Numbers
9/5/06

Exploratory Data Analysis (EDA)
- Graphical visualization (shape)
  - Bar graph, pie chart
  - Stem plot, histogram
  - Time plot (not for distribution, but for changing pattern over time)
- Numerical summary (center and spread)
  - Center: mean, median, and mode
  - Spread: quartiles (five-number summary and boxplot); standard deviation
  - Choose one from each category

Motivating questions
- What is the average highway/city mileage?
- What is the middle value of highway/city mileage?

Measuring center: the mean
- Mean: the average value.
- The sample mean: if the n observations in a sample are x_1, x_2, ..., x_n, then their mean is
  \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i

Measuring center: the median

Example: Fuel economy (miles per gallon) for 2004 two-seater cars
- Look at the highway mileage w/o the Honda Insight: Mean = ____  Median = ____
- How about with the Honda Insight? Mean = ____  Median = ____
- What can you say?

Example: Salary Survey of UNC Graduates
- Survey a certain number of graduates from UNC. A lot of departments are surveyed.
- Question: Which department produces students that earn the most on average ten years after they got their degrees?
- Answer: Geography (Michael Jordan).

Mean vs. Median
- Mean
  - easy to calculate
  - easy to work with algebraically
  - highly affected by outliers; not a resistant measure
- Median
  - can be time consuming to calculate
  - more resistant to a few extreme observations (sometimes outliers); robust

Mode
- The most frequent value in the data.
- Important for categorical data.
- Possible to have more than one mode.

Mean, Median, and Mode
- If the distribution is exactly symmetric and unimodal, the mean, the median, and the mode are exactly the same.
- If the distribution is skewed, the three measures differ.
[Figure: skewed distributions with the positions of the mode, median, and mean marked]

Which one to use?
- Different by definition:
  - Mean and median are unique and only for quantitative variables.
  - Mode is not unique.
  - Mode is defined for categorical variables also.
- The choice depends on the shape of the distribution, the type of data, and the purpose of your study:
  - Skewed: median
  - Categorical: mode
  - Total quantity: mean

Numerical Summary for Distributions
- Center: mean, median, mode
- Spread: quartiles (five-number summary and boxplot); standard deviation

Why do we need spread?
- Knowing the center of a distribution alone is not a good enough description of the data.
  - Two basketball players with the same shooting percentage may be very different in terms of consistency.
  - Two companies may have the same average salary but very different distributions.
- We need to know the spread, or the variability, of the values.

A raw measure: Range
- Range = maximum - minimum
- Depends only on two values.
- Tends to increase with larger samples.
- Affected by outliers; not robust.

Percentiles
- Percentiles are derived from the ordered data values.
- The pth percentile is the value such that p percent of the observations fall at or below it.
- The median = the 50th percentile.

Quartiles
- The sample quartiles are the values that divide the sorted sample into quarters, just as the median divides it into halves.
- The most commonly used quantiles are:
  - The median M = 50th percentile
  - The 1st (lower) quartile Q1 = 25th percentile
  - The 3rd (upper) quartile Q3 = 75th percentile

Calculations of Quartiles

Examples: 2004 Gasoline-powered Two-seater Cars
- The highway mileages of the 20 gasoline-powered two-seater cars:
  13 15 16 16 17 19 20 22 23 23 23 24 25 25 26 28 28 28 29 32
- Q1 = median of 13 15 16 16 17 19 20 22 23 23
- Q3 = median of 23 24 25 25 26 28 28 28 29 32

Interquartile Range (IQR)
- IQR = Q3 - Q1
- The range of the center half of the data.
- A resistant measure for spread.
- IQR can be used to identify suspected outliers.
- Rule of thumb: an observation is called a suspected outlier if it falls more than 1.5 x IQR above Q3 or below Q1.

Examples: 2004 Gasoline-powered Two-seater Cars
- The highway mileages of the 20 gasoline-powered two-seater cars:
  13 15 16 16 17 19 20 22 23 23 23 24 25 25 26 28 28 28 29 32
- IQR = Q3 - Q1 = ____
- 1.5 x IQR = ____
- Q3 + 1.5 x IQR = ____
- Q1 - 1.5 x IQR = ____
- Any suspected outliers?

Examples: 2004 Two-seater Cars
- The highway mileages of the 21 two-seater cars:
  13 15 16 16 17 19 20 22 23 23 23 24 25 25 26 28 28 28 29 32 66
- Q1 = ____  Q3 = ____
- IQR = Q3 - Q1 = ____
- 1.5 x IQR = ____
- Q3 + 1.5 x IQR = ____
- Q1 - 1.5 x IQR = ____
- Any suspected outliers?

The five-number summary
- To get a quick summary of both center and spread, use the following five-number summary:
  Minimum, Q1, M, Q3, Maximum

Example: HWY Gas Mileage of 2004 Two-seater and Minicompact Cars
- Two-seater five-number summary: 13 18 23 27 32
- Minicompact five-number summary: 19 23 26 29 32

Boxplots: a visual representation of the five-number summary
- A boxplot consists of:
  - A central box that spans the quartiles Q1 and Q3.
  - A line inside the box that marks the median M.
  - Lines that extend from the box out to the smallest and largest observations.

Boxplots of highway/city gas mileages: two-seaters, minicompacts
[Figure: side-by-side boxplots]

Pros and cons of Boxplots
- Location of the median line in the box indicates symmetry/asymmetry.
- Best used for side-by-side comparison of more than one distribution at a glance.
- Less detailed than histograms or stem plots.
- The box focuses attention on the central half of the data.

Income for different Education Level
[Figure]

Modified Boxplot
- The current boxplot cannot reveal possible outliers.
- To modify it, the two lines extend out from the central box only to the smallest and largest observations that are not suspected outliers.
- Observations more than 1.5 x IQR outside the box are plotted as individual points.

Call length (seconds)
[Figures]

Take Home Message
- Center: mean, median, mode
- Spread:
  - Quartiles (Q1, Q3) and the median
  - IQR; identify possible outliers
  - Five-number summary
  - Boxplots and modified boxplots; pros and cons
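
Worked sketch of the two-seater examples
The blanks in the two-seater examples above can be checked with a few lines of Python. This is only a sketch, not part of the original slides. It assumes the quartile rule implied by the example slide (Q1 and Q3 are the medians of the lower and upper halves of the sorted data), and it further assumes that when n is odd the middle observation is left out of both halves. The function names quartiles, five_number_summary, and suspected_outliers are made up for illustration.

```python
from statistics import mean, median


def quartiles(values):
    """Return (Q1, M, Q3) using the median-of-halves rule.

    Assumption: for an odd number of observations the middle value
    is excluded from both the lower and the upper half.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]        # observations before the median position
    upper = data[n - half:]    # observations after the median position
    return median(lower), median(data), median(upper)


def five_number_summary(values):
    """Return (Minimum, Q1, M, Q3, Maximum)."""
    q1, m, q3 = quartiles(values)
    return min(values), q1, m, q3, max(values)


def suspected_outliers(values):
    """Apply the 1.5 x IQR rule of thumb from the lecture."""
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [x for x in values if x < low_fence or x > high_fence]


# Highway mileages of the 20 gasoline-powered two-seater cars (from the slides).
gas_only = [13, 15, 16, 16, 17, 19, 20, 22, 23, 23,
            23, 24, 25, 25, 26, 28, 28, 28, 29, 32]
# The 21-car list from the slides, which adds the value 66.
all_cars = gas_only + [66]

print(mean(gas_only), median(gas_only))            # 22.6 23.0
print(round(mean(all_cars), 2), median(all_cars))  # 24.67 23
print(five_number_summary(gas_only))               # (13, 18.0, 23.0, 27.0, 32)
print(suspected_outliers(gas_only))                # []
print(suspected_outliers(all_cars))                # [66]
```

Adding the single value 66 moves the mean by about two miles per gallon while the median stays at 23, which is the resistance property discussed under Mean vs. Median. The five-number summary for the 20 cars agrees with the two-seater summary given on the slides (13 18 23 27 32).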