Lecture 2
Describing data with graphs and numbers. Normal Distribution. 
Data relationships.

Describing distributions with numbers
• Mean
• Median
• Quartiles
• Five-number summary. Boxplots
• Standard deviation

Mean
• The mean: the arithmetic mean of a data set (average value)
• Denoted by $\bar{x}$:
  $\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$
• Mean highway mileage for 19 2-seaters:
  – Sum: 24 + 30 + … + 30 = 490
  – Divide by n = 19
  – Average: 25.8 miles/gallon
• Problem: the Honda Insight gets 68 miles/gallon! If we exclude it, the mean mileage is 23.4 miles/gallon.
• The mean can be easily influenced by outliers. It is not a robust measure of center.

Median
• The median is the midpoint of a distribution.
• The median is a resistant or robust measure of center.
• Not sensitive to extreme observations
• In a symmetric distribution, mean = median
• In a skewed distribution, the mean is further out in the long tail than the median is.
• Example: house prices, usually right skewed
  – The mean price of existing houses sold in 2000 in Indiana was 176,200. (The mean chases the right tail.)
  – The median price of these houses was 139,000.

Measure of center: the median
The median is the midpoint of a distribution: the number such that half of the observations are smaller and half are larger.
1. Sort observations by size. n = number of observations.

Observations (n = 24):
  0.6  1.2  1.6  1.9  1.5  2.1  2.3  2.3  2.5  2.8  2.9  3.3
  3.4  3.6  3.7  3.8  3.9  4.1  4.2  4.5  4.7  4.9  5.3  5.6
n = 24 ⇒ n/2 = 12
Median = (3.3 + 3.4)/2 = 3.35
2.b. If n is even, the median is the mean of the two middle observations.

Observations (n = 25):
  0.6  1.2  1.6  1.9  1.5  2.1  2.3  2.3  2.5  2.8  2.9  3.3
  3.4
  3.6  3.7  3.8  3.9  4.1  4.2  4.5  4.7  4.9  5.3  5.6  6.1
n = 25 ⇒ (n + 1)/2 = 26/2 = 13
Median = 3.4
2.a. 
If n is odd, the median is observation (n + 1)/2 down the list.

[Figure: mean and median for a symmetric distribution and for left-skewed and right-skewed distributions]

Comparing the mean and the median
The mean and the median coincide when the distribution is exactly symmetric. The median is a measure of center that is resistant to skew and outliers. The mean is not.

Mean and median of a distribution with outliers
[Figure: percent of people dying; $\bar{x} = 3.4$ without the outliers, $\bar{x} = 4.2$ with the outliers]
The mean is pulled to the right a lot by the outliers (from 3.4 to 4.2). The median, on the other hand, is only slightly pulled to the right by the outliers (from 3.4 to 3.6).

Impact of skewed data
• Disease X: mean and median of a symmetric distribution. Mean and median are the same: $\bar{x} = 3.4$, $M = 3.4$.
• Multiple myeloma: … and a right-skewed distribution. The mean is pulled toward the skew: $\bar{x} = 3.4$, $M = 2.5$.

Measures of spread: Quartiles
• Quartiles: divide the data into four parts
• p-th percentile: p percent of the observations fall at or below it
• Median: 50th percentile
• Q1, first quartile: 25th percentile (median of the lower half of the data)
• Q3, third quartile: 75th percentile (median of the upper half of the data)

Using R:
• First things first: import the data. I prefer to use Excel first to save the data into a .csv file (comma-separated values).
• Read the file TA01_008.XLS from the CD and save it as TA01_008.csv
• Now R: I like to use tinn-R as the editor. Open tinn-R and save a file in the same directory where you put the .csv file.
• Now go to R/Rgui/ and click Initiate preferred. If everything is configured correctly, an R window should open.
• Now type and send this line to R:
• table1.08=read.csv("TA01_008.csv",header=TRUE)
– This will import the data into R, also telling R that the first line in the data contains the variable names.
– table1.08 has a “table” (data frame) structure. 
To access individual components in it you have to use table1.08$nameofvariable, for example:
• table1.08$CarType
– Produces:
   [1] Two  Two  Two  Two  Two  Two  Two  Two  Two  Two  Two  Two  Two  Two  Two
  [16] Two  Two  Two  Two  Mini Mini Mini Mini Mini Mini Mini Mini Mini Mini Mini
  Levels: Mini Two
– This is a vector, and notice that R knows it is a categorical variable.
• mean(x) calculates the mean of variable x
• median(x) will give the median
• In fact you should read section 3.1 in the R textbook for all the functions you will need
• summary(data.object) is another useful function. In fact:
• summary(table1.08)
   CarType       City            Highway
   Mini:11   Min.   : 8.00   Min.   :13.00
   Two :19   1st Qu.:16.00   1st Qu.:22.25
             Median :18.00   Median :25.50
             Mean   :18.90   Mean   :25.80
             3rd
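
The median and quartile rules from earlier in this lecture can be tried out in R on the 25 observations from the median example. This is only a minimal sketch: the vector x is typed in by hand from the slide, and quartiles_by_halves() is a hypothetical helper written for illustration (it is not a built-in R function). It follows the "median of the lower half / upper half" rule and assumes at least two observations.

```r
# Observations from the lecture's median example (n = 25), typed in by hand.
x <- c(0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
       3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1)

median(x)        # odd n: observation (n + 1)/2 = 13 of the sorted list, 3.4
median(x[-25])   # drop the last value (6.1): even n = 24, (3.3 + 3.4)/2 = 3.35

# Hypothetical helper (not part of R): quartiles as the medians of the
# lower and upper halves, leaving out the middle observation when n is odd.
quartiles_by_halves <- function(v) {
  v <- sort(v)
  half <- floor(length(v) / 2)   # number of observations in each half
  c(Q1 = median(head(v, half)),
    M  = median(v),
    Q3 = median(tail(v, half)))
}

quartiles_by_halves(x)

# R's built-in quantile() uses an interpolation rule by default, so its
# quartiles may differ slightly from the "median of each half" rule.
quantile(x, c(0.25, 0.5, 0.75))
```

Comparing the last two results on the same data shows that software may report slightly different quartiles than the hand rule, even though the median agrees.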