Lecture 2
Describing data with graphs and numbers. Normal Distribution. Data relationships.

Describing distributions with numbers
•Mean
•Median
•Quartiles
•Boxplots
•Standard deviation

Mean
•The mean is the arithmetic mean of a data set (its average value).
•Denoted by $\bar{x}$:
$$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
•Mean highway mileage for 19 two-seaters:
–Sum: 24 + 30 + … + 30 = 490
–Divide by n = 19
–Average: 25.8 miles/gallon
•Problem: the Honda Insight gets 68 miles/gallon! If we exclude it, the mean mileage is 23.4 miles/gallon.
•The mean can be easily influenced by outliers.
It is not a robust measure of center.

Median
•The median is the midpoint of a distribution.
•The median is a resistant (robust) measure of center: it is not sensitive to extreme observations.
•In a symmetric distribution, mean = median.
•In a skewed distribution, the mean is farther out in the long tail than the median.
•Example: house prices are usually right-skewed.
–The mean price of existing houses sold in 2000 in Indiana was 176,200. (The mean chases the right tail.)
–The median price of these houses was 139,000.

Measures of spread: Quartiles
•Quartiles divide the data into four parts.
•p-th percentile: p percent of the observations fall at or below it.
•Median: 50th percentile
•Q1, first quartile: 25th percentile (median of the lower half of the data)
•Q3, third quartile: 75th percentile (median of the upper half of the data)

Using R:
•First things first: import the data. I prefer to use Excel first to save the data into a .csv (comma-separated values) file.
•Read the file TA01_008.XLS from the CD and save it as TA01_008.csv.
•Now R: I like to use Tinn-R as the editor. Open Tinn-R and save a file in the same directory where you put the .csv file.
•Now go to R/Rgui/ and click Initiate preferred. If everything is configured correctly, an R window should open.
•Now type and send this line to R:
•table1.08=read.csv("TA01_008.csv",header=TRUE)
–This imports the data into R and tells R that the first line of the file contains the variable names.
–table1.08 has a “table” structure (a data frame). To access an individual component, use table1.08$nameofvariable, for example:
•table1.08$CarType
–Produces:
•[1] Two Two Two Two Two Two Two Two Two Two Two Two Two Two Two
•[16] Two Two Two Two Mini Mini Mini Mini Mini Mini Mini Mini Mini Mini Mini
•Levels: Mini Two
–This is a vector (a factor); notice that R knows it is a categorical variable.
–In R 4.0.0 and later, read.csv keeps text columns as character by default; adding stringsAsFactors=TRUE to the call should reproduce the factor output shown here.
•mean(x) calculates the mean of variable x.
•median(x) gives the median.
•You should read section 3.1 in the R textbook for all the functions you will need.
•summary(data.object) is another useful function.
In fact:
•summary(table1.08)
–  CarType       City          Highway
–  Mini:11   Min.   : 8.00   Min.   :13.00
–  Two :19   1st Qu.:16.00   1st Qu.:22.25
–            Median :18.00   Median :25.50
–            Mean   :18.90   Mean   :25.80
–            3rd Qu.:20.75   3rd Qu.:28.00
–            Max.   :61.00   Max.   :68.00
•Lastly, to apply a function separately to each group in the data frame (for example, to the Mini cars only), use tapply:
•tapply(table1.08$City,table1.08$CarType,mean)
–    Mini      Two
–18.36364 19.21053
•The tapply call takes the table1.08$City variable, splits it according to the levels of table1.08$CarType, and applies the function mean to each group.
•In the same way you can try:
•tapply(table1.08$City,table1.08$CarType,summary)

Five-Number Summary
•Minimum, Q1, Median, Q3, Maximum
•Boxplot: a visual representation of the five-number summary.
–Central box: Q1 to Q3
–Line inside the box: median
–Extended straight lines: from the lowest to the highest observation, excluding outliers
–Outliers are marked as circles or stars.
•To make boxplots in R, use the function
•boxplot(x)

R code:
•boxplot(table1.08$City)
•boxplot(table1.08$Highway)
•boxplot(table1.08$City~table1.08$CarType)
•boxplot(table1.08$Highway~table1.08$CarType)
•par(mfrow=c(1,2))
•boxplot(table1.08$City~table1.08$CarType)
•boxplot(table1.08$Highway~table1.08$CarType)
•par(mfrow=c(1,1))

The criterion for suspected outliers
•The interquartile range: IQR = Q3 - Q1
•An observation is a suspected outlier if it falls more than 1.5*IQR above the third quartile or below the first quartile.

Standard deviation
•Deviation: $x_i - \bar{x}$
•Variance:
$$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n-1} = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$
•Standard deviation:
$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

Properties of the standard deviation
•The standard deviation is always non-negative.
•s = 0 when there is no spread (all observations are equal).
•s is not resistant to the presence of outliers.
•The five-number summary usually describes a skewed distribution, or a distribution with outliers, better.
•The mean and standard deviation are usually used for reasonably symmetric distributions without outliers.

Linear Transformations: changing units of measurements
•$x_{new} = a + b\,x_{old}$
•Common conversions
–$x_{miles} = 0.62\,x_{km}$: a distance of 100 km is equivalent to 62 miles
–$x_{g} = 28.35\,x_{oz}$
–$x_{celsius} = \frac{5}{9}(x_{fahr} - 32) = -\frac{160}{9} + \frac{5}{9}x_{fahr}$
•Linear transformations do not change the shape of a distribution.
•They do, however, change the center and the spread, e.g. weights of newly hatched pythons (Example 1.21):

Python         1      2      3      4      5
Weight (oz)    1.13   1.02   1.23   1.06   1.16
Weight (g)     32     29     35     30     33

•python.oz=c(1.13, 1.02,1.23,1.06,1.16)
•python.g=28.35*python.oz
•mean(python.oz)
•mean(python.g)
•sd(python.oz)
•sd(python.g)
•You could of course
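The effect of a linear transformation on the center and spread can be checked numerically. The following is a minimal sketch in R, not part of the original slides: it reuses the python weights from Example 1.21 and adds a small set of hypothetical Fahrenheit readings (an assumption for illustration) to show that the shift a moves the mean but leaves the standard deviation alone.

```r
# Python weights from Example 1.21, in ounces (values from the slide)
python.oz <- c(1.13, 1.02, 1.23, 1.06, 1.16)

# Linear transformation x_new = a + b * x_old with a = 0, b = 28.35 (oz -> g)
a <- 0
b <- 28.35
python.g <- a + b * python.oz

# The mean transforms like the data: mean(x_new) = a + b * mean(x_old)
mean.matches <- isTRUE(all.equal(mean(python.g), a + b * mean(python.oz)))

# The standard deviation is only rescaled: sd(x_new) = |b| * sd(x_old)
sd.matches <- isTRUE(all.equal(sd(python.g), abs(b) * sd(python.oz)))

# Hypothetical Fahrenheit readings (not from the lecture), to show a != 0
temp.fahr <- c(50, 68, 86)
temp.celsius <- -160/9 + (5/9) * temp.fahr   # roughly 10, 20, 30 degrees C
shift.ignored <- isTRUE(all.equal(sd(temp.celsius), (5/9) * sd(temp.fahr)))

print(c(mean.matches, sd.matches, shift.ignored))
```

All three checks should print TRUE. The comparisons use all.equal rather than == because the products are floating-point numbers and may differ in the last digits.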
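The 1.5*IQR criterion for suspected outliers can also be applied in R. This is a sketch under stated assumptions: suspected_outliers is a hypothetical helper name (not a base R function), it relies on R's default quantile method, which can differ slightly from the "median of each half" definition of the quartiles, and the small test vector is made up for illustration. The fences for highway mileage use the Q1 and Q3 values reported by summary(table1.08).

```r
# Return the observations flagged by the 1.5 * IQR rule.
# suspected_outliers is a hypothetical helper name, not a base R function.
suspected_outliers <- function(x) {
  q <- unname(quantile(x, c(0.25, 0.75)))  # Q1 and Q3 (R's default method)
  iqr <- q[2] - q[1]
  lower <- q[1] - 1.5 * iqr
  upper <- q[2] + 1.5 * iqr
  x[x < lower | x > upper]
}

# Fences for highway mileage, using Q1 and Q3 from summary(table1.08)
q1 <- 22.25
q3 <- 28.00
fences <- c(q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1))
print(fences)           # 13.625 36.625
print(68 > fences[2])   # TRUE: the maximum (68 mpg) is a suspected outlier
print(13 < fences[1])   # TRUE: the minimum (13 mpg) also falls below the lower fence

# Hypothetical data (not from the lecture) to exercise the helper
print(suspected_outliers(c(1, 2, 3, 4, 5, 100)))  # 100
```

With the full data loaded, suspected_outliers(table1.08$Highway) would list the flagged cars directly; boxplot() applies the same 1.5*IQR rule by default when deciding which points to draw individually.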