STEVENS MA 331 - Lecture 2 Describing Data With Graphs and Numbers


Slide titles:
•Lecture 2
•Describing distributions with numbers
•Mean
•Median
•Measures of spread: Quartiles
•Using R:
•Five-Number Summary
•R code:
•The criterion for suspected outliers
•Standard deviation
•Properties of the standard deviation
•Linear Transformations: changing units of measurements
•Effect of a linear transformation
•The normal distribution
•Finding probabilities for normal data
•Normal quantile plots R- qqnorm()
•Newcomb’s data without outliers.
•Looking at Data-Relationships
•Scatterplots
•Example 1: Mean height of a group of children in Kalama, Egypt, plotted against age from 18 to 29 months.
•Example 2: State mean SAT math score plotted against the percent of HS seniors taking the exam
•R Graphical system
•Example 2: Adding categorical variable/grouping (region): e is for northeastern states and m is for midwestern states (others excluded). May enhance understanding of the data.
•Correlation Coefficient
•Correlation
•Formula:

Lecture 2
Describing data with graphs and numbers. Normal Distribution. Data relationships.

Describing distributions with numbers
•Mean
•Median
•Quartiles
•Boxplots
•Standard deviation

Mean
•The mean
•The arithmetic mean of a data set (average value)
•Denoted by \( \bar{x} \):
  \[ \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i \]
•Mean highway mileage for 19 2-seaters:
  Sum: 24+30+…+30=490
  Divide by n=19
  Average: 25.8 miles/gallon
  Problem: Honda Insight 68 miles/gallon!
  If we exclude it, mean mileage: 23.4 miles/gallon
•Mean can be easily influenced by outliers.
  It is not a robust measure of center.

Median
•Median is the midpoint of a distribution.
•Median is a resistant or robust measure of center.
•Not sensitive to extreme observations
•In a symmetric distribution mean=median
•In a skewed distribution the mean is further out in the long tail than is the median.
•Example: house prices: usually right skewed
  –The mean price of existing houses sold in 2000 in Indiana was 176,200. (Mean chases the right tail)
  –The median price of these houses was 139,000.

Measures of spread: Quartiles
•Quartiles divide the data into four parts
•p-th percentile – p percent of the observations fall at or below it.
•Median – 50-th percentile
•Q1-first quartile – 25-th percentile (median of the lower half of data)
•Q3-third quartile – 75-th percentile (median of the upper half of data)

Using R:
•First things first: import the data. I prefer to use Excel first to save the data into a .csv file (comma separated values).
•Read the file TA01_008.XLS from the CD and save it as TA01_008.csv
•Now R: I like to use tinn-R as the editor. Open tinn-R and save a file in the same directory where you put the .csv file.
•Now go to R/Rgui/ and click Initiate preferred. If everything is configured correctly, an R window should open.
•Now type and send line to R:
•table1.08=read.csv("TA01_008.csv",header=TRUE)
  –This will import the data into R, also telling R that the first line in the data contains the variable names.
  –table1.08 has a “table” structure. To access individual components in it you have to use table1.08$nameofvariable, for example:
•table1.08$CarType
  –Produces:
    [1] Two Two Two Two Two Two Two Two Two Two Two Two Two Two Two
    [16] Two Two Two Two Mini Mini Mini Mini Mini Mini Mini Mini Mini Mini Mini
    Levels: Mini Two
  –This is a vector, and notice that R knows it is a categorical variable.
•mean(x) calculates the mean of variable x
•median(x) will give the median
•In fact you should read section 3.1 in the R textbook for all the functions you will need
•summary(data.object) is another useful function.
  In fact:
•summary(table1.08)
    CarType        City           Highway
    Mini:11   Min.   : 8.00   Min.   :13.00
    Two :19   1st Qu.:16.00   1st Qu.:22.25
              Median :18.00   Median :25.50
              Mean   :18.90   Mean   :25.80
              3rd Qu.:20.75   3rd Qu.:28.00
              Max.   :61.00   Max.   :68.00
•Lastly, if you wish to apply a function separately to each part of the data frame (for example, only to the part that contains Mini cars):
•tapply(table1.08$City,table1.08$CarType,mean)
        Mini      Two
    18.36364 19.21053
•The tapply call takes the table1.08$City variable, splits it according to the levels of the table1.08$CarType variable, and calculates the function mean for each group.
•In the same way you can try:
•tapply(table1.08$City,table1.08$CarType,summary)

Five-Number Summary
•Minimum Q1 Median Q3 Maximum
•Boxplot – visual representation of the five-number summary.
  –Central box: Q1 to Q3.
  –Line inside box: Median
  –Extended straight lines: lowest to highest observation, except outliers
  –Outliers marked as circles or stars.
•To make boxplots in R use the function
•boxplot(x)

R code:
•boxplot(table1.08$City)
•boxplot(table1.08$Highway)
•boxplot(table1.08$City~table1.08$CarType)
•boxplot(table1.08$Highway~table1.08$CarType)
•par(mfrow=c(1,2))
•boxplot(table1.08$City~table1.08$CarType)
•boxplot(table1.08$Highway~table1.08$CarType)
•par(mfrow=c(1,1))

The criterion for suspected outliers
•The interquartile range – IQR=Q3-Q1
•An observation is a suspected outlier if it falls more than 1.5*IQR above the third quartile or below the first quartile.

Standard deviation
•Deviation: \( x_i - \bar{x} \)
•Variance:
  \[ s^2 = \frac{(x_1-\bar{x})^2 + (x_2-\bar{x})^2 + \cdots + (x_n-\bar{x})^2}{n-1} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2 \]
•Standard deviation:
  \[ s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2} \]

Properties of the standard deviation
•Standard deviation is always non-negative
•s=0 when there is no spread
•s is not resistant to the presence of outliers
•The five-number summary usually better describes a skewed distribution or a distribution with outliers.
•Mean and standard deviation are usually used for reasonably symmetric distributions without outliers.

Linear Transformations: changing units of measurements
•\( x_{new} = a + b\,x_{old} \)
•Common conversions
•\( x_{miles} = 0.62\,x_{km} \)   Distance=100km is equivalent to 62 miles
•\( x_{g} = 28.35\,x_{oz} \),   \( x_{celsius} = \frac{5}{9}(x_{fahr} - 32) = -\frac{160}{9} + \frac{5}{9}\,x_{fahr} \)
•Linear transformations do not change the shape of a distribution.
•They do, however, change the center and the spread, e.g. weights of newly hatched pythons (Example 1.21)

    Python         1     2     3     4     5
    Weight (oz)    1.13  1.02  1.23  1.06  1.16
    Weight (g)     32    29    35    30    33

•python.oz=c(1.13, 1.02,1.23,1.06,1.16)
•python.g=28.35*python.oz
•mean(python.oz)
•mean(python.g)
•sd(python.oz)
•sd(python.g)
•You could of course
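The python weights above can also be used to check how a linear transformation x_new = a + b*x_old acts on the center and the spread. The sketch below is only an illustration, not part of the original slides: it assumes the standard facts that the mean becomes a + b*mean(x_old) and the standard deviation becomes |b|*sd(x_old). The names a, b, temp.fahr, temp.celsius and the three check variables are hypothetical, and the Fahrenheit readings are made-up values.

```r
# Python hatchling weights in ounces, copied from the slide (Example 1.21)
python.oz <- c(1.13, 1.02, 1.23, 1.06, 1.16)

# Linear transformation x_new = a + b * x_old; ounces to grams uses a = 0, b = 28.35
a <- 0
b <- 28.35
python.g <- a + b * python.oz

# The mean is transformed the same way as the data: mean(x_new) = a + b * mean(x_old)
mean.matches <- isTRUE(all.equal(mean(python.g), a + b * mean(python.oz)))

# The standard deviation is only rescaled: sd(x_new) = |b| * sd(x_old)
sd.matches <- isTRUE(all.equal(sd(python.g), abs(b) * sd(python.oz)))

# A shift (a != 0) moves the center but does not change the spread.
# These Fahrenheit readings are made-up values, used only for illustration.
temp.fahr <- c(50, 59, 68, 77, 86)
temp.celsius <- -160/9 + (5/9) * temp.fahr   # same as (5/9) * (temp.fahr - 32)
shift.ignored <- isTRUE(all.equal(sd(temp.celsius), (5/9) * sd(temp.fahr)))

print(c(mean.matches, sd.matches, shift.ignored))
```

Each check compares with all.equal() rather than ==, because floating-point arithmetic can leave the two sides differing in the last digits even when they agree mathematically.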

