UW-Madison STAT 371 - Regression Lecture Notes

Regression
Bret Larget
Departments of Botany and of Statistics
University of Wisconsin-Madison
Statistics 371
6th December 2005

Topics: Correlation, Regression, Riley, Frogs, Amazon Trees, FEV

Correlation

The correlation coefficient r is a measure of the strength of the linear relationship between two variables.

    r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)
      = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}

Notice that the correlation is not affected by linear transformations of the data (such as changing the scale of measurement).

Correlation Plots

[Figure: eight scatterplots of y against x illustrating correlations of r = 0.97, 0.21, -0.21, 1, -1, 0, 0.97, and -0.97.]

Summary of Correlation

- The correlation coefficient r measures the strength of the linear relationship between two quantitative variables, on a scale from -1 to 1.
- The correlation coefficient is -1 or 1 only when the data lie perfectly on a line with negative or positive slope, respectively.
- If the correlation coefficient is near one, the data are tightly clustered around a line with a positive slope.
- Correlation coefficients near 0 indicate weak linear relationships.
- However, r does not measure the strength of nonlinear relationships.
- If r = 0, rather than X and Y being unrelated, it can be the case that they have a strong nonlinear relationship.
- If |r| is close to 1, it may still be the case that a nonlinear relationship is a better description of the data than a linear relationship.
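The formula for r can be checked directly in R. The short sketch below is not part of the original notes; it uses made-up vectors x and y to compute r from the definition, compares the result with the built-in cor() function, and confirms that a linear change of scale leaves r unchanged.

# Illustrative data (not from the notes)
set.seed(1)
x = rnorm(20)
y = 0.8 * x + rnorm(20, sd = 0.5)

# r computed directly from the definition above
n = length(x)
r.byhand = sum(((x - mean(x)) / sd(x)) * ((y - mean(y)) / sd(y))) / (n - 1)
print(c(r.byhand, cor(x, y)))   # the two values should agree

# A linear change of scale (e.g., inches to centimeters) does not change r
print(cor(2.54 * x + 10, y))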
Simple Linear Regression

- Simple linear regression is the statistical procedure for describing the relationship between a quantitative explanatory variable X and a quantitative response variable Y with a straight line.
- In simple linear regression, the regression line is the line that minimizes the sum of the squared residuals.

Riley

- Riley Larget is my son.
- Below is a plot of his height versus his age, from birth to 8 years.

[Figure: Riley's height (inches) versus age (months), from birth to 96 months.]

- The plot indicates that it is not reasonable to model the relationship between age and height as linear over the entire age range, but it is fairly linear from age 2 years to 8 years (24-96 months).

Riley

[Figure: Riley's height (inches) versus age (months), restricted to ages 24-96 months.]

Finding a "best" linear fit

- Any line we can use to predict Y from X will have the form Y = b_0 + b_1 X, where b_0 is the intercept and b_1 is the slope.
- The value \hat{y} = b_0 + b_1 x is the predicted value of Y if the explanatory variable X = x.
- In simple linear regression, the predicted values form a line. (In more advanced forms of regression, we can fit curves or fit functions of multiple explanatory variables.)

Finding a "best" linear fit

- For each data point (x_i, y_i), the residual is the difference between the observed value and the predicted value, y_i - \hat{y}_i.
- Graphically, each residual is the positive or negative vertical distance from the point to the line.
- Simple linear regression identifies the line that minimizes the residual sum of squares,

    \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

Least Squares Regression

We won't derive them, but there are simple formulas for the slope and intercept of the least squares line as functions of the sample means, standard deviations, and the correlation coefficient.

    Y = b_0 + b_1 X
    b_1 = r \times \frac{s_y}{s_x}
    b_0 = \bar{y} - b_1 \bar{x}

A Special Case

Consider the predicted value at an observation X = \bar{x}.

    \hat{y} = b_0 + b_1 \bar{x}
            = (\bar{y} - b_1 \bar{x}) + b_1 \bar{x}
            = \bar{y}

So, the regression line always goes through the point (\bar{x}, \bar{y}).

The General Case

- Let X = \bar{x} + z s_x, so X is z standard deviations above the mean.

    \hat{y} = b_0 + b_1 (\bar{x} + z s_x)
            = (\bar{y} - b_1 \bar{x}) + b_1 \bar{x} + b_1 z s_x
            = \bar{y} + \left( r \frac{s_y}{s_x} \right) z s_x
            = \bar{y} + (r z) \times s_y

- Notice that if X is z SDs above the mean, we predict Y to be only rz SDs above the mean.
- In the typical situation, |r| < 1, so we predict the value of Y to be closer to the mean (in standard units) than X is.
- This is called the regression effect.

Riley (cont.)

> n = length(age2)
> mx = mean(age2)
> sx = sd(age2)
> my = mean(height2)
> sy = sd(height2)
> r = cor(age2, height2)
> print(c(mx, sx, my, sy, r, n))
[1] 61.5625000 21.8661649 45.6718750  5.4829043  0.9990835 16.0000000
> b1 = r * sy/sx
> b0 = my - b1 * mx
> print(c(b0, b1))
[1] 30.2493290  0.2505185

(Riley's predicted height in inches) = 30.25 + 0.25 x (Riley's age in months)

Riley: Plot of Data

[Figure: Riley's height (inches) versus age (months) for ages 24-96 months.]

Riley: A Residual Plot

[Figure: residuals(fit2) plotted against fitted(fit2).]

Riley: Interpretation

- We can interpret the slope to mean that from age 2 to 8 years, Riley grew an average of about 0.25 inches per month, or about 3 inches per year.
- The intercept is the predicted value when X = 0, that is, Riley's height (length) at birth.
- This interpretation may not be reasonable if 0 is outside the range of the data.

Riley: Extrapolation

[Figure: height (inches) versus age (months) over a wider range of ages, showing the fitted line extended beyond the observed data.]

- Predicted height at age 15 years (180 months): \hat{y} = 30.25 + 0.25 \times 180 \approx 75.3 inches.

Residual Standard Deviation

- The residual standard deviation, s_{Y|X}, is a measure of the typical size of a residual.
- Its formula is

    s_{Y|X} = \sqrt{\frac{SS(\text{resid})}{n-2}}

- Notice that in simple linear regression there are n - 2 degrees of freedom, in contrast to the n - 1 in our formula for a sample standard deviation.
- The reason is that our model for the mean uses two parameters: it takes two points to determine a line.

Frogs

- In a study of oocytes (developing egg cells) from the frog Xenopus laevis, a biologist injects individual oocytes from the same female with radioactive leucine in order to measure the amount of leucine incorporated into protein as a function of time. (See Exercise 12.3 on page 536.)

R: Entering the data

Here is some R code to create a data frame frog with the time and leucine variables.

> time = c(0, 10, 20, 30, 40, 50, 60)
> leucine = c(0.02, 0.25, 0.54, 0.69, 1.07, 1.5, 1.74)
> frog = data.frame(time, leucine)
> frog
  time leucine
1    0    0.02
2   10    0.25
3   20    0.54
4   30    0.69
5   40    1.07
6   50    1.50
7   60    1.74

R: Plotting the data

> plot(time, leucine)

[Figure: scatterplot of leucine versus time.]

R: Fitting a model

> fit = lm(leucine ~ time, data = frog)
> summary(fit)

Call:
lm(formula = leucine ~ time, data = frog)

Residuals:
      1       2       3       4       5       6       7
 0.0675  0.0050  0.0025 -0.1400 -0.0525  0.0850  0.0325

Coefficients:
            Estimate Std. Error t value
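The residual standard deviation defined earlier can be computed directly from the fitted frog model and compared with the value that summary() reports. The sketch below is not part of the original notes; it assumes the frog data frame and the fitted model fit created above.

# s_{Y|X} computed from the definition: sqrt(SS(resid) / (n - 2))
res = residuals(fit)
ss.resid = sum(res^2)
n = length(res)
s.yx = sqrt(ss.resid / (n - 2))
print(s.yx)

# summary(fit) reports the same quantity as the "Residual standard error"
print(summary(fit)$sigma)

# A predicted value of leucine at a time inside the observed range, say time = 35
print(predict(fit, newdata = data.frame(time = 35)))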

