2/10/11 Lecture 9
STOR 155 Introductory Statistics
Lecture 9: Cautions about Regression and Correlation, Causation
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL

Review
• Least-Squares Regression Lines
• Equation and interpretation of the line
• Prediction using the line
• Correlation and Regression
• Coefficient of Determination

Regression Diagnostics
• Look at residuals (errors):
  – A residual is the difference between an observed value of the response variable and the value predicted by the regression line, i.e., $\text{residual} = y - \hat{y}$.
  – The sum of the least-squares residuals is always zero. Why?

Residual Plots
• A residual plot is a scatterplot of the regression residuals against the explanatory variable.
• Residual plots help us assess the fit of a regression line.

Age vs Height
[Figure]

Residual Plot
• If the regression line catches the overall pattern of the data, there should be no pattern in the residuals.
[Figure, labeled "totally random"]

[Figure panels, labeled "nonlinear" and "nonconstant variation"]

Diabetes Patient: FPG vs HbA
• FPG: fasting plasma glucose.
• HbA: percent of red blood cells that have a glucose molecule attached.
• Both measure blood glucose.
• We expect a positive association.
• 18 subjects, r = 0.4819.
• See the scatterplot on the next page.

Diabetes Patient: FPG vs HbA
[Figure]

Outliers and Influential Observations
• An outlier is a point that lies outside the overall pattern of the other points.
  – Outliers in the y direction have large residuals, but other outliers may not.
• An influential observation is a point whose removal would markedly change the regression line.
  – Outliers in the x direction are often influential points.
  – But not always…

Diabetes Patient: FPG vs HbA
[Figure]

Outliers & Influential Obs.
• Outliers in the y direction can be spotted from the residual plot.
• Influential points are more serious. They can be identified by fitting regression lines with and without those points.
  – They cannot be identified from the residual plot.
  – The scatterplot gives some hint.

Cautions about correlation and regression
• Linear only
• DO NOT extrapolate
• Not resistant
• Beware lurking variables
• Beware correlations based on averaged data
• The restricted-range problem

Lurking Variable
• A lurking (hidden) variable is a variable that has an important effect on the relationship among the variables in a study, but is not included among the variables being studied.
• Examples:
  – SAT scores and college grades
    • Lurking variable: IQ

Lurking variables can create nonsense correlations.
• For the world’s nations, let x be the number of TVs per person and y be the average life expectancy.
• There is a high positive correlation: nations with more TV sets have higher life expectancies.
  – Could we lengthen the lives of people in Rwanda by shipping them more TVs?
• Lurking variable: wealth of the nation
  – Rich nations: more TV sets.
  – Rich nations: longer life expectancies because of better nutrition, clean water, and better health care.
• There is no cause-and-effect tie between TV sets and length of life.
• Association vs causation.

Misleading correlation (two clusters)
[Figure]

Beware correlations based on averaged data
• A correlation based on averages over many individuals is usually higher than the correlation between the same variables based on data for individuals.
• Age vs Height
• (Basketball) score % vs practice time

The restricted-range problem
• A restricted-range problem occurs when one does not get to observe the full range of the variables.
• When data suffer from restricted range, r and $r^2$ are lower than they would be if the full range could be observed.
• SAT scores vs College GPA
  – Princeton vs Generic State College (Ex 2.26)

Causation vs Association
• Some studies aim to establish causation.
• Example of causation:
  – Increased drinking of alcohol causes a decrease in coordination.
  – Smoking and Lung Cancer.
• Example of association:
  – The above two examples.
  – SAT scores and Freshman year GPA.

Association does not imply causation.
• An association between two variables x and y can reflect many types of relationship among x, y, and one or more lurking variables.
• An association between a predictor x and a response y, even if it is very strong, is not by itself good evidence that changes in x actually cause changes in y.

Explaining Association
[Figure]

Explaining Association: Causation
• Cause-and-effect
• Examples
  – Amount of fertilizer and yield of corn
  – Weight of a car and its MPG
  – Dosage of a drug and the survival rate of the mice

Explaining Association: Common Response
• Lurking variables
• Both x and y change in response to changes in z, the lurking variable.
• There may be no direct causal link between x and y.
• Examples:
  – SAT scores vs College GPA (IQ, Attitude)
  – Monthly flow of money into stock mutual funds vs rate of return for the stock market (Market Condition, Investor Attitude)

Explaining Association: Confounding
• Two variables are confounded when their effects on a response variable are mixed together.
• One explanatory variable may be confounded with other explanatory variables or lurking variables.
• Examples:
  – More education leads to higher income.
    • Family background…
  – Religious people live longer.
    • Life style…

Establishing causation
• The only compelling method: Designed experiment (More in Chapter 3)
• Hot disputes:
  – Does gun control reduce violent crime?
  – Does meat consumption in your diet cause heart disease?
  – Does smoking cause lung cancer?

Does smoking CAUSE lung cancer?
• causation: smoking causes lung cancer.
• common response: people who have a genetic predisposition to lung cancer also have a genetic predisposition to smoking.
• confounding: people who drink too much, don't exercise, eat unhealthy foods, etc. are more likely to get lung
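
The "common response" explanation from this lecture can be illustrated with a small simulation. The sketch below is not part of the lecture: the data are synthetic, and the names `pearson_r`, `residuals`, and `simulate_common_response` are hypothetical helpers written for this illustration. A lurking variable z drives both x and y, and x has no direct effect on y. The sketch computes the correlation between x and y, then the correlation that remains after the linear effect of z is removed from each.

```python
import random


def pearson_r(xs, ys):
    """Sample correlation coefficient r between two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def residuals(xs, ys):
    """Residuals y - y_hat from the least-squares line of ys on xs."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


def simulate_common_response(n=2000, seed=0):
    """Synthetic data (an assumption for illustration): z drives x and y."""
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]
    x = [zi + rng.gauss(0, 1) for zi in z]  # x depends on z only
    y = [zi + rng.gauss(0, 1) for zi in z]  # y depends on z only, not on x
    return z, x, y


z, x, y = simulate_common_response()

# Association between x and y, created entirely by the lurking variable z.
r_xy = pearson_r(x, y)

# Correlation left between x and y once the linear effect of z is removed.
r_adj = pearson_r(residuals(z, x), residuals(z, y))

print(f"r(x, y)              = {r_xy:.3f}")
print(f"r(x, y) adjusted for z = {r_adj:.3f}")
print(f"sum of residuals     = {sum(residuals(z, x)):.2e}")
```

Under these assumptions x and y are clearly correlated, yet the adjusted correlation should be close to zero, which is the common-response pattern. The last line also checks numerically that the least-squares residuals sum to zero, up to floating-point rounding.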