The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
STAT 155 Introductory Statistics
Lecture 10: Cautions about Regression and Correlation; Causation
10/03/06, Lecture 10

Review
- Least Squares Regression Lines
  - Equation and interpretation of the line
  - Prediction using the line
- Correlation and Regression
- Coefficient of Determination

Regression Diagnostics
- Look at residuals (errors).
- A residual is the difference between an observed value of the response variable and the value predicted by the regression line, i.e., residual = y - ŷ.
- The sum of the least-squares residuals is always zero. Why?

Residual Plots
- A residual plot is a scatterplot of the regression residuals against the explanatory variable.
- Residual plots help us assess the fit of a regression line.

Age vs. Height

Residual Plot
- If the regression line captures the overall pattern of the data, there should be no pattern in the residuals (totally random).
- [Figure panels: nonlinear; nonconstant variation]

Diabetes Patients: FPG vs. HbA
- FPG: fasting plasma glucose
- HbA: percent of red blood cells that have a glucose molecule attached
- Both measure blood glucose, so we expect a positive association.
- 18 subjects, r = 0.4819
- See the scatterplot on the next page.

Diabetes Patients: FPG vs. HbA [scatterplot]

Outliers and Influential Observations
- An outlier is a point that lies outside the overall pattern of the other points.
- Outliers in the y direction have large residuals, but other outliers may not.
- An influential observation is a point whose removal would significantly change the regression line.
- Outliers in the x direction are often influential points, but not always.

Diabetes Patients: FPG vs. HbA [figure]

Outliers and Influential Observations
- Outliers in the y direction can be spotted from the residual plot.
- Influential points can be identified by fitting regression lines with and without those
  points.
  - More serious
  - Cannot be identified via the residual plot
  - The scatterplot gives us some hint

Cautions about Correlation and Regression
- Linear only
- DO NOT extrapolate
- Not resistant
- Beware lurking variables
- Beware correlations based on averaged data
- The restricted range problem

Lurking Variable
- A lurking (hidden) variable is a variable that has an important effect on the relationship among the variables in a study but is not included among the variables being studied.
- Example: SAT scores and college grades. Lurking variable: IQ

Lurking variables can create nonsense correlations
- For the world's nations, let x be the number of TVs per person and y be the average life expectancy.
- A high positive correlation: nations with more TV sets have higher life expectancies.
- Could we lengthen the lives of people in Rwanda by shipping them more TVs?
- Lurking variable: wealth of the nation
  - Rich nations: more TV sets
  - Rich nations: longer life expectancies because of better nutrition, clean water, and better health care
- There is no cause-and-effect tie between TV sets and length of life.
- Association vs. causation

Misleading correlation: two clusters [figure]

Beware correlations based on averaged data
- A correlation based on averages over many individuals is usually higher than the correlation between the same variables based on data for individuals.
- Examples: Age vs. Height; Basketball score vs. practice time

The restricted range problem
- A restricted range problem occurs when one does not get to observe the full range of the variables.
- When data suffer from restricted range, r and r² are lower than they would be if the full range could be observed.
- SAT scores vs. College GPA: Princeton vs. Generic State College (Ex. 2.22)

Causation vs. Association
- Some studies want to find the existence of causation.
- Examples of causation:
  - Increased drinking of alcohol causes a decrease in coordination
  - Smoking and lung cancer
- Examples of association:
  - The above two examples
  - SAT scores and freshman-year GPA

Association does not imply causation
- An association between two variables x and y can reflect many types of relationship among x, y, and one or more lurking variables.
- An association between a predictor x and a response y, even if it is very strong, is not by itself good evidence that changes in x actually cause changes in y.

Explaining Association

Explaining Association: Causation
- Cause and effect
- Examples:
  - Amount of fertilizer and yield of corn
  - Weight of a car and its MPG
  - Dosage of a drug and the survival rate of the mice

Explaining Association: Common Response
- Lurking variables
- Both x and y change in response to changes in z, the lurking variable.
- There may not be a direct causal link between x and y.
- Examples:
  - SAT scores vs. College GPA (IQ, attitude)
  - Monthly flow of money into stock mutual funds vs. rate of return for the stock market (market condition, investor attitude)

Explaining Association: Confounding
- Two variables are confounded when their effects on a response variable are mixed together.
- One explanatory variable may be confounded with other explanatory variables or lurking variables.
- Examples:
  - More education leads to higher income (family background)
  - Religious people live longer (lifestyle)

Establishing causation
- The only compelling method: a designed experiment (more in Chapter 3)
- Hot disputes:
  - Does gun control reduce violent crime?
  - Does meat consumption in your diet cause heart disease?
  - Does smoking cause lung cancer?

Does smoking CAUSE lung cancer?
- Causation: smoking causes lung cancer.
- Common response: people who have a genetic predisposition to lung cancer also have a genetic predisposition to smoking.
- Confounding: people who drink too much, don't exercise, eat unhealthy foods, etc. are more likely to get lung cancer as a result of
  their lifestyle. Such people may be more likely to be smokers as well.

Some guidelines when a designed experiment is impossible
- Strong association
- Association consistent across various studies
- Higher dose associated with stronger responses
- The cause precedes the effect in time
- Plausibility

Take Home Message
- Residual Plots
- Outliers and Influential Observations
- Lurking Variables
- Cautions about Correlation and Regression
- Explaining associations: Causation, Common
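
The Regression Diagnostics slide asks why the least-squares residuals always sum to zero. One way to see it: the least-squares intercept is a = ȳ - b·x̄, so the sum of (y - a - b·x) over all points equals n·ȳ - n·a - n·b·x̄ = 0. The sketch below checks this numerically and also illustrates the "fit with and without the point" test for an influential observation. The age and height values and the extra point are made-up numbers chosen for illustration; they are not the lecture's data, and the function names are hypothetical.

```python
# Illustrative sketch only: all data values below are hypothetical.


def fit_line(x, y):
    """Return (a, b) for the least-squares line y_hat = a + b * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx            # slope
    a = y_bar - b * x_bar    # intercept
    return a, b


def residuals(x, y, a, b):
    """Observed y minus predicted y_hat for each point."""
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]


# Hypothetical ages and heights.
ages = [18, 19, 20, 21, 22, 23, 24]
heights = [76.1, 77.0, 78.1, 78.2, 78.8, 79.7, 79.9]

a, b = fit_line(ages, heights)
res = residuals(ages, heights, a, b)
residual_sum = sum(res)
print("slope without the extra point:", round(b, 3))
print("sum of residuals (zero up to rounding):", residual_sum)

# Add one hypothetical point that is an outlier in the x direction, then refit.
ages_out = ages + [40]
heights_out = heights + [70.0]
a_out, b_out = fit_line(ages_out, heights_out)
print("slope with the extra point:", round(b_out, 3))
```

Plotting `res` against `ages` would give the residual plot described in the lecture. Comparing `b` with `b_out` shows how a single point far out in x can change the fitted line, which a residual plot alone may not reveal.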
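
Two of the cautions above, the restricted range problem and correlations based on averaged data, can be illustrated with a small simulation. This is a sketch on synthetic data, not a reproduction of any example from the lecture: the model (y equals x plus independent noise), the sample size, the cutoff of 1.0, and the group size of 100 are all assumptions chosen for illustration.

```python
import random


def pearson_r(x, y):
    """Sample correlation coefficient r between x and y."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / (sxx * syy) ** 0.5


# Synthetic individuals: y = x + noise (an assumed model, seeded for repeatability).
rng = random.Random(0)
n = 2000
x = [rng.gauss(0, 1) for _ in range(n)]
y = [xi + rng.gauss(0, 1) for xi in x]

r_full = pearson_r(x, y)

# Restricted range: observe only individuals with x above an arbitrary cutoff.
kept = [(xi, yi) for xi, yi in zip(x, y) if xi > 1.0]
r_restricted = pearson_r([p[0] for p in kept], [p[1] for p in kept])

# Averaged data: sort by x, form groups of 100, and correlate the group means.
pairs = sorted(zip(x, y))
group_size = 100
x_means, y_means = [], []
for start in range(0, n, group_size):
    group = pairs[start:start + group_size]
    x_means.append(sum(p[0] for p in group) / len(group))
    y_means.append(sum(p[1] for p in group) / len(group))
r_averaged = pearson_r(x_means, y_means)

print("r, all individuals: ", round(r_full, 3))
print("r, restricted range:", round(r_restricted, 3))
print("r, group averages:  ", round(r_averaged, 3))
```

In this setup the restricted-range correlation comes out lower than the full-range one, and the correlation of group averages comes out higher, matching the direction of both cautions. The exact values depend on the seed and the assumed model.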