The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
STOR 155 Introductory Statistics
Lecture 9: Cautions about Regression and Correlation, Causation
5/20/11, Lecture 9

Review
- Least-Squares Regression Lines
  - Equation and interpretation of the line
  - Prediction using the line
- Correlation and Regression
- Coefficient of Determination

Regression Diagnostics
- Look at residuals (errors).
- A residual is the difference between an observed value of the response variable and the value predicted by the regression line, i.e., residual = $y - \hat{y}$.
- The sum of the least-squares residuals is always zero. Why?

Residual Plots
- A residual plot is a scatterplot of the regression residuals against the explanatory variable.
- Residual plots help us assess the fit of a regression line.

Age vs. Height
[Figure]

Residual Plot
- If the regression line catches the overall pattern of the data, there should be no pattern in the residuals (totally random).
[Figure panels: nonlinear; nonconstant variation]

Diabetes Patient: FPG vs. HbA
- FPG: fasting plasma glucose
- HbA: percent of red blood cells that have a glucose molecule attached
- Both are measuring blood glucose, so we expect a positive association.
- 18 subjects, r = 0.4819
- See the scatterplot on the next page.

Diabetes Patient: FPG vs. HbA
[Figure: scatterplot]

Outliers and Influential Observations
- An outlier is a point that lies outside the overall pattern of the other points.
  - Outliers in the y direction have large residuals, but other outliers may not.
- An influential observation is a point whose removal would markedly change the regression line.
  - Outliers in the x direction are often influential points, but not always.

Diabetes Patient: FPG vs. HbA
[Figure]

Outliers & Influential Observations
- Outliers in the y direction can be spotted from the residual plot.
- Influential points can be identified by fitting regression lines with and without those points.
  - More serious
  - Cannot be identified via the residual plot
  - The scatterplot gives us some hint

Cautions about correlation and regression
- Linear only
- DO NOT extrapolate (too much)
- Not resistant
- Beware lurking variables
- Beware correlations based on averaged data
- The restricted range problem

Lurking Variable
- A lurking (hidden) variable is a variable that has an important effect on the relationship among the variables in a study but is not included among the variables being studied.
- Example: SAT scores and college grades
  - Lurking variable: IQ

Lurking variables can create nonsense correlations
- For the world's nations, let x be the number of TVs per person and y be the average life expectancy.
- A high positive correlation: nations with more TV sets have higher life expectancies.
- Could we lengthen the lives of people in Rwanda by shipping them more TVs?
- Lurking variable: wealth of the nation
  - Rich nations: more TV sets
  - Rich nations: longer life expectancies because of better nutrition, clean water, and better health care
- There is no cause-and-effect tie between TV sets and length of life.
- Association vs. Causation

Misleading correlation: two clusters
[Figure]

Beware correlations based on averaged data
- A correlation based on averages over many individuals is usually higher than the correlation between the same variables based on data for individuals.
  - Age vs. Height
  - Basketball score vs. practice time

The restricted range problem
- A restricted range problem occurs when one does not get to observe the full range of the variables.
- When data suffer from restricted range, $r$ and $r^2$ are lower than they would be if the full range could be observed.
- SAT scores vs. College GPA
  - Princeton vs. Generic State College (Ex. 2.26)

Causation vs. Association
- Some studies want to find the existence of causation.
- Examples of causation:
  - Increased drinking of alcohol causes a decrease in coordination
  - Smoking and lung cancer
- Examples of association:
  - The above two examples
  - SAT scores and freshman-year GPA

Association does not imply causation
- An association between two variables x and y can reflect many types of relationship among x, y, and one or more lurking variables.
- An association between a predictor x and a response y, even if it is very strong, is not by itself good evidence that changes in x actually cause changes in y.

Explaining Association
[Figure]

Explaining Association: Causation
- Cause and effect
- Examples:
  - Amount of fertilizer and yield of corn
  - Weight of a car and its MPG
  - Dosage of a drug and the survival rate of the mice

Explaining Association: Common Response
- Lurking variables
- Both x and y change in response to changes in z, the lurking variable.
- There may not be a direct causal link between x and y.
- Examples:
  - SAT scores vs. College GPA (IQ, attitude)
  - Monthly flow of money into stock mutual funds vs. rate of return for the stock market (market condition, investor attitude)

Explaining Association: Confounding
- Two variables are confounded when their effects on a response variable are mixed together.
- One explanatory variable may be confounded with other explanatory variables or lurking variables.
- Examples:
  - More education leads to higher income (family background)
  - Religious people live longer (lifestyle)

Establishing causation
- The only compelling method: a designed experiment (in Chapter 3)
- Hot disputes:
  - Does gun control reduce violent crime?
  - Does meat consumption in your diet cause heart disease?
  - Does smoking cause lung cancer?

Does smoking CAUSE lung cancer?
- Causation: smoking causes lung cancer.
- Common response: people who have a genetic predisposition to lung cancer also have a genetic predisposition to smoking.
- Confounding: people who drink too much, don't exercise, eat unhealthy foods, etc., are more likely to get lung cancer as a result of their lifestyle. Such people may be more likely to be smokers as well.

Take Home Message
- Residual Plots
- Outliers and Influential Observations
- Lurking Variables
- Cautions about Correlation and Regression
- Explaining associations
  - Causation
  - Common response
  - Confounding
- How to establish causation
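
Worked sketch: residuals and an influential point

The residual and influential-observation ideas from this lecture can be checked numerically. The sketch below is illustrative only: the data are made up (they are not the diabetes data set), and the helper names `fit_line` and `residuals` are hypothetical. It fits a least-squares line, confirms that the residuals sum to zero (a consequence of choosing the intercept as mean(y) minus slope times mean(x)), and then refits after adding one point far out in the x direction.

```python
# Illustrative sketch with made-up data; helper names are hypothetical.

def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar  # this choice forces sum(residuals) = 0
    return intercept, slope


def residuals(xs, ys, intercept, slope):
    """Residual = observed y minus predicted y, one value per data point."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


# Hypothetical data with a clear linear pattern.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]

a, b = fit_line(xs, ys)
res = residuals(xs, ys, a, b)
residual_sum = sum(res)  # zero up to floating-point rounding

# Add one point that is an outlier in the x direction and off the pattern.
xs_out = xs + [20.0]
ys_out = ys + [5.0]
a_out, b_out = fit_line(xs_out, ys_out)
res_out = residuals(xs_out, ys_out, a_out, b_out)

print(f"slope without the extra point: {b:.3f}")      # about 1.99
print(f"slope with the extra point:    {b_out:.3f}")  # about 0.03
print(f"sum of residuals:              {residual_sum:.2e}")
print(f"residual of the extra point:   {res_out[-1]:.3f}")
print(f"largest other |residual|:      {max(abs(r) for r in res_out[:-1]):.3f}")
```

In this made-up example the added point pulls the line toward itself, so its own residual is smaller in size than some of the others. That is why an influential point may not stand out in a residual plot, while comparing the fits with and without it shows its effect clearly.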