CHAPTER 6 Correlation and Linear Regression

Scatter plots. LOOK FOR:
- form (linear?)
- strength of association
- direction (positive or negative)
- outliers

Correlation coefficient: $r = \frac{\sum z_x z_y}{n - 1}$

Properties of r:
- $-1 \le r \le 1$
- r = 0: no LINEAR relationship
- r measures the strength of linear association
- $r = \pm 1$: the points are on a single line
- r is sensitive to outliers
- r doesn't change if you change the scale
- CORRELATION doesn't mean CAUSATION

Regression (a.k.a. line of averages):
- The regression line $\hat{y} = b_0 + b_1 x$ minimizes the sum of the squared residuals.
- It ALWAYS goes through $(\bar{x}, \bar{y})$.
- $b_0 = \bar{y} - b_1 \bar{x}$ and $b_1 = r \frac{s_y}{s_x}$
- CHECK: Linearity, Independence. Watch for outliers and equal spread.

Correlation VERSUS regression: the correlation is a measure of spread (scatter) in both the x and y directions in the linear relationship. In regression we examine the variation in the response variable (y) given a change in the explanatory variable (x).

Extrapolation is the use of a regression line for predictions outside the range of x values used to obtain the line. BAD.

$r^2$ represents the percentage of the variance in y (vertical scatter from the regression line) that can be explained by changes in x.
$\frac{\hat{y} - \bar{y}}{s_y} = r \, \frac{x - \bar{x}}{s_x}$, that is, $\hat{z}_y = r z_x$

Residuals: the distances from each point to the least-squares regression line give us potentially useful information about the contribution of individual data points to the overall pattern of scatter.
- Residuals ALWAYS sum to 0.
- The standard deviation of the residuals is $s_e = \sqrt{1 - r^2} \, SD(y)$ (exact when both standard deviations use the same divisor, approximate when $s_e$ divides by n - 2).
- Residual = Observed - Predicted = $y - \hat{y}$
- A negative value means the line overestimates; a positive value means it underestimates.

Residual plots:
- Randomly scattered: GOOD
- Curved pattern: relationship is NOT linear, BAD
- Fanning-out pattern: change in variability, BAD

Beware of lurking variables: variables not included in the study design that do have an effect on the variables studied.

CHAPTER 12 Comparing Two Groups

Comparing two means: two independent SRSs (simple random samples) coming from two distinct populations with $(\mu_1, \sigma_1)$ and $(\mu_2, \sigma_2)$. We use $\bar{x}_1$ and $\bar{x}_2$ to estimate the unknown $\mu_1$ and $\mu_2$.

When both populations are normal, the sampling distribution of $\bar{x}_1 - \bar{x}_2$ is normal with standard deviation $\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$.

The two-sample z statistic is $z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$ and is N(0, 1).

Testing Hypothesis for two means:
STEP 1: $H_0: \mu_1 - \mu_2 = d$ vs $H_a$, which is one of $\mu_1 - \mu_2 \ne d$, $\mu_1 - \mu_2 > d$, or $\mu_1 - \mu_2 < d$.
Check all assumptions: Independent Groups, Randomization, Nearly Normal Condition.
If $\sigma_1 \ne \sigma_2$ (unequal variance), then $t = \frac{(\bar{x}_1 - \bar{x}_2) - d}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$, where df = smallest of $(n_1 - 1, n_2 - 1)$.
If $\sigma_1 = \sigma_2$ (pooled variance), then $t = \frac{(\bar{x}_1 - \bar{x}_2) - d}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$ with $s_p^2 = \frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2}$, where df = $n_1 + n_2 - 2$.
Determine the p-value. Reject $H_0$ if p-value < $\alpha$, OR fail to reject $H_0$ if p-value $\ge \alpha$. We can also check the Rejection Region.
CI: $(\bar{x}_1 - \bar{x}_2) \pm t^* \cdot SE$, using the standard error and df of the matching case above.

Paired Samples:
Tests means of two related populations: paired or matched samples, repeated measures (before/after). Use the difference between paired values: $d_i = x_{1i} - x_{2i}$.
Assumptions: both populations are normally distributed, or the population of the differences is normally distributed.
$\bar{d} = \frac{1}{n} \sum_{i=1}^{n} d_i$ and $s_d = \sqrt{\frac{\sum_{i=1}^{n} (d_i - \bar{d})^2}{n - 1}}$
Confidence Interval: $\bar{d} \pm t^* \frac{s_d}{\sqrt{n}}$, where t has df = n - 1.
Test statistic: $t = \frac{\bar{d} - \mu_d}{s_d / \sqrt{n}}$, where $\mu_d$ is the mean difference hypothesized under $H_0$.

CHAPTER 13 Inference for Counts: Chi-Square Tests

Testing for Categorical Data: Testing a Probability Model
Data: count data for a categorical variable with r categories.
Model: a probability model assigns a probability to each category.
Aim: we want to know whether this model is appropriate for the data, i.e. is the model a good fit?
Assumptions: Counted data condition, Independence (only for goodness of fit), Randomization, and expected cell frequency (each expected count at least 5).
$\chi^2 = \sum_i \frac{(\text{observed}_i - \text{expected}_i)^2}{\text{expected}_i}$, degrees of freedom = r - 1 (r is the # of categories).

Goodness of Fit test:
$H_0: p_1 = \pi_1, p_2 = \pi_2, \dots, p_r = \pi_r$ vs $H_a$: at least one p is not equal to its corresponding $\pi$.
$\text{expected}_i = n \pi_i$
The larger the $\chi^2$, the more evidence against $H_0$.
We can understand how the distribution differs from the model by examining standardized residuals: $\frac{\text{observed} - \text{expected}}{\sqrt{\text{expected}}}$

Characteristics of the chi-squared distribution:
1. Not symmetric.
2. Shape depends on the degrees of freedom (like the t distribution).
3. As the df increase, the chi-squared becomes more symmetric.
4. Values are non-negative.
P-value = $P(X^2 \ge \chi^2_{\text{observed}})$, where $X^2 \sim \chi^2_{r-1}$.
Sometimes it is necessary to combine categories if the expected counts are smaller than 5.

Test for Independence:
$H_0$: variables are independent vs $H_a$: variables are dependent.
$\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}} \sim \chi^2_{(r-1)(c-1)}$, d.f. = (r - 1)(c - 1)
expected count = $\frac{\text{row total} \times \text{column total}}{n}$, where n = total in table.

Test for Homogeneity:
Suppose data has been collected from an experiment or survey using stratified sampling, so that each row (column) is a sample from a particular population and the marginal row (column) totals are fixed. In this situation the row (column) variable is an explanatory variable or factor, and the column (row) variable is a response.
$H_0$: the distribution of the response is the same for every level of the explanatory variable ($p_1 = p_2 = p_3 = \dots$).
$H_a$: the distribution of the response is not the same for every level of the explanatory variable (at least one p is different).
$\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}} \sim \chi^2_{(r-1)(c-1)}$, d.f. = (r - 1)(c - 1)
expected count = $\frac{\text{row total} \times \text{column total}}{n}$, where n = total in table.

Testing for Two Proportions:
$H_0: p_1 = p_2$ vs $H_a: p_1 \ne p_2$
We can do this using a homogeneity test OR the Z statistic:
$z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p}) \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}}$
where $\hat{p} = \frac{\text{total successes}}{\text{total observations}} = \frac{x_1 + x_2}{n_1 + n_2}$, $\hat{p}_1 = \frac{x_1}{n_1}$ and $\hat{p}_2 = \frac{x_2}{n_2}$.
Approximate C.I. for $p_1 - p_2$: $(\hat{p}_1 - \hat{p}_2) \pm z^* \sqrt{\frac{\hat{p}_1 (1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2 (1 - \hat{p}_2)}{n_2}}$

CHAPTER 14 Inference for Regression

The data in a scatterplot are a random sample from a population that may exhibit a linear relationship between x and y. A different sample gives a different plot.

Population mean response $\mu_y$ as a function of the explanatory variable x: $\mu_y = \beta_0 + \beta_1 x$
Model: $y = \beta_0 + \beta_1 x + \varepsilon$ with $\varepsilon \sim N(0, \sigma)$, which is also $y \sim N(\beta_0 + \beta_1 x, \sigma)$.
The estimate of this line is $\hat{y} = b_0 + b_1 x$:
- $\hat{y}$: unbiased estimate for the mean response $\mu_y$
- $b_0$: unbiased estimate for the intercept $\beta_0$
- $b_1$: unbiased estimate for the slope $\beta_1$
Least squares picks the line with the smallest regression standard error $s_e$. $s_e$ estimates the regression standard deviation $\sigma$ ($s_e^2$ is an unbiased estimate of $\sigma^2$).

Conditions / Assumptions for Inference (LINE):
- The relationship is Linear.
- The observations are Independent.
- The response y varies Normally around the mean.
- The standard deviation of y is the same for all values of x (Equal spread).

Use residual plots to check for regression validity (residuals = $y - \hat{y}$):
- GOOD if the residuals are scattered randomly about 0 with uniform variation.
- BAD if there is a curved pattern OR a fan (cone) shape, i.e. a change in variability across the plot.
Normal quantile plots: a fairly straight plot supports normally distributed residuals.

WATCH for outliers:
- If the point has a large distance from $\bar{x}$, it is said to have high leverage.
- A point significantly above or below the regression line has a high residual.
- An influential point always affects the slope of the estimated regression line.

Confidence Intervals rely on the t distribution with n - 2 degrees of freedom.

If the point is …
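The least-squares formulas from Chapter 6 can be checked numerically. The following is a minimal sketch using only the Python standard library; the function name `fit_line` and the five data points are made up for illustration and do not come from the notes. It computes r from z-scores, builds the line from $b_1 = r \frac{s_y}{s_x}$ and $b_0 = \bar{y} - b_1 \bar{x}$, and then computes the residuals and their standard deviation.

```python
import math
import statistics


def fit_line(xs, ys):
    """Return (r, b0, b1) using the summary-statistic formulas."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)  # n - 1 divisor
    # r = sum(z_x * z_y) / (n - 1)
    r = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
            for x, y in zip(xs, ys)) / (n - 1)
    b1 = r * s_y / s_x          # slope
    b0 = y_bar - b1 * x_bar     # intercept: line passes through (x_bar, y_bar)
    return r, b0, b1


# Hypothetical data, chosen only to illustrate the formulas.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

r, b0, b1 = fit_line(xs, ys)

# Residual = observed - predicted
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

# SD of residuals with the same (n - 1) divisor as SD(y)
resid_sd = statistics.stdev(residuals)
predicted_sd = math.sqrt(1 - r ** 2) * statistics.stdev(ys)
```

With the same divisor on both sides, `resid_sd` and `predicted_sd` should agree up to floating-point error, and `sum(residuals)` should be essentially zero. The usual regression standard error divides by n - 2 instead, so it comes out slightly larger.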
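The expected-count and chi-square formulas for the Chapter 13 test of independence can be sketched the same way. This is an illustrative example only: the helper name `chi_square_independence` and the 2x2 table of counts are invented here, and the p-value step is left out because it needs a chi-square table or a statistics library.

```python
def chi_square_independence(table):
    """Return (statistic, df, expected) for a table of observed counts.

    table is a list of rows, each row a list of counts.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)  # total in table

    # expected count = row total * column total / n
    expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]

    # chi-square = sum of (observed - expected)^2 / expected over all cells
    stat = sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))

    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df, expected


# Hypothetical 2x2 table of counts, for illustration only.
observed = [[20, 30],
            [30, 20]]

stat, df, expected = chi_square_independence(observed)

# Standardized residuals: (observed - expected) / sqrt(expected)
std_resid = [[(o - e) / e ** 0.5 for o, e in zip(obs_row, exp_row)]
             for obs_row, exp_row in zip(observed, expected)]
```

Every expected count here is 25, so the at-least-5 condition holds. The statistic would then be compared with a chi-square distribution with `df` degrees of freedom.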