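Multiple Regression Analysis (MLR)

MLR is an extension of SLR for investigating how a response $y$ is affected by several independent variables $x_1, \dots, x_k$.

Model: $y = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k + \varepsilon$, where $\varepsilon$ is the error.

The estimated regression equation is found by minimizing the sum of squared residuals (the least squares method). This equation is $\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k$, where $b_1, b_2, \dots, b_k$ are estimates of $\beta_1, \beta_2, \dots, \beta_k$.

R-Squared

$R^2$ describes the relative improvement from using the prediction equation instead of the sample mean $\bar{y}$.

SST $= \sum (y - \bar{y})^2$: sum of squares total; measures variation about the mean in the responses.
SSE $= \sum (y - \hat{y})^2$: sum of squares error; measures spread about the regression equation.
SSR $= \mathrm{SST} - \mathrm{SSE} = \sum (\hat{y} - \bar{y})^2$: measures variation explained by the regression model.
$R^2 = \mathrm{SSR}/\mathrm{SST}$: tells us what proportion of the variation in the responses is explained by the regression model.

R2 Properties
1. $R^2$ is between 0 and 1.
2. $R^2 = 1$ when all residuals are 0.
3. $R^2 = 0$ when each $\hat{y} = \bar{y}$.
4. $R^2$ gets larger, or at least stays the same, whenever an independent variable is added to the multiple regression model.
5. $R^2$ does not depend on the units of measurement.

Hypothesis Test (individual coefficient)
1. $H_0: \beta_i = 0$ vs $H_a: \beta_i \neq 0$
2. Test statistic: $t = (b_i - 0)/\mathrm{SE}(b_i)$
3. P-value: two-tail probability, from the t distribution, of values larger in magnitude than the observed t statistic. The t distribution has df $= n -$ (# of parameters) $= n - k - 1$.
4. Conclusion: compare the p-value to the significance level if a decision is needed.

Confidence Interval for $\beta_i$
Estimate $\pm$ Margin of Error: $b_i \pm t_{n-k-1}\,\mathrm{SE}(b_i)$

More Test Statistics
$z = (\hat{p} - p_0)/\mathrm{SE}_{p_0}$, with the standard error evaluated at the null value $p_0$.

Hypothesis Test (overall model)
$H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0$
$H_a$: at least one $\beta_i \neq 0$

ANOVA Table

An estimate for $\sigma$: $s_e = \sqrt{\sum (\text{residuals})^2 / (n - k - 1)}$, or $s_e = \sqrt{\mathrm{MSE}}$.

| Source | Degrees of Freedom | Sum of Squares | Mean Sum of Squares | F Stat | P-Value |
|---|---|---|---|---|---|
| Model (Regression) | (# of $\beta$'s) $- 1$, or $k$ | SSR | MSR $=$ SSR$/k$ | MSR/MSE | $P(F_{k,\,n-k-1} \geq F)$ |
| Error (Residuals) | $n -$ (# of $\beta$'s), or $n - k - 1$ | SSE | MSE $=$ SSE$/(n - k - 1)$ | | |
| Total | $n - 1$ | SST | | | |

F stat: a test comparing statistical models that have been fitted to a data set.
F distribution: $F_{a,b}$; here $k$ = number of predictor variables.
Test statistic: $F = \mathrm{MSR}/\mathrm{MSE} \sim F_{k,\,n-k-1}$
P-value: the area is always to the right (one-sided).
Conclusion: compare the p-value to $\alpha$. If p-value $\leq \alpha$, reject $H_0$. If p-value $> \alpha$, fail to reject $H_0$.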
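The sums of squares and the overall F statistic described above can be checked by hand on a small data set. The following is a minimal sketch using only the Python standard library; the function names `solve` and `fit_mlr` and the data values are made up for illustration, and the normal-equations solver stands in for whatever statistics package the course actually uses.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented copy of the inputs
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_mlr(xs, y):
    """Least squares fit of y on the predictor rows xs, with the ANOVA quantities."""
    n, k = len(y), len(xs[0])
    design = [[1.0] + list(row) for row in xs]  # leading 1 gives the intercept b0
    p = k + 1
    # Normal equations: (X'X) b = X'y
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(p)]
    b = solve(xtx, xty)
    y_hat = [sum(bi * ri for bi, ri in zip(b, r)) for r in design]
    y_bar = sum(y) / n
    sst = sum((yi - y_bar) ** 2 for yi in y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    ssr = sum((fi - y_bar) ** 2 for fi in y_hat)
    msr = ssr / k
    mse = sse / (n - k - 1)
    return {"b": b, "SST": sst, "SSE": sse, "SSR": ssr,
            "R2": ssr / sst, "MSR": msr, "MSE": mse,
            "F": msr / mse, "s_e": mse ** 0.5}


# Made-up data: two predictors (k = 2), six observations (n = 6).
xs = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6), (6, 5)]
y = [6.1, 6.9, 13.2, 13.8, 20.1, 20.9]

fit = fit_mlr(xs, y)
for name in ("SST", "SSE", "SSR", "R2", "F"):
    print(name, round(fit[name], 4))
```

The printed F value would then be compared against an $F_{k,\,n-k-1}$ table (here $F_{2,3}$) to get the right-tail p-value; the sketch stops short of that lookup because the standard library has no F distribution.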
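Adjusted R2

$R^2_{\text{adj}} = 1 - \mathrm{MSE}/\mathrm{MST}$, where $\mathrm{MST} = \mathrm{SST}/(n - 1)$.

Conditions for Multiple Regression Model (LINE)
1. Linearity: look at all scatterplots of $y$ vs $x_i$ to see if they are linear.
2. Independence: assume observations come from an SRS.
3. Normality: make a histogram or a Q-Q plot of the residuals.
4. Equal Spread: in the residual plots, the residuals are spread equally far from zero.

Simple Linear Regression (SLR)

Inference for SLR

The data in a scatterplot are a random sample from a population that may exhibit a linear relationship between $x$ and $y$. Different sample, different plot.

Sample regression line: $\hat{y} = b_0 + b_1 x$.
Population model: the population linear equation is $\mu_y = \beta_0 + \beta_1 x$.
Sample data then fit the model (Data = fit + residual): $y_i = (\beta_0 + \beta_1 x_i) + \varepsilon_i$, where the $\varepsilon_i$ are independent and normally distributed $N(0, \sigma)$.
Linear regression assumes equal variance of $y$: $\sigma$ is the same for all values of $x$.

Conditions for SLR
1. Linearity
2. Independence
3. Normality
4. Equal Spread

Estimating the Parameters

$\mu_y = \beta_0 + \beta_1 x$, where $\beta_0$ = intercept and $\beta_1$ = slope.

The least squares regression line $\hat{y} = b_0 + b_1 x$ is the best estimate of the true population regression line.

The population standard deviation $\sigma$ for $y$ at any given value of $x$ represents the spread of the normal distribution of the $\varepsilon_i$ around the mean $\mu_y$. It is estimated by the standard error about the regression line:

$s_e = \sqrt{\sum (\text{residuals})^2 / (n - 2)}$

Confidence Interval for Regression Parameters

Estimating the regression parameters $\beta_0, \beta_1$ is a case of one-sample inference with unknown population variance. Rely on the t distribution with $n - 2$ degrees of freedom.

$\mathrm{SE}(b_1) = s_e / \sqrt{\sum (x - \bar{x})^2}$, therefore $\mathrm{SE}(b_1) = s_e / (s_x \sqrt{n - 1})$

Confidence interval for $\beta_1$: $b_1 \pm t_{n-2}\,\mathrm{SE}(b_1)$

Hypothesis Test
$H_0: \beta_1 = 0$ ($x$ and $y$ are not linearly related)
$H_a: \beta_1 \neq 0$ ($x$ and $y$ are linearly related)
T-test: $t = (b_1 - 0)/\mathrm{SE}(b_1)$
P-value: sum of the areas in both tails of the $t_{n-2}$ distribution, beyond $-|t_{\text{obs}}|$ and $+|t_{\text{obs}}|$.
Conclusion: the same as usual.

Confidence Interval for $\mu_y$ (mean response)

At a chosen value $x^*$: $\hat{y} \pm t_{n-2}\,\mathrm{SE}(\hat{\mu}_y)$, where $\mathrm{SE}(\hat{\mu}_y) = s_e \sqrt{\dfrac{1}{n} + \dfrac{(x^* - \bar{x})^2}{\sum (x - \bar{x})^2}}$

Predicting Individual Response (Prediction Intervals)

At a chosen value $x^*$: $\hat{y} \pm t_{n-2}\,\mathrm{SE}(\hat{y})$, where $\mathrm{SE}(\hat{y}) = s_e \sqrt{1 + \dfrac{1}{n} + \dfrac{(x^* - \bar{x})^2}{\sum (x - \bar{x})^2}}$

Unusual Observations

Outliers stand out through either (1) a large residual or (2) a large distance from $\bar{x}$ (high leverage). A high-leverage point is influential if omitting it changes the slope of the regression model.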
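As a rough sketch of the SLR quantities above, and of the "omit the point and refit" check for influence, the following standard-library Python computes $b_0$, $b_1$, $s_e$, $\mathrm{SE}(b_1)$, the t statistic, and the two standard errors used for the mean-response and prediction intervals. The helper names `fit_slr` and `standard_errors_at` and the data are hypothetical; the last observation is deliberately placed far from $\bar{x}$ to act as a high-leverage point.

```python
def fit_slr(x, y):
    """Least squares line with the inference quantities from the notes."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)  # equals (n - 1) * s_x ** 2
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s_e = (sse / (n - 2)) ** 0.5   # standard error about the regression line
    se_b1 = s_e / sxx ** 0.5       # SE(b1)
    return {"n": n, "x_bar": x_bar, "sxx": sxx, "b0": b0, "b1": b1,
            "s_e": s_e, "SE_b1": se_b1, "t": b1 / se_b1}


def standard_errors_at(fit, x_star):
    """Return (SE for the mean response, SE for an individual prediction) at x_star."""
    core = 1 / fit["n"] + (x_star - fit["x_bar"]) ** 2 / fit["sxx"]
    se_mean = fit["s_e"] * core ** 0.5
    se_pred = fit["s_e"] * (1 + core) ** 0.5
    return se_mean, se_pred


# Made-up data: the first five points lie near y = 2x; the last point sits far
# from x-bar (high leverage) and well below that line.
x = [1, 2, 3, 4, 5, 15]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]

full = fit_slr(x, y)
reduced = fit_slr(x[:-1], y[:-1])  # refit with the high-leverage point omitted

print("slope with all points:   ", round(full["b1"], 3))
print("slope without last point:", round(reduced["b1"], 3))
print("t statistic (all points):", round(full["t"], 3))

se_mean, se_pred = standard_errors_at(full, x_star=4)
print("SE mean response:", round(se_mean, 3), " SE prediction:", round(se_pred, 3))
```

A large change in slope between the two fits is what marks the omitted point as influential. To turn the standard errors into intervals, multiply by a $t_{n-2}$ critical value taken from a table.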
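Influential point: similar to a high-leverage point; it does not necessarily have a large residual.

Normally Distributed Populations

When a variable in a population is normally distributed, the sampling distribution of $\bar{x}$ for all possible samples of size $n$ is also normally distributed:

$z = \dfrac{\bar{x} - \mu}{\sigma / \sqrt{n}}$

Central Limit Theorem: if a population has mean $\mu$ and std dev $\sigma$, then for a large enough sample ($n \geq 25$; $n \geq 40$ is preferable), $\bar{x} \approx N(\mu, \sigma/\sqrt{n})$.

Confidence Interval for pop. mean: $\bar{x} \pm z^* \, \sigma/\sqrt{n}$

If $\sigma$ is unknown, then estimate it by using the sample std dev $s$. This is a good estimate if $n$ is large. The interval becomes $\bar{x} \pm t_{n-1} \, s/\sqrt{n}$.

Robustness: the t procedures are robust to small deviations from normality, meaning that the results will not be affected much. Some factors that matter are (1) random sampling and (2) outliers and skewness. Specifically:
1. When $n < 15$, the data must be close to normal and without outliers.
2. When $15 \leq n < 40$, mild skewness is acceptable, but no outliers.
3. When $n \geq 40$, t statistics are valid even with strong skewness.

Comparing Two Groups

Two populations: population 1 with $(\mu_1, \sigma_1)$ and population 2 with $(\mu_2, \sigma_2)$, both parameters. Take an SRS from each population.

Sample 1: $\bar{x}_1, s_1, n_1$
Sample 2: $\bar{x}_2, s_2, n_2$

Hypothesis Tests for Two Population Means

$H_0: \mu_1 = \mu_2$ vs $H_a: \mu_1 \neq \mu_2$, $\mu_1 > \mu_2$, or $\mu_1 < \mu_2$

Equivalently, $H_0: \mu_1 - \mu_2 = 0$ vs $H_a: \mu_1 - \mu_2 \neq 0$, $\mu_1 - \mu_2 > 0$, or $\mu_1 - \mu_2 < 0$

Test statistic: $z = \dfrac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}}$

When $s_1$ and $s_2$ replace the unknown $\sigma_1$ and $\sigma_2$, the statistic is referred to a t distribution with df degrees of freedom.

An estimate of $\mu_1 - \mu_2$: $\bar{x}_1 - \bar{x}_2$

Confidence Interval for $\mu_1 - \mu_2$: $(\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2,\,\text{df}} \sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}$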
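To make the two-sample formulas concrete, here is a small sketch in standard-library Python. The notes do not say which df to use, so this example assumes the conservative choice df $= \min(n_1, n_2) - 1$; the function name `two_sample_t`, the data, and the critical value passed in as `t_star` (2.776, the 95% table value for df = 4) are all illustrative assumptions.

```python
from statistics import mean, variance


def two_sample_t(sample1, sample2, t_star):
    """Two-sample t statistic for H0: mu1 - mu2 = 0 and a CI for mu1 - mu2.

    t_star is a t critical value looked up by the caller for the chosen df.
    """
    n1, n2 = len(sample1), len(sample2)
    diff = mean(sample1) - mean(sample2)  # estimate of mu1 - mu2
    # Standard error: sqrt(s1^2 / n1 + s2^2 / n2), with sample variances.
    se = (variance(sample1) / n1 + variance(sample2) / n2) ** 0.5
    t = (diff - 0) / se
    df = min(n1, n2) - 1  # ASSUMPTION: conservative df, not stated in the notes
    ci = (diff - t_star * se, diff + t_star * se)
    return {"diff": diff, "SE": se, "t": t, "df": df, "CI": ci}


# Made-up samples of size 5 each.
group1 = [5.1, 4.9, 5.6, 5.8, 6.0]
group2 = [4.1, 4.5, 4.0, 4.8, 4.6]

result = two_sample_t(group1, group2, t_star=2.776)
print("estimate of mu1 - mu2:", round(result["diff"], 3))
print("t statistic:", round(result["t"], 3), "with df =", result["df"])
print("95% CI:", tuple(round(v, 3) for v in result["CI"]))
```

If the interval excludes 0, the two-sided test at the matching significance level rejects $H_0: \mu_1 - \mu_2 = 0$.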