SIMPLE LINEAR REGRESSION

REMINDER OF STRAIGHT LINES (CURVES THAT DON'T TURN)

Forms of the equation of a line:
  y = mx + b
  ax + by = c
  y - b = m(x - a)
  y = b0 + b1x
The a's, b's, and c's are not equivalent in these equations.

For slanted or horizontal lines:
  b0 = y-intercept = the value of y when x is zero
  b1 = slope (aka gradient) = the amount by which y changes when x increases by one unit

For slanted or vertical lines:
  x-intercept = the value of x when y is zero = -b0/b1 (for slanted lines only)

Special Cases
  Horizontal lines: equation y = b; slope = 0; no x-intercept if b ≠ 0; y-intercept = b.
    The line y = 0 (the x-axis) has no unique x-intercept.
  Vertical lines: equation x = a; slope is undefined; x-intercept = a; no y-intercept if a ≠ 0.
    The line x = 0 (the y-axis) has no unique y-intercept.

Lines that slant bottom left to top right have positive slopes.
Lines that slant top left to bottom right have negative slopes.
Slanted or horizontal lines that pass through the origin (0, 0) have a y-intercept of 0.

Examples:
  Slope = b1 = -2; Y-int = b0 = 13; X-int = -13/(-2) = 13/2
  Slope = b1 = 3; Y-int = b0 = 2; X-int = -2/3

2022 Radha Bose, Florida State University Department of Statistics

Residual = e = y - ŷ = vertical distance between data point and regression curve.

Linear regression is a mathematical process of finding the linear function that best fits the data, that is, the linear function that produces the smallest residuals. When there is only one predictor present (simple linear regression), the linear function is visualized as a straight line.

The Least Squares method of finding the slope b1 and the y-intercept b0 of the best-fitting straight line (the line of best fit) is based on minimizing the sum of the squared residuals.

The minimization process causes the regression curve to pass through the centroid (x̄, ȳ), so the predicted y-value obtained from the regression equation will be the y-mean when the input is the x-mean.

Sum of predicted responses = sum of observed responses (Σŷ = Σy).
Therefore, average of predicted responses = average of observed responses.
Therefore, sum of residuals = 0 and
average of residuals = 0.

R² = explained deviation / total deviation (computed from sums of squared deviations) is a measure of fit: the higher the better.

Notes:
  ŷ = regression equation.
  Residuals add to 0, so the residuals are squared.
  The line of best fit crosses through the centroid.
  To find the centroid: (2 + 4 + 8 + 6)/4 = 5 (x-coordinate); (5 + 8 + 9 + 2)/4 = 6 (y-coordinate).

[Figure: total deviation (y - ȳ) split into explained deviation (ŷ - ȳ) and unexplained deviation (the residual, e = y - ŷ).]
x̄ = 5 and ȳ = 6, so the centroid is the point with coordinates (5, 6).

Data from STA 2171 0001 SuC22, collected on May 9th, 2022.
[Figure: scatterplot of student heights vs. mom heights, along with the line of best fit.]

Studht (Y): 75 75 74 73 70 69 67 67 66 66 66 66 64 63 62 61 61 60
Momht (X):  72 64 70 70 59 64 68 64 66 63 61 61 67 64 64 66 64 62

Calculator: STAT > TESTS > LinRegTTest; VARS > Y-VARS > Function.

Slope = b1 = b = .627 inches/inch. Units for slope are y-units/x-unit.
Y-int = b0 = a = 26.240 inches. Units for y-intercept are y-units.
Regression equation: ŷ = 26.240 + .627x. Units are not included in the equation.
Alternatively: predicted student height = 26.24 + .627(mom's height).
r = .44; R² = .197 or 19.7%. In the simple linear case, R² = r².

Y-variable: output, dependent, response; vertical axis; potentially influenced by X; this is the one we want to predict.
X-variable: input, independent, predictor, explanatory, regressor; horizontal axis; potentially influences Y.
Simple regression: only one predictor. Multiple regression: more than one predictor.
Variable without hat: observed, sample, real, actual.
Variable with hat: predicted, estimated, fitted, forecasted.
Observed data points are (x, y). Points on the curve are (x, ŷ).
Format of simple linear regression equation: ŷ = b0 + b1x

Interpreting the slope, y-intercept, and R² in words, in context
Slope: the y-variable is expected to go up/down by b1 y-units when the x-variable goes up by one x-unit. The slope will always mean something in the
context of the data.
Y-intercept: the y-variable is predicted to be b0 y-units when the x-variable is zero x-units. The y-intercept may or may not mean something in the context of the data.
R²: About R²% of the variation in the y-variable is explained by the linear association with the x-variable.

For the height data:
  Slope: student height is expected to go up by .627 inches when mom's height increases by 1 inch.
  Y-intercept: student height is predicted to be 26.240 inches when mom's height is 0 inches.
  R²: about 19.7% of the variation in student height is explained by the linear association with mom's height.

Predicting the Response

The prediction for x̄ is always ȳ.
If the regression curve fits well, ŷ (from the equation) is the best prediction.
If the regression curve does not fit well, ȳ (average y) is the best prediction.
The predicted response will usually be within the range of observed y-values when the predictor x is within the range of observed x-values (when xmin ≤ x ≤ xmax, then usually ymin ≤ ŷ ≤ ymax).

Example: Predict the student height if the mom is 60 inches tall.
  Assume the regression line is a well-fitting line: ŷ = 26.240 + .627(60) = 63.86 inches.
  Assume the regression line is not a well-fitting line: ȳ = 66.9 inches (run 1-Var Stats on the y list to get the average of the y-variable).

Extrapolation is when we make predictions for x-values that are far outside the range of available x-values. For example, if we were to predict the student height for a mom who was 48 inches tall, that would be extrapolation. Extrapolation is generally not a safe thing to do, since we can never foretell how Y is going to change with respect to X beyond the X window that we currently have; the relationship between X and Y that we are seeing in the sample may not hold outside of the sample.

FEATURES OF THE ASSOCIATION BETWEEN THE VARIABLES THAT CAN BE SEEN ON A SCATTERPLOT

Feature: Form
  Possibilities:
  - Linear or non-linear form (presence of pattern); association likely
  - No form (absence of pattern); association not
likely
  How to identify it: look at the pattern formed by the points

Feature: Direction
  Possibilities:
  - Positive direction (bottom left to top right) indicates that Y increases as X increases
  - Negative direction (top left to bottom right) indicates that Y decreases as X increases
  A pos/neg association between the variables is also indicated by a pos/neg slope b1 or a pos/neg linear correlation coefficient r.
  How to identify it: look at the direction of the points

Feature: Strength
  Possibilities:
  - Stronger association = less scatter with a clearer pattern …
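The least-squares results quoted earlier for the height data (slope, y-intercept, r, and R²) can be checked with a short script. This is a minimal sketch using the textbook formulas b1 = Sxy/Sxx and b0 = ȳ - b1·x̄ in plain Python, not the calculator route used in the notes. The names `least_squares`, `stud_ht`, and `mom_ht` are illustrative choices and do not come from the original.

```python
# Heights in inches, copied from the data table in these notes.
stud_ht = [75, 75, 74, 73, 70, 69, 67, 67, 66, 66, 66, 66, 64, 63, 62, 61, 61, 60]  # Y
mom_ht = [72, 64, 70, 70, 59, 64, 68, 64, 66, 63, 61, 61, 67, 64, 64, 66, 64, 62]   # X


def least_squares(x, y):
    """Return (b0, b1, r) for the least-squares line y-hat = b0 + b1*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx                  # slope
    b0 = y_bar - b1 * x_bar         # y-intercept; forces the line through the centroid
    r = sxy / (sxx * syy) ** 0.5    # linear correlation coefficient
    return b0, b1, r


b0, b1, r = least_squares(mom_ht, stud_ht)

# Fitted values (y-hat) and residuals (e = y - y-hat) for every data point.
fitted = [b0 + b1 * xi for xi in mom_ht]
residuals = [yi - fi for yi, fi in zip(stud_ht, fitted)]

x_bar = sum(mom_ht) / len(mom_ht)
y_bar = sum(stud_ht) / len(stud_ht)

# R^2 as explained variation over total variation (sums of squared deviations).
explained = sum((fi - y_bar) ** 2 for fi in fitted)
total = sum((yi - y_bar) ** 2 for yi in stud_ht)
r_squared = explained / total

print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
print(f"r = {r:.2f}, R^2 = {r_squared:.3f}")
print(f"sum of residuals = {sum(residuals):.6f}")
print(f"prediction at x-bar = {b0 + b1 * x_bar:.3f}, y-bar = {y_bar:.3f}")
```

Up to rounding, the printed slope and intercept should agree with the .627 and 26.240 reported in the notes. The last two lines illustrate that the residuals sum to zero and that the line passes through the centroid.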
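The prediction example and the extrapolation warning from the notes can be combined into one small helper. The sketch below uses the rounded coefficients and the average student height quoted in the notes, along with the smallest and largest mom heights from the data table. The function name `predict_student_height` and the `line_fits_well` flag are hypothetical. Flagging every x-value outside the observed range is a simplifying assumption, since the notes describe extrapolation as predicting "far outside" that range.

```python
B0 = 26.240             # y-intercept quoted in the notes (inches)
B1 = 0.627              # slope quoted in the notes (inches per inch)
Y_BAR = 66.9            # average student height quoted in the notes (inches)
X_MIN, X_MAX = 59, 72   # smallest and largest mom heights in the data table


def predict_student_height(mom_height, line_fits_well=True):
    """Return (prediction, is_extrapolation) for a mom height in inches.

    If the line fits well, the prediction is y-hat from the regression
    equation; otherwise it falls back to y-bar, as described in the notes.
    """
    # Assumption: any x outside the observed range counts as extrapolation.
    is_extrapolation = not (X_MIN <= mom_height <= X_MAX)
    prediction = B0 + B1 * mom_height if line_fits_well else Y_BAR
    return prediction, is_extrapolation


# Mom is 60 inches tall: inside the observed range.
good_fit, flag_60 = predict_student_height(60)
poor_fit, _ = predict_student_height(60, line_fits_well=False)
print(f"well-fitting line: {good_fit:.2f} inches, extrapolation: {flag_60}")
print(f"poorly fitting line: {poor_fit:.1f} inches")

# Mom is 48 inches tall: outside the observed range, so the result is flagged.
risky, flag_48 = predict_student_height(48)
print(f"x = 48: {risky:.2f} inches, extrapolation: {flag_48}")
```

Returning the flag alongside the number keeps the warning attached to the prediction, so the caller decides whether to trust an extrapolated value.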