UMD STAT 100 - Chapter 6

Chapter 6

Scatterplot: plots one quantitative variable against another to see if there is an association. Look for:
1. Direction: negative or positive
2. Form: linear, curved, fanned, or no pattern
3. Strength: points tight together or spread out all over the place
4. Unusual features: outliers

Explanatory (predictor) variable: x, independent. If this is categorical you can't make a scatterplot; instead compare the categories side by side with boxplots, means, and standard deviations.
Response variable: y, dependent.

Measuring correlation (linear association between 2 quantitative variables):
1. Standardize each value
2. Compute the correlation coefficient $r$, which measures the strength of the linear association

Properties of $r$:
- has no units
- is invariant to changes of scale or center, but is sensitive to outliers
- can't be larger than $+1$ or smaller than $-1$
- doesn't distinguish x and y
- $r = -1$ or $r = +1$: the points fall exactly on a line
- otherwise $r$ is between $-1$ and $+1$
- $r = 0$: no linear association

Beforehand you must check 3 conditions:
1. Quantitative variables condition: both variables must be quantitative (only quantitative variables have units)
2. Linearity condition: only linear associations can be measured
3. Outlier condition: there shouldn't be outliers; if there is one, give the association with and without it

Correlation table: rows and columns of the table name the variables and the cells hold the correlations. The diagonal always shows 1, and the upper and lower halves of the table are symmetric.

Causation: there is no way to be sure from high correlation alone that one variable causes another, because of the possibility of lurking variables.
Lurking variable: a variable not included in the study design that does have an effect on the variables studied; it can falsely suggest a relationship. Just because 2 variables have high correlation doesn't mean that one causes the other.
Confounding: 2 variables' effects on a response variable cannot be distinguished from each other; the confounding variable may be explanatory or lurking.

Residual: observed y minus predicted y, $e = y - \hat{y}$. It tells us how far the model's prediction is from the observed value.
Predicted value of y: $\hat{y} = b_0 + b_1 x$, where $b_0$ and $b_1$ are numbers estimated from the data. The sum of the residuals should be 0.

Regression line (line of best fit): the line for which the sum of the squared residuals is smallest (the least squares line).
Slope: $b_1 = r \, s_y / s_x$, where $r$ is the correlation, $s_y$ is the standard deviation of the response variable y, and $s_x$ is the standard deviation of the explanatory variable x.
Intercept: $b_0 = \bar{y} - b_1 \bar{x}$ (use it to find the y-intercept), where $\bar{x}$ and $\bar{y}$ are the means of the x and y variables.

Correlation: a measure of spread in both the x and y directions in the linear relationship.
Regression: examines the variation in y with respect to x; looks at linearity.
Standardizing the predictor and response variables gives SDs of 1 and means of 0.

Regression conditions:
1. Linearity
2. Outlier: no outliers
3. Independent residuals
4. Equal spread: consistent spread of points around the line. Test it by plotting the residual values around the line at 0 and looking at the SD of the residuals. If the residuals are all 0, the fit is perfect.

Divide a residual by the residual SD to get how many SDs the prediction is from the actual value; if it's about 0.66 then it is typical.

Tips:
- don't correlate categorical data
- make sure the association is linear
- causation means one variable causes the other, but this isn't what correlation means
- don't extrapolate far beyond the data

Residual plot: a scatterplot of residuals versus x values should stretch horizontally with no bends or outliers.

$r^2$: strength of the linear association, between 0% and 100%. It is the fraction of the data's variation accounted for by the model, that is, the percentage of variance in y that can be explained by changes in x. It shows how successful the regression is in fitting y to x; want it above 0.5.
- if it is 100%, it's a perfect fit (residual SD $s_e = 0$)
- if it is 0.72, then change in x explains only 72% of the change in y; the rest of the change in y (the vertical scatter) must be explained by something other than x
- if $r^2 = 0$, none of the variance in the data is in the model; all of it is still in the residuals

Chapter 11

Central Limit Theorem: the mean of a random sample has a sampling distribution whose shape can be approximated by a Normal model. The larger the sample, the better the approximation.
Means have smaller SDs than individuals: means are less variable than individual observations.
The Normal model for the sampling distribution of the mean (of quantitative data) has $SD(\bar{x}) = \sigma / \sqrt{n}$.
Unbiased estimate: correct on average for the population.

Find t-scores to find probabilities: $t = (\bar{x} - \mu) / SD(\bar{x})$, then go to the table to find P.
CI: $\bar{x} \pm t^* \times SE(\bar{x})$, with $t^*$ found on Table T (need df and confidence level; 95% leaves 0.025 in each tail).
One-sample t-interval, example: we are 95% confident that the true mean profit of policies sold by this sales rep is contained in the interval from 942.48 to 1935.32.

t-models: unimodal, symmetric, and bell shaped just like the Normal, but with a narrower peak and fatter tails. As df increases, the t-model looks more like the Normal. t-models do not have less spread than the Normal distribution.

Categorical data (proportions): $p$ = population proportion, $p + q = 1$; then find … and go to the table to find the CI. $ME = z^* \times SE$. SE is pretty much the same as SD.

Degrees of freedom: $df = n - 1$

Assumptions and conditions:
1. Independence
2. Randomization
3. 10% condition: sample size no more than 10% of the population
4. Nearly Normal condition: data unimodal, symmetric, and not skewed
   - when $n < 15$, the data must be close to Normal and without outliers
   - when $15 \le n \le 40$, mild skewness is acceptable, but not outliers
   - when $n > 40$, the t statistic will be valid even with strong skewness

Hypothesis test:
1. State the null and alternative hypotheses, $H_0$ versus $H_a$: $H_0: \mu = \mu_0$ versus $H_a: \mu \ne \mu_0$
2. Decide on a one-sided or two-sided test
3. Choose a significance level $\alpha$
4. Calculate $t$ and its degrees of freedom
5. Find the area under the curve with Table T ($t$ is calculated; $t_\alpha$ comes from the df chart):
   - if $t > t_\alpha$, the P-value is significant: reject the null
   - if $t < t_\alpha$, the P-value is insignificant: fail to reject the null
6. State the P-value and interpret the result: go to the t-table chart and find which alphas the calculated t falls between. All we can say is that the P-value lies between the P-values of these two critical values, so $0.01 < \text{P-value} < 0.025$. We can reject the null hypothesis at the … Statement: the P-value is equal to $\alpha$.

Old-school P-value: find $z = (y - \bar{y}) / s$ and then go to the z chart.
A $100(1 - \alpha)\%$ confidence level is equivalent to the …

Sample size calculation:
If the sample size is 60 or more, use the z-value from the Normal model: $n = (z^* \times s / ME)^2$
If the sample size is smaller:
1. Use $n = (z^* \times s / ME)^2$ to find a first n
2. Calculate $df = n - 1$
3. Then find $t^*$
4. $n = (t^*_{n-1} \times s / ME)^2$

Chapter 12

Comparing 2 means: use side-by-side boxplots OR the difference of the means in the population at large (example: cardholders with the promotion and cardholders without the promotion). Look at the model, the SD of the difference in means, the CI, and test a hypothesis.
SD of the difference in means: when the groups are independent, the variance of the difference is the sum of their variances …
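
The Chapter 6 recipe (standardize, compute r, get the slope and intercept from r and the SDs, then check residuals and r²) can be sketched in Python. This is only an illustration using the standard library: the function names and the data values are made up, not taken from the course.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Correlation r: standardize each value, then average the products
    of the z-scores using n - 1 (sample standard deviations)."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    z_x = [(x - x_bar) / s_x for x in xs]
    z_y = [(y - y_bar) / s_y for y in ys]
    return sum(a * b for a, b in zip(z_x, z_y)) / (n - 1)


def least_squares_line(xs, ys):
    """Return (b0, b1) with slope b1 = r * s_y / s_x and
    intercept b0 = y_bar - b1 * x_bar."""
    r = correlation(xs, ys)
    b1 = r * stdev(ys) / stdev(xs)
    b0 = mean(ys) - b1 * mean(xs)
    return b0, b1


def residuals(xs, ys, b0, b1):
    """Residual = observed y minus predicted y."""
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]


def r_squared(xs, ys, b0, b1):
    """Fraction of the variation in y accounted for by the line."""
    y_bar = mean(ys)
    sse = sum(e * e for e in residuals(xs, ys, b0, b1))
    sst = sum((y - y_bar) ** 2 for y in ys)
    return 1 - sse / sst


# Hypothetical data, invented for this example only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = least_squares_line(xs, ys)
print("r =", correlation(xs, ys))
print("line: y_hat =", b0, "+", b1, "* x")
print("sum of residuals =", sum(residuals(xs, ys, b0, b1)))
print("r^2 =", r_squared(xs, ys, b0, b1))
```

Up to floating-point rounding, the residuals sum to 0, and the r² computed from the residuals agrees with the square of r, which matches the notes above. Swapping `xs` and `ys` in `correlation` gives the same r, since r doesn't distinguish x and y.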
