Types of statistics
1. Descriptive statistics: visual representation (scatter plot, histogram, box plot) and numerical descriptive measures (mean, median, percentile).
2. Inferential statistics: decisions and predictions about a population, generalising from a sample.

Types of data
1. Quantitative (numerical)
   discrete: natural values, e.g. number of kids
   continuous: values are arbitrarily close together, e.g. weight
2. Qualitative (categorical)
   ordinal: categories with a natural order (grade, year); numbers could be assigned to these categories (good, bad)
   nominal: categories without a natural order (eye colour)
3. Interval data: no meaningful zero, but the difference between values is meaningful (temperature).
4. Ratio data: meaningful zero; can multiply and divide (income, weight).
5. Time series data: ordered data values over time.
6. Cross-sectional data: data values observed at a single point in time.

Explanatory variable: predictor, cause, the available variable.
Response variable: predicted, effect, the variable of interest.

The marginal distributions can be displayed on separate bar graphs.
Simpson's paradox: when data is aggregated, the association is reversed, perhaps due to a lurking variable (hospital example).

Sampling
1. Non-statistical sampling: convenience, voluntary.
2. Statistical sampling
   simple random: e.g. drawing from a hat, random numbers; randomly select from the sampling frame
   stratified: divide the population into subgroups (e.g. gender, income) and select from each subgroup using the simple random technique (drug testing)
   systematic: select k, then sample every kth unit after
   cluster: subgroups each representing the entire population (e.g. ZIP code, county, hospitals); select clusters by simple random or stratified sampling
3. Sampling error: sample-to-sample differences.

Bias
1. Selection bias: systematic tendency to exclude one kind of individual from the survey, e.g. a self-selected sample (the more passionate are more likely to respond, so a minority opinion is over-represented). Selection bias is the difference between the population of interest and the effective population.
2. Non-response bias: subjects don't answer or skip questions.
3. Response bias: subjects lie; interviewer effect.

Formulas
   $z = (\text{value} - \text{mean}) / \text{st dev}$
   $b_1 = r \cdot s_y / s_x$
   $b_0 = \bar{y} - b_1 \bar{x}$
   $\hat{y} = b_0 + b_1 x$
   $r = \frac{\sum z_x z_y}{n - 1}$
where $s^2$ is the variance, $s$ is the standard deviation, $r$ is the correlation coefficient, $s_y$ is the standard deviation of the response variable y, $s_x$ is the standard deviation of the explanatory variable x, $\bar{x}$ and $\bar{y}$ are the sample means, $b_1$ is the slope, $b_0$ is the y-intercept, and $\hat{y}$ is the predicted y value.
In the least-squares regression line $\hat{y} = b_0 + b_1 x$, $b_1$ measures the estimated change in the average value of Y as a result of a one-unit change in X (e.g. houses, sq ft).

Summary measures
Range = largest data point - smallest data point.
5-number summary: Min, Q1 (middle observation below the median), Median, Q3 (middle observation above the median), Max.
Box plots (top to bottom): maximum, Q3, median, Q1, minimum.
IQR = Q3 - Q1, and is the least sensitive to outliers.

Correlation coefficient r: a measure of the direction and strength of a linear relationship.
Properties:
   r does not distinguish between x and y
   r has no units of measurement
   r ranges from -1 to +1
   a correlation of zero means no linear relationship
   r can only be used to describe quantitative variables
   correlation is not affected by changes in the centre or scale of either variable
   correlation is sensitive to unusual observations
   $r^2$ is the percentage of variability that can be explained by x; therefore correlation squared = percentage

Interpreting scatterplots
After plotting two variables on a scatterplot, we describe the relationship by examining the form, direction and strength of the association. We look for an overall pattern and for deviations from that pattern.
   Form: linear, curved, clusters, no pattern
   Direction: positive, negative, no direction
   Strength: how closely the points fit the form
   Outliers: deviations from the overall pattern
If the scatter follows a horizontal line, X and Y vary independently: knowing x tells you nothing about y.

Coefficient of determination $r^2$
$r^2$ represents the percentage of the variance in y (vertical scatter from the regression line) that can be explained by changes in x.
The correlation is a measure of spread (scatter) in both the x and y directions in the linear relationship. In regression we examine the variation in the response variable (y) given a change in the explanatory variable (x).

Extrapolation is the use of a regression line for predictions outside the range of x values used to obtain the line. This can be a very bad thing to do: it stretches the line beyond the data and is therefore not accurate.

Linear regression model assumptions
   Quantitative Variables Condition: both variables are quantitative.
   Straight Enough (linearity) Condition: a scatterplot of the data looks reasonably straight (linear).
   Outlier Condition: correlation is highly sensitive to outliers.

Residuals
The distances from each point to the least-squares regression line give us potentially useful information about the contribution of individual data points to the overall pattern of scatter. These distances are called residuals. The sum of the residuals is always 0, because points below the line have negative residuals and points above have positive ones.

A lurking variable is a variable that is not among the explanatory or response variables in a study and yet may influence the interpretation of relationships among those variables.
Two variables are confounded when their effects on a response variable cannot be distinguished from each other. The confounded variables may be either explanatory variables or lurking variables.
Association is not causation. Even if an association is very strong, this is not by itself good evidence that a change in x will cause a change in y.
Categorical explanatory variables: when the explanatory variable is categorical, you cannot make a scatterplot, but you can compare the different categories side by side on the same graph (boxplots, or mean and standard deviation).

Probability
Independent events: e.g. the outcome of a new coin flip is not influenced by the result of the previous flip.
Probability models describe mathematically the outcome of random processes. They consist of two parts:
1. S = sample space: a set or list of all possible outcomes of a random process. An event is a subset of the sample space.
2. A probability for each possible event in the sample space S.
Number of cards in a pack: 52 (4 each of King, Queen, Jack, Ace; 13 each of hearts, diamonds, clubs, spades).
Roll two dice: number of possible outcomes = 36.
Roulette has 38 slots, numbered 00, 0 and 1-36.

Addition Rule
For mutually exclusive events A and B (the events contain no common outcomes, the intersection is empty, they can't both happen, e.g. rolling a 4 and a 6 on one die):
   $P(A \text{ or } B) = P(A) + P(B)$

Multiplication Rule (only if both events are independent)
   $P(A \text{ and } B) = P(A) \cdot P(B)$
When A and B are independent (and so can both happen):
   $P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B)$
Independent events are not related: knowing A does not give information about B, and A does not affect B.
Two coin
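The addition and multiplication rules above can be checked by listing a sample space and counting outcomes. The following is a minimal sketch for two dice, assuming all 36 outcomes are equally likely; the helper name `prob` and the chosen events are illustrative and do not come from the notes.

```python
from fractions import Fraction
from itertools import product

# Sample space S for rolling two dice: every (first, second) pair.
S = list(product(range(1, 7), repeat=2))

def prob(event):
    """P(event) = favourable outcomes / all outcomes (equally likely outcomes assumed)."""
    return Fraction(sum(1 for outcome in S if event(outcome)), len(S))

def first_is_4(outcome):
    return outcome[0] == 4

def first_is_6(outcome):
    return outcome[0] == 6

def second_is_6(outcome):
    return outcome[1] == 6

# Mutually exclusive events: the first die cannot show 4 and 6 at once,
# so P(A or B) = P(A) + P(B).
p_4_or_6 = prob(lambda o: first_is_4(o) or first_is_6(o))

# Independent events: the first die does not affect the second,
# so P(A and B) = P(A) * P(B).
p_both = prob(lambda o: first_is_4(o) and second_is_6(o))

# These two events can both happen, so the overlap is subtracted:
# P(A or B) = P(A) + P(B) - P(A and B).
p_either = prob(lambda o: first_is_4(o) or second_is_6(o))

print(len(S))    # 36
print(p_4_or_6)  # 1/3
print(p_both)    # 1/36
print(p_either)  # 11/36
```

Exact fractions are used instead of floats so that each rule can be compared for equality without rounding error.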
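The regression formulas earlier in these notes (z-scores, r, b1, b0 and the predicted value) can be worked through on a small data set. This is a sketch only: the five data points are made up for illustration and are not from the notes, and the function name `predict` is hypothetical.

```python
from statistics import mean, stdev

# Made-up illustrative data (an assumption, not from the notes):
# x is the explanatory variable, y is the response variable.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar, y_bar = mean(x), mean(y)
s_x, s_y = stdev(x), stdev(y)  # sample standard deviations (divide by n - 1)

# z = (value - mean) / st dev, for every observation
z_x = [(xi - x_bar) / s_x for xi in x]
z_y = [(yi - y_bar) / s_y for yi in y]

# r = sum(z_x * z_y) / (n - 1)
r = sum(zx * zy for zx, zy in zip(z_x, z_y)) / (n - 1)

# Slope and intercept of the least-squares line
b1 = r * s_y / s_x
b0 = y_bar - b1 * x_bar

def predict(x_value):
    """Predicted value y-hat = b0 + b1 * x."""
    return b0 + b1 * x_value

# Residual = observed y minus predicted y; the residuals sum to (about) zero.
residuals = [yi - predict(xi) for xi, yi in zip(x, y)]

print(f"r = {r:.3f}, r^2 = {r ** 2:.3f}")
print(f"y-hat = {b0:.2f} + {b1:.2f} x")
print(f"sum of residuals = {sum(residuals):.1e}")
```

Calling `predict` with an x value far outside 1 to 5 would be extrapolation, which the notes warn against.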