Two Types of Statistics
1. Descriptive: numbers that summarize the data
2. Inferential: drawing conclusions based on data

Types of Data
Quantitative (Numerical): a measured variable, using units
  Discrete: there is a natural gap between numbers (whole numbers)
  Continuous: values can be arbitrarily close together
Qualitative (Categorical): classifying into categories
  Ordinal: natural order
  Nominal: no natural order

Bias
1. Selection Bias: excludes one kind of individual from the survey
2. Non-Response Bias: subjects don't respond or skip questions
3. Response Bias: subjects lie, or something in the survey influences responses
Voluntary Response Bias: individuals choose whether to participate in the sample
Convenience Sample: consists of people who are conveniently available
Undercoverage: gives a part of the population less representation in the sample than it has in the population

Sampling
SRS (simple random sample): every sample of size n in the population has an equal chance of being selected
Stratified: splits the population into homogeneous groups, then an SRS is taken from each group
Cluster: split the population into similar groups, randomly select a group, and perform a census of it
Systematic: every kth person is sampled; the starting point is random

Frequency: counts
Relative Frequency: %

4 Kinds of Data: nominal and ordinal (defined above), interval, ratio
Interval: no meaningful zero point, but the difference between 2 points is meaningful
Ratio: meaningful zero point

Time Series: the same variable measured at regular intervals over time
Cross-Sectional: several variables measured at the same point in time

Independent: the distribution of one variable is the same for all categories of another
One variable is always higher/lower in every category
Pie Chart: best if you have one variable. Must total 100%. Categories cannot overlap.
Area Principle: the area in a graph must correspond to the magnitude it represents
Categorical Data: data must be counts or % of individuals in categories

Contingency Table
Marginal Distribution: the distribution of one variable on its own, read from the table margins. Divide each row/column total by the complete (grand) total.
Conditional Distribution: the distribution of one variable within a single category of the other. Divide the count in each category by the total for that row/column and multiply by 100.
Simpson's Paradox: an association or comparison that holds for all of several groups can reverse direction when the data are combined to form a single group

Histograms
Unimodal: one main peak
Bimodal: two main peaks
Uniform: all bars about the same height
Symmetric: the two halves are mirror images
Skewed to the right: right tail stretches out farther than the left
Skewed to the left: left tail stretches out farther than the right
Range: maximum - minimum (the width of the data)
Relative Frequency = frequency / total
If there are outliers, the mean is pulled toward them
If the distribution is skewed, the mean is pulled toward the tail
Median: use when the distribution is skewed
Mean: use when the distribution is unimodal and symmetric

Box Plots
Symmetric: mean and median close together; quartiles the same distance from the median
IQR (Interquartile Range) = Q3 - Q1
Outliers: outside the fences
Upper fence = Q3 + 1.5 × IQR
Lower fence = Q1 - 1.5 × IQR
5-Number Summary: min, Q1, median, Q3, max (25% of the data in between each pair)

Standard Deviation
1. Find the mean
2. Subtract the mean from all the numbers; these numbers are called deviations
3. Square the deviations
4. Add up the squared deviations
5. Divide by the number of values minus one (n - 1). This answer gives you the variance (spread about the mean)
6. Take the square root of the variance

Scatter Plots
Explanatory Variable: the predictor variable. Used in a relationship to explain or predict changes in the values of another variable. Goes on the x-axis (usually price).
Response Variable: one whose value depends on the value of another variable
Strength: very strong, strong, moderately strong, moderate, weak
Correlation (r): r = 0 means no linear relationship. Closer to 0: weak. Closer to -1 or 1: strong.

Correlation Conditions
1. Quantitative Variable Condition: correlation applies to quantitative variables only
2. Linearity Condition: the association must be linear
3. Outlier Condition: no outliers. If a point follows the trend of the other points, it is not an outlier.

Slope: y-units per x-unit; tells how the response variable changes for a one-unit step in the predictor variable
Y-intercept: do not interpret unless a value of 0 for the explanatory variable would mean something
R^2: % of variability in the response variable explained by the explanatory variable
To find the correlation from R^2: take the square root of R^2, then attach the sign of the slope

Regression Line: ŷ = b + mx, where ŷ is the predicted value
The regression line is a straight line that describes how a response variable changes as an explanatory variable changes. It is the line for which the sum of the squared residuals is smallest.
Lurking Variable: not included in the study design, but has an effect on the variables studied and falsely suggests a relationship
Confounding: the effects of 2 variables on a response variable cannot be distinguished
Ecological Correlations: based on rates or averages; tend to overstate the strength of associations
Residuals: the vertical distance from each point to the least-squares regression line. The sum is always 0.
Residual = Observed - Predicted
Z-score = (value - mean) / SD

Correlation is not a synonym for association: correlation measures only linear association between quantitative variables.
A scatter plot NEVER says whether one variable causes another. It only shows whether or not two variables are correlated. Remember: correlation does not mean causation.
Predictive: the explanatory variable accounts for a % of the variation in the response
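The formulas in these notes (the six standard deviation steps, the fences, the z-score, the least-squares line, and residuals) can be checked on a small data set. Below is a minimal Python sketch, not a definitive implementation. The function names and the sample numbers are made up for illustration, and the quartiles come from Python's `statistics.quantiles` with its default method, which may differ slightly from the quartile rule in a given textbook.

```python
import math
import statistics


def std_dev(values):
    """Sample standard deviation, following the six steps in the notes."""
    mean = sum(values) / len(values)               # 1. find the mean
    deviations = [v - mean for v in values]        # 2. subtract the mean
    squared = [d ** 2 for d in deviations]         # 3. square the deviations
    total = sum(squared)                           # 4. add them up
    variance = total / (len(values) - 1)           # 5. divide by n - 1
    return math.sqrt(variance)                     # 6. take the square root


def z_score(value, values):
    """Z-score = (value - mean) / SD."""
    mean = sum(values) / len(values)
    return (value - mean) / std_dev(values)


def fences(values):
    """Lower and upper fences: Q1 - 1.5*IQR and Q3 + 1.5*IQR.

    Assumption: quartiles use statistics.quantiles' default method.
    """
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def outliers(values):
    """Values that fall outside the fences."""
    low, high = fences(values)
    return [v for v in values if v < low or v > high]


def least_squares(xs, ys):
    """Intercept b and slope m of the regression line y-hat = b + m*x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    b = mean_y - m * mean_x
    return b, m


def residuals(xs, ys):
    """Residual = observed - predicted, for every point."""
    b, m = least_squares(xs, ys)
    return [y - (b + m * x) for x, y in zip(xs, ys)]


def correlation(xs, ys):
    """Correlation r between two quantitative variables."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up example data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b, m = least_squares(xs, ys)
r = correlation(xs, ys)
print("regression line: y-hat =", b, "+", m, "* x")
print("sum of residuals:", sum(residuals(xs, ys)))
print("r:", r, " R^2:", r ** 2)
```

Running it on the sample points shows the residuals summing to zero (up to floating-point rounding), and squaring r gives the R^2 described in the notes.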