Statistics

Two types:
1. Descriptive statistics: coping with lots of numbers, i.e.
   A. Draw a picture (graphs, charts, etc.)
   B. Calculate a few numbers which summarize the data (mean, median, percentile)
2. Inferential statistics: generalizing from a sample (part of the population) to the entire population

Variables: the aspect or characteristic that differs from subject to subject (individual to individual), e.g. Age, Sex, Major
Data: the values of the variables, e.g. 20, Male, English
Time Series Data: data over time for one subject
Cross-Sectional Data: data for multiple subjects at one point in time
Interval data: no meaningful zero point; can't multiply or divide, but the difference between two values is meaningful (temperature in degrees C or F)
Ratio data: meaningful zero point; can multiply and divide (income, weight, or height)

Surveying / Sampling

Sample: the part of the population we actually examine and for which we do have data
   vs.
Population: the entire group of individuals in which we are interested but can't usually assess directly

Statistic: a number describing a characteristic of a sample
   vs.
Parameter: a number describing a characteristic of the population

Bias (BAD): whenever a sample under- or over-emphasizes certain things in a population (ex: selection, non-response, response)
   Non-response: subjects don't answer or skip questions
   Selection: problem in the selection scheme; systematic tendency to exclude individuals from the survey
   Response: subjects lie; interviewer effect (asking questions differently to different people)
Sample-to-sample variation is called sampling error (GOOD)
Sample Frame: list of the effective population

Categorical (Qualitative) Data

Two-way (contingency) tables: measured in counts or percentages
Graphical methods: bar graphs, pie charts (use the Area Principle)
Marginal distribution: row/column percentages or counts, using one variable and ignoring the other
Conditional distribution: how does X (explanatory) influence Y (response)? Only take one column or one row.
   If the conditional distributions are identical or close, the variables are independent; if they are significantly different, the variables are dependent.

Simpson's Paradox:
An association or comparison that holds for all of several groups can reverse direction when the data are combined (aggregated) to form a single group.

Numerical Data (continuous data)

Graphical methods: histograms (look at symmetry, peaks, spread, outliers; uses frequency), stem-and-leaf plots, and box plots
Numerical methods: 5-number summary (Max, Q3, Median, Q1, Min); there is also the Mean
Measures of spread: Range, $IQR = Q_3 - Q_1$, Standard Deviation
   EXCLUDE the median in calculations of Q1 and Q3
Z-score: $z = \frac{\text{value} - \text{mean}}{\text{st. dev.}}$
Fences: $Q_1 - 1.5 \times IQR$ and $Q_3 + 1.5 \times IQR$
USE Mean and S.D. for symmetric data
USE Median and IQR for skewed data

Correlation and Regression

Scatter plots, LOOK FOR: form (linear), strength of association, direction (pos. or neg.), outliers
r = correlation coefficient: $r = \frac{\sum z_x z_y}{n - 1}$
Properties of r:
   $-1 \le r \le 1$
   $r = 0$: no LINEAR relationship
   r measures the strength of linear association
   $r = \pm 1$: the points are on a single line
   r is sensitive to outliers
   r doesn't change if you change the scale
   CORRELATION doesn't mean CAUSATION

Regression (a.k.a. line of averages)
   The regression line minimizes the sum of the squared residuals
   ALWAYS goes through $(\bar{x}, \bar{y})$
   $b_0 = \bar{y} - b_1 \bar{x}$
   $b_1 = r \frac{S_y}{S_x}$
   $\frac{\hat{y} - \bar{y}}{S_y} = r \frac{x - \bar{x}}{S_x}$, i.e. $\hat{z}_y = r z_x$
   CHECK: Linearity, Independence; watch for outliers and equal spread

The correlation is a measure of spread (scatter) in both the x and y directions in the linear relationship
   VERSUS
In regression we examine the variation in the response variable (y) given a change in the explanatory variable (x)

Extrapolation: the use of a regression line for predictions outside the range of x values used to obtain the line (BAD)
$r^2$: represents the percentage of the variance in y (vertical scatter from the regression line) that can be explained by changes in x

Residuals: the vertical distances from each point to the least-squares regression line give us potentially useful information about the contribution of individual data points to the overall pattern of scatter
   ALWAYS sum up to 0
   The standard deviation of the residuals, $S_e$, is related to SD(y) by $S_e = \sqrt{1 - r^2} \times SD(y)$
   Residual = Observed - Predicted
   Negative value is an overestimate; positive value is an underestimate
Residual plots:
   Randomly scattered: GOOD
   Curved pattern: relationship is NOT linear (BAD)
   Fanning-out pattern: change in variability
Beware of lurking variables: variables not included in the study design that do have an effect on the variables studied

Randomness and Probability

Probability models have two parts:
1. S = sample space: the set or list of all possible outcomes of a random process. An event is a subset of the sample space.
2. A probability for each possible event in the sample space S.

When all outcomes are equally likely: $P(A) = \frac{\text{size of event } A}{\text{size of the sample space } S} = \frac{\#\text{ outcomes in } A}{\text{total } \#\text{ outcomes}}$

The Addition (OR) Rule, for mutually exclusive events A, B: $P(A \text{ or } B) = P(A) + P(B)$
General Addition Rule, for any events A, B: $P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B)$
Complement Rule: $P(A^c) = 1 - P(A)$
Multiplication (AND) Rule: $P(A \text{ and } B) = P(A) \times P(B)$ if and only if A and B are independent
General Multiplication (AND) Rule: $P(A \text{ and } B) = P(A) \times P(B \mid A)$
Checking independence:
   If $P(A) \times P(B) = P(A \text{ and } B)$, A and B are independent
   If $P(A) \times P(B) \ne P(A \text{ and } B)$, A and B are NOT independent
Sequence of independent events: if $A_1, A_2, \ldots, A_N$ are independent, $P(A_1 \text{ and } A_2 \text{ and } \ldots \text{ and } A_N) = P(A_1) \times P(A_2) \times \cdots \times P(A_N)$
Dependence: A says something about B
   $P(B \mid A)$ = conditional probability of B given A $= \frac{P(A \text{ and } B)}{P(A)}$
   If A and B are independent, $P(B \mid A) = P(B)$
Subjective probability: an opinion or judgement by a decision maker about the likelihood of an event, based on their expertise
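Worked example (numerical summaries): a minimal Python sketch of the 5-number summary, the 1.5 x IQR fences, and the z-score from the Numerical Data notes. The data values and the function names (`five_number_summary`, `fences`, `z_score`) are made up for illustration. The quartiles follow the "exclude the median" convention stated in the notes; other textbooks and software use slightly different quartile rules.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max).

    Q1 and Q3 are the medians of the lower and upper halves; when n is odd
    the overall median is excluded from both halves. Needs at least 2 values.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    # Skip the middle value when n is odd so the median is in neither half.
    upper = data[half + 1:] if n % 2 else data[half:]
    return (data[0], statistics.median(lower), statistics.median(data),
            statistics.median(upper), data[-1])


def fences(q1, q3):
    """Lower and upper fences: Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def z_score(value, mean, st_dev):
    """z = (value - mean) / st. dev."""
    return (value - mean) / st_dev


# Hypothetical, right-skewed sample (not from the notes).
sample = [2, 4, 4, 5, 7, 8, 9, 11, 30]
minimum, q1, median, q3, maximum = five_number_summary(sample)
low_fence, high_fence = fences(q1, q3)
outliers = [v for v in sample if v < low_fence or v > high_fence]

print("5-number summary:", minimum, q1, median, q3, maximum)
print("fences:", low_fence, high_fence)
print("outliers:", outliers)
```

For this sample Q1 = 4 and Q3 = 10, so IQR = 6 and the fences are -5 and 19; the value 30 falls outside the upper fence. Because the sample is skewed, the notes' advice would be to summarize it with the median and IQR and not the mean and S.D.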
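Worked example (correlation and regression): a small Python sketch that checks the regression formulas from the notes numerically. The five (x, y) points are hypothetical, and the standard deviations use the n - 1 divisor, which is an assumption about what the notes intend; some textbooks define the residual standard deviation with n - 2, in which case the $S_e$ relation holds only approximately.

```python
import math
import statistics

# Hypothetical data, chosen only to illustrate the formulas.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
x_bar, y_bar = statistics.mean(x), statistics.mean(y)
s_x, s_y = statistics.stdev(x), statistics.stdev(y)  # sample SDs (n - 1 divisor)

# r = sum(z_x * z_y) / (n - 1)
z_x = [(xi - x_bar) / s_x for xi in x]
z_y = [(yi - y_bar) / s_y for yi in y]
r = sum(zx * zy for zx, zy in zip(z_x, z_y)) / (n - 1)

# Slope and intercept: b1 = r * Sy / Sx, b0 = y_bar - b1 * x_bar
b1 = r * s_y / s_x
b0 = y_bar - b1 * x_bar

# Residual = Observed - Predicted
predicted = [b0 + b1 * xi for xi in x]
residuals = [yi - pi for yi, pi in zip(y, predicted)]

# SD of the residuals, computed with the same n - 1 divisor as s_y.
s_e = statistics.stdev(residuals)

print(f"r = {r:.4f}, r^2 = {r * r:.4f}")
print(f"regression line: y-hat = {b0:.2f} + {b1:.2f} x")
print(f"sum of residuals = {sum(residuals):.2e}")
print(f"S_e = {s_e:.4f}, sqrt(1 - r^2) * SD(y) = {math.sqrt(1 - r * r) * s_y:.4f}")
```

With these points the line is y-hat = 2.2 + 0.6 x and $r^2$ = 0.6. The residuals sum to zero up to floating-point rounding, and the line passes through $(\bar{x}, \bar{y})$ = (3, 4), as the notes state.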
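Worked example (probability rules): a short Python sketch that verifies the addition, complement, multiplication, and conditional probability rules above on one roll of a fair six-sided die. The die, the events A, B, C, and the helper names `prob` and `cond_prob` are illustrative assumptions, and the counting formula applies only because all outcomes are equally likely.

```python
from fractions import Fraction

# Sample space S: one roll of a fair six-sided die (illustrative assumption).
S = frozenset(range(1, 7))


def prob(event):
    """P(A) = (# outcomes in A) / (total # outcomes), equally likely outcomes."""
    return Fraction(len(event & S), len(S))


def cond_prob(b, a):
    """P(B | A) = P(A and B) / P(A); requires P(A) > 0."""
    return prob(a & b) / prob(a)


A = frozenset({2, 4, 6})  # roll is even
B = frozenset({1, 2})     # roll is 1 or 2
C = frozenset({1, 3, 5})  # roll is odd (mutually exclusive with A)

# General addition rule: P(A or B) = P(A) + P(B) - P(A and B)
p_a_or_b = prob(A) + prob(B) - prob(A & B)
assert p_a_or_b == prob(A | B)

# Addition rule for mutually exclusive events: P(A or C) = P(A) + P(C)
assert prob(A | C) == prob(A) + prob(C)

# Complement rule: P(A^c) = 1 - P(A)
assert prob(S - A) == 1 - prob(A)

# General multiplication rule: P(A and B) = P(A) * P(B | A)
assert prob(A & B) == prob(A) * cond_prob(B, A)

# Checking independence: compare P(A) * P(B) with P(A and B)
independent_ab = prob(A) * prob(B) == prob(A & B)
independent_ac = prob(A) * prob(C) == prob(A & C)

print("P(A or B) =", p_a_or_b)
print("P(B | A)  =", cond_prob(B, A), " P(B) =", prob(B))
print("A, B independent:", independent_ab)
print("A, C independent:", independent_ac)
```

Here A and B are independent, since P(A) x P(B) = 1/2 x 1/3 = 1/6 = P(A and B), and accordingly P(B | A) = P(B) = 1/3. A and C are mutually exclusive but not independent: knowing the roll is even rules out odd entirely.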