The Who, What, Why, Where, When, and How of Data

Statistics consists of two parts

Descriptive statistics: coping with lots of numbers.
1. Draw a picture (graphs, charts, etc.).
2. Calculate a few numbers which summarize the data (mean, median, percentile).

Inferential statistics: how can one make decisions and predictions about a population even if we have data for relatively few subjects from that population? We need to generalize the facts we learn from a sample (i.e., a part of the population) to the entire population.

Data types

Let's consider the following questions:
1. What is your sex?
2. How tall are you (inches)?
3. What year are you in school?
4. What is your major?
6. How many miles do you travel to UMD each day?
7. What is your GPA?

Variables

Each question measures some aspect of you.
- Variable: the aspect (characteristic) that differs from subject to subject (individual to individual). Examples: Age, Sex, Major.
- Data: the values of the variables. Examples: 20, Male, English.

Two types of variables

- Quantitative (numerical) variables: numbers, measurements. Examples: age, height, miles traveled.
- Qualitative (categorical) variables: classify each observation. Examples: sex, year in school, major.

Quantitative (numerical) variables

- Discrete variables: there is a natural gap between the values. Examples: number of children, number of credit cards.
- Continuous variables: the values can be arbitrarily close together. Examples: weight, height, age.

Qualitative (categorical) variables

- Ordinal variables: categories that have a natural ordering. Numbers could be assigned to the categories.
  Class: 1 = Freshman, 2 = Sophomore, 3 = Junior, 4 = Senior
  Grade: A, B, C, D, F (GPA)
  Preference: Strongly Agree, Agree, Disagree, Strongly Disagree
- Nominal variables: categories that have no natural ordering.
  Major: business, mathematics, history
  Eye color: blue, green, black

Types of variables: summary

Variables
- Qualitative: nominal, ordinal
- Quantitative: discrete, continuous

Examples: what are the types?

1. Appraisal of a company's inventory level (excellent, good, fair, poor): qualitative, ordinal
2. Mode of transportation to work (automobile, bicycle, bus, subway, walk): qualitative, nominal
3. Speed of a vehicle: quantitative, continuous
4. The number of persons in each family: quantitative, discrete
5. The diagnostic test for pneumonia (symptoms present, symptoms absent): qualitative, nominal

Interval data: no meaningful zero point, so values can't be multiplied or divided, but the difference between two values is meaningful. Example: temperature in degrees Celsius or Fahrenheit.

Ratio data: meaningful zero point, so values can be multiplied and divided. Examples: income, weight, height.

Time series data: ordered data values over time.

Cross-sectional data: data values observed at a single point in time.

Sales (in 10,000s)

             2003  2004  2005  2006
New York      435   460   475   490
Dallas        320   345   375   395
Seattle       405   390   410   395
Orlando       260   270   285   280

A single row (one city across 2003 to 2006) is time series data. A single column (all four cities in one year) is cross-sectional data.

Surveys and Sampling

Key ideas

Idea 1: Examine a part of the whole

The first idea is to draw a sample.
- Goal: learn about an entire population of individuals, when examining all of them is not feasible.
- Examine a smaller group of individuals, called a sample, chosen from the population.
- Samples that over- or underemphasize some characteristics of the population are said to be biased.

Bias

The sample doesn't represent the population. Generalizations are no longer valid, and conclusions may no longer be true. Trouble!

Sources of bias

- Selection bias: a problem in the sampling scheme, namely a systematic tendency to exclude one kind of individual from the survey. It shows up as a difference between the population of interest and the effective population.
- Non-response bias: subjects don't answer, or skip questions.
- Response bias: subjects lie; interviewer effect.

Telephone poll bias

- Selection bias: cell phones, multiple phones, who answers the phone.
- Non-response bias: answering machines, social life.
- Response bias.

Why those internet polls are worthless

- Self-selected sample.
- The more passionate are more likely to respond.
- Minority opinion, more passion: the result can be the opposite of the truth.
- This is selection bias.

Idea 2: Randomize

- Randomization can protect you against factors that you know are in the data. It can also help protect against factors you are not even aware of.
- Randomization gets rid of biases.
- Randomizing makes sure that, on average, the sample looks like the rest of the population.
- Sample-to-sample differences are referred to as sampling error.

Idea 3: The sample size is what matters

How large a random sample do we need for the sample to be reasonably representative of the population?
- It is the size of the sample, not the size of the population, that makes the difference in sampling.
- Exception: if the population is small enough that the sample is more than 10% of the whole population, the population size can matter.
- Otherwise, the fraction of the population that you have sampled does not matter. It is the sample size itself that is important.

Population vs. sample

- Population: the entire group of individuals in which we are interested but can't usually assess directly. Examples: all voters in the US, Visa card holders in D.C., all packages at a UPS center.
- Sample: the part of the population we actually examine and for which we do have data. How well the sample represents the population depends on the sample design.

A parameter is a number describing a characteristic of the population.
A statistic is a number describing a characteristic of a sample.

Sampling techniques

- Nonstatistical sampling: convenience, voluntary
- Statistical sampling: simple random, systematic, stratified, cluster

Nonstatistical sampling

- Convenience: collected in the most convenient manner for the researcher (ask whoever is around). Bias: opinions are limited to the individuals present.
- Voluntary: individuals choose to be involved. These samples are very susceptible to being biased because different people are motivated to respond or not. Often called "public opinion polls", these are not considered valid or scientific. Bias: the sample design systematically favors a particular outcome.

Statistical sampling (probability sampling)

Individuals in the sample are chosen based on known or calculable probabilities. The methods are simple random, stratified, systematic, and cluster sampling.

Simple random sampling (SRS)

- Every possible sample of a given size has an equal chance of being selected.
- The simplest way to obtain such a sample is to draw names out of a hat.
- The sample can also be obtained using a table of random numbers or a computer random number generator.

Sampling frame

- A list of the population. Examples: phone book, registered voter list, membership lists.
- The frame is the effective population: an SRS picks equally from the whole frame.

Stratified random sampling

- Divide the population into subgroups (called strata) according to some common characteristic (e.g., gender, income level).
- Select a simple random sample from each stratum.
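
The two probability sampling schemes described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a method taken from the course: the sampling frame is assumed to be available as an in-memory list, and the names `simple_random_sample`, `stratified_sample`, `stratum_of`, and the 60/40 student frame are all hypothetical. The standard library's `random.Random.sample` plays the role of drawing names out of a hat.

```python
import random
from collections import defaultdict


def simple_random_sample(frame, n, seed=None):
    """Draw n distinct units; every subset of size n is equally likely."""
    rng = random.Random(seed)
    return rng.sample(frame, n)


def stratified_sample(frame, stratum_of, n_per_stratum, seed=None):
    """Group the frame into strata, then draw an SRS from each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in frame:
        # stratum_of is a hypothetical helper that returns the unit's stratum label
        strata[stratum_of(unit)].append(unit)
    return {
        label: rng.sample(units, n_per_stratum)
        for label, units in strata.items()
    }


# Hypothetical sampling frame: 100 students, 60 labeled "F" and 40 labeled "M".
frame = [("F", i) for i in range(60)] + [("M", i) for i in range(60, 100)]

srs = simple_random_sample(frame, 10, seed=1)
by_sex = stratified_sample(frame, lambda unit: unit[0], 5, seed=1)

print(srs)
print(by_sex)
```

A simple random sample of 10 may by chance contain very few units from one group, whereas the stratified version guarantees 5 from each. Fixing `seed` only makes the illustration repeatable; in a real survey the draw would not be chosen to produce a particular result.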