## TOPIC 1

Department of Statistics, Texas A&M University
STAT 211
Instructor: Keith Hatfield

### Topic 1: Data collection and summarization

- Populations and samples
- Frequency distributions
- Histograms
- Mean, median, variance, and standard deviation
- Quartiles, interquartile range
- Boxplots

### What is statistics?

What do you think of when you hear the word "statistics"?

- Sports
- Boring
- Not applicable to my field of study

Statistics is the science of collecting, classifying, and interpreting data (learning).

### Where will statistics be used?

- Everyday life, for example:
  - Proper application of general probabilities
  - How election results are presented
  - Commercial claims (clinical trials vs. outliers)
- Applications in the workplace:
  - Google web searches
  - Netflix user recommendations
  - Pharmaceutical drug development
  - Sports analytics
  - Modeling global climate change
  - Credit card fraud detection
  - Biomarkers and disease detection
  - Criminal justice
  - Many, many more

"Regardless of the field that you pursue, you've got to know statistics." (Deepak Kumar, LinkedIn Principal Data Scientist)

"Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write." (H. G. Wells, author of War of the Worlds)

"The best thing about being a statistician is that you get to play in everyone's backyard." (John Tukey, statistician)

"I keep saying that the sexy job in the next 10 years will be statisticians, and I'm not kidding." (Hal Varian, Google Chief Economist)

### Collecting data

- Observational study: Observe a group and measure quantities of interest. This is passive data collection, in that one does not attempt to influence the group. The purpose of the study is to describe the group.
- Experimental study: Deliberately impose treatments on groups in order to observe responses. The purpose is to study whether the treatments cause a change in the responses.

### Observational study terms

- Population: The entire group of interest.
- Sample: A part of the population selected to draw conclusions about the entire population.
- Census: A sample that attempts to include the entire population.
- Parameter: A concept that describes the population.
- Statistic: A number produced from a sample that estimates a population parameter.

### Horry County, SC murder case

Do juries properly represent the racial makeup of Horry County, which is 13% African American?

- What is the population parameter of interest?
- What sample statistic could be used to estimate the parameter, and does the sample support the claim?
- 295 jurors were summoned; 22 were African American.

### Experiment terms

- Experimental group: A collection of experimental units subjected to a difference in treatment imposed by the experimenter.
- Control group: A collection of experimental units subjected to the same conditions as those in an experimental group, except that no treatment is imposed. This design helps control for potential confounding effects.

### What are confounding effects?

Confounding occurs when you have multiple factors in a study and you can't tell which factor causes a change in the variable of interest.

- Example: Does going to church make you live longer? Not necessarily. There are too many other factors, or lurking variables (discussed later).
- It is best to set up the study with everything else held constant and only one factor changed. That way you're more apt to identify that the change in the variable is due to the change you instituted in the study.

### NCTR study (National Center for Toxicological Research)

- A large-scale study was conducted to see if a new drug might have potential toxic effects.
- They used rats for the experiment.
- Dose groups of 0, 100, 200, and 400 ppg were evaluated for liver tumors at the end of a two-week exposure to the drug.
- Which is the control, and which are the experimental groups?
- What comparisons would you want to make?
- Should you evaluate each group on consecutive days at the end of the study?

### Analyzing data with StatCrunch

- StatCrunch is a statistical software package that runs through a Web browser.
- You can access StatCrunch once you have registered and created an account. See the information tab in eCampus for details.
- There are no tutorials for StatCrunch, but demonstrations of how to perform basic tasks and tests will be done in class.
- Note that the homework uses StatCrunch. Several datasets will be given in the homework and in class examples. I don't advise using your calculator for this purpose, as it can be tedious and lead to input errors.

### All about variables

- Variable: Any characteristic or quantity to be measured on units in a study.
- Categorical variable: Places a unit into one of several categories. Examples: gender, race, political party.
- Quantitative variable: Takes on numerical values for which arithmetic makes sense. Examples: SAT score, number of siblings, cost of textbooks.
- Univariate data has one variable.
- Bivariate data has two variables.
- Multivariate data has three or more variables.

### Cereal data

- `mfr`: A = American Home, G = General Mills, K = Kelloggs, N = Nabisco, P = Post, Q = Quaker Oats, R = Ralston Purina
- `type`: cold or hot
- `calories`: calories per serving
- `protein`: grams of protein
- `fat`: grams of fat
- `sodium`: milligrams of sodium
- `fiber`: grams of dietary fiber
- `carbo`: grams of complex carbohydrates
- `sugars`: grams of sugars
- `potass`: milligrams of potassium
- `vitamins`: vitamins and minerals (0, 25, or 100, indicating the typical percentage of FDA recommended)
- `shelf`: display shelf (1, 2, or 3, counting from the floor)
- `weight`: weight in ounces of one serving
- `cups`: number of cups in one serving
- `rating`: a rating of the cereal

### Summarizing a single categorical variable

- Frequency: number of times the value occurs in the data.
- Relative frequency: proportion of the data with the value.

Cereal data:

| mfr | Frequency | Relative frequency |
|-----|-----------|--------------------|
| A   | 1         | 0.012987013        |
| G   | 22        | 0.2857143          |
| K   | 23        | 0.2987013          |
| N   | 6         | 0.077922076        |
| P   | 9         | 0.116883114        |
| Q   | 8         | 0.103896104        |
| R   | 8         | 0.103896104        |

### Analyzing a single quantitative variable

Consider the concentration data, which contains the concentration of suspended solids (in parts per million) at 50 locations along a river.

- What is a typical concentration? This is generally characterized by the center of the data.
- How much spread is there in the concentrations along the river? This is generally the relative width of the data: how dispersed they are around the center. Consider wide versus narrow, and the inherent good and bad things about spread.
- Discuss the difference in typical value and spread if measurements are taken at a single point on the river versus several points along the river.

### Histograms

Histogram: a bar graph of binned (or grouped) data, where the height of the bar above each bin denotes the frequency (or relative frequency) of values in the bin.

- Typical concentration?
- Spread?
- Roughly how many
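
To make the binning idea behind a histogram concrete, here is a minimal Python sketch. The course itself uses StatCrunch, so this is only an illustration. The function name `histogram_counts`, the bin settings, and the ten concentration values are all assumptions made up for the example; they are not the 50-location river dataset.

```python
def histogram_counts(values, start, width, n_bins):
    """Count the values falling in each bin [start + i*width, start + (i+1)*width)."""
    counts = [0] * n_bins
    for v in values:
        i = int((v - start) // width)  # index of the bin containing v
        if 0 <= i < n_bins:            # values outside the bin range are ignored
            counts[i] += 1
    return counts


# Hypothetical concentrations in ppm (made up for illustration, not the course data).
concentrations = [42, 47, 51, 53, 55, 58, 58, 61, 64, 73]

start, width, n_bins = 40, 10, 4
counts = histogram_counts(concentrations, start, width, n_bins)
rel_freqs = [c / len(concentrations) for c in counts]

# Print a text histogram: one row per bin, one '#' per observation.
for i, (c, r) in enumerate(zip(counts, rel_freqs)):
    lo = start + i * width
    print(f"[{lo}, {lo + width}) {'#' * c:<6} freq={c} rel={r:.2f}")
```

Reading the rows from top to bottom gives the same information as the bars of a histogram: the tallest row suggests a typical value, and the range of non-empty bins suggests the spread.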

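
The frequency and relative-frequency definitions from the categorical-variable slide can be sketched in a few lines of Python. The helper name `relative_frequencies` is an assumption for illustration. The manufacturer counts are copied from the cereal table, and the jury figures (22 of 295) come from the Horry County slide. The sample proportion is simply the relative frequency of one category.

```python
def relative_frequencies(counts):
    """Map each category to its share of the total count."""
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


# Manufacturer frequencies copied from the cereal table.
mfr_counts = {"A": 1, "G": 22, "K": 23, "N": 6, "P": 9, "Q": 8, "R": 8}
mfr_rel = relative_frequencies(mfr_counts)
for mfr, rel in mfr_rel.items():
    print(f"{mfr}: {mfr_counts[mfr]:>2}  {rel:.4f}")

# Horry County jury pool: the sample proportion is also a relative frequency.
jury_rel = relative_frequencies({"African American": 22, "other": 295 - 22})
sample_proportion = jury_rel["African American"]
print(f"sample proportion = {sample_proportion:.4f}")
```

The sample proportion is the statistic that estimates the population parameter (the county's 13%). Whether the gap between the two supports the claim is the question the slide leaves open.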
