Exploratory Data Analysis

Bret Larget
Department of Statistics
University of Wisconsin-Madison
September 8, 2003
Statistics 371, Fall 2003

Exploratory Data Analysis

Exploratory Data Analysis involves both graphical displays of data and numerical summaries of data. A common situation is for a data set to be represented as a matrix. There is a row for each unit and a column for each variable. A unit is an object that can be measured, such as a person or a thing. A variable is a characteristic of a unit that can be assigned a number or a category. In the data set I collected from the survey of students in the class, each student who responded to the survey is a unit. Variables include sex, major, year in school, miles from home, height, and blood type.

Variables

Variables are either quantitative or categorical. In a categorical variable, the measurement for each unit is a category. For example, for the data I collected in the course survey, the variables blood type and sex are categorical. The variable year in school is an example of an ordinal categorical variable, because there is a meaningful ordering of the levels of the variable.

Quantitative variables record a number for each unit. The variable height is an example of a continuous variable, because it could take on a continuum of possible values. The variable number of sisters is discrete, because we can list the possible values. Often, continuous variables are rounded to a discrete set of values. Most people round their height to the nearest inch or half inch.

We can also make a categorical variable from a continuous variable by dividing the range of the variable into classes. So, for example, height could be categorized as short, average, or tall. Identifying the types of variables can be important, because some methods of statistical analysis are appropriate only for a specific type of variable.

Samples

A sample is a collection of units on which we have measured one or more variables. The number of observations in a sample is called the sample size. Common notation for the sample size is n. The textbook adopts the convention of using uppercase letters for variables and lowercase letters for observed values.

Summaries of Categorical Variables

A frequency distribution is a list of the observed categories and a count of the number of observations in each. A frequency distribution may be displayed with a table or with a bar chart. For ordinal categorical random variables, it is conventional to order the categories in the display (table or bar chart) in the meaningful order. For non-ordinal variables, two conventional choices are alphabetical and by size of the counts. The vertical axis of a bar chart may show frequency or relative frequency. It is conventional to leave space between bars of a bar chart of a categorical variable.

Summary of Blood Type Data

Here is a frequency table.

    summary(BloodType)
       A   AB    B    O NA's
      25   12    3   22   37

Here is a bar chart.

    barplot(summary(BloodType))

[Figure: bar chart of blood type counts for the categories A, AB, B, O, and NA's.]

Summary of Majors

    cbind(summary(Major))
                                        [,1]
    Animal Science                         2
    Bacteriology                           1
    Biochemistry                           3
    Biological Aspects of Conservation     2
    Biology                               19
    Biology Zoology                        1
    Biomedical Engineering                 9
    Botany                                 1
    Dairy Science                          1
    Forestry                               1
    Genetics                              31
    Graduate student in Bacteriology       1
    Kinesiology                            4
    Medical MicroBiology and Immunology    1
    Molecular Biology                      1
    Nursing                                1
    Plant Pathology                        1
    Possibly Kinesiology                   1
    Undecided                              1
    Wildlife Ecology                       5
    Wildlife Ecology Natural Resources     1
    Zoology                               11

Summary of Second Majors

    cbind(summary(Major2))
                                        [,1]
                                          69
    Anthropology                           1
    Bacteriology                           4
    Biochemistry                           2
    Biological Aspects of Conservation     2
    Biology                                2
    Dietetics                              1
    Economics                              1
    Entomology                             1
    French                                 1
    Genetics                               1
    Horticulture                           1
    IES                                    2
    Life Science Communication             1
    Math                                   2
    Physics                                1
    Plant Pathology                        1
    Political Science                      1
    Psychology                             1
    Scandinavian Studies                   1
    Spanish                                1
    Wildlife Ecology                       2

Summaries of Quantitative Variables

Quantitative variables from very small samples can be displayed with a dotplot. Histograms are a more general tool for displaying the distribution of quantitative variables. A histogram is a bar graph of counts of observations in each class, but no space is drawn between classes. If classes are of different widths, the bars should be drawn so that areas are proportional to frequencies. Selection of classes is arbitrary. Different choices can lead to different pictures. Too few classes is an over-summary of the data. Too many classes can cloud important features of the data with noise.

A Dotplot of Hours of Sleep

    source("R/dotplot.R")
    dotplot(Sleep)

[Figure: dotplot of Sleep, with axis ticks at 5, 6, 7, 8, and 9.]

Summary of Miles from CSSC

    hist(MilesCSSC, col=3)

[Figure: "Histogram of MilesCSSC"; horizontal axis MilesCSSC from 0 to 40, vertical axis Frequency.]

Summary of Miles from Home

    hist(MilesHome, col=3)

[Figure: "Histogram of MilesHome"; horizontal axis MilesHome from 0 to 20000, vertical axis Frequency.]

Summary of Miles from Home for Students within 150 miles

    hist(MilesHome[MilesHome < 150], col=3)

[Figure: "Histogram of MilesHome[MilesHome < 150]"; horizontal axis from 0 to 150, vertical axis Frequency.]

Summary of Height

    par(mfrow=c(1,2))
    hist(Height[keep], col=2)
    hist(Height[keep], breaks=breaks, col=3)

[Figure: two panels, each titled "Histogram of Height[keep]"; horizontal axis Height[keep] from 60 to 75, vertical axis Frequency.]

Stem and Leaf Diagrams

Stem-and-leaf diagrams are useful for showing the shape of the distribution of small data sets without losing any (or much) information. Begin by rounding all data to the same precision. The last digit is the leaf. Anything before the last digit is the stem. In a stem-and-leaf diagram, each observation is represented by a single digit to the right of a line. Stems are shown only once. Show stems to fill gaps. Combining or splitting stems can lead to a better picture of the distribution.
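The remarks about class selection, unequal class widths, and stem-and-leaf diagrams can be tried out with a short R sketch. The `height` vector below is hypothetical (made-up values, not the course survey data), and the break points are arbitrary choices for illustration. `hist()` and `stem()` are base R functions; `plot = FALSE` asks `hist()` to return the class counts without drawing.

```r
# Hypothetical heights in inches (illustrative values, not the course survey data).
height <- c(60, 62, 63, 63, 64, 64, 65, 65, 65, 66, 66, 67, 67,
            67, 68, 68, 69, 70, 70, 71, 72, 72, 73, 74, 76)

# Same data, two choices of equal-width classes.
coarse <- hist(height, breaks = seq(58, 78, by = 5), plot = FALSE)
fine   <- hist(height, breaks = seq(58, 78, by = 1), plot = FALSE)

# Each observation lands in exactly one class, so both sets of counts sum to n.
n <- length(height)
total_coarse <- sum(coarse$counts)
total_fine   <- sum(fine$counts)

# Classes of different widths: the density component is the bar height that
# makes each bar's area equal to the relative frequency of its class.
uneven <- hist(height, breaks = c(58, 64, 66, 68, 78), plot = FALSE)
bar_areas <- uneven$density * diff(uneven$breaks)

print(coarse$counts)
print(fine$counts)
print(bar_areas)   # relative frequencies of the four classes

# Stem-and-leaf displays of the same values; a larger scale splits the stems.
stem(height)
stem(height, scale = 2)
```

Dropping `plot = FALSE` draws the histograms, so the coarse and fine versions can be compared side by side with `par(mfrow=c(1,2))`.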
Stem and Leaf of Miles from CSSC

[Stem-and-leaf display of MilesCSSC. The decimal point is at the |. Stems run from 0 to 40 in steps of 2.]

Measures of Center

There are two common measures of center for quantitative variables, the mean and the median. For sample data $y_1, y_2, \ldots, y_n$, the sample mean is

\[ \bar{y} = \frac{\sum_{i=1}^{n} y_i}{n} \]

The sample