Exploratory Data Analysis

Bret Larget
Department of Statistics
University of Wisconsin-Madison
September 8, 2003

Statistics 371, Fall 2003

Exploratory Data Analysis

Exploratory data analysis involves both graphical displays of data and numerical summaries of data. A common situation is for a data set to be represented as a matrix. There is a row for each unit and a column for each variable.

A unit is an object that can be measured, such as a person or a thing. A variable is a characteristic of a unit that can be assigned a number or a category.

In the data set I collected from the survey of students in the class, each student who responded to the survey is a unit. Variables include sex, major, year in school, miles from home, height, and blood type.

Variables

Variables are either quantitative or categorical.

In a categorical variable, the measurement for each unit is a category. For example, for the data I collected in the course survey, the variables blood type and sex are categorical. The variable year in school is an example of an ordinal categorical variable, because there is a meaningful ordering of the levels of the variable.

Quantitative variables record a number for each unit. The variable height is an example of a continuous variable, because it could take on a continuum of possible values. The variable number of sisters is discrete, because we can list the possible values.

Often, continuous variables are rounded to a discrete set of values. Most people round their height to the nearest inch or half inch.

We can also make a categorical variable from a continuous variable by dividing the range of the variable into classes. For example, height could be categorized as short, average, or tall.

Identifying the types of variables can be important, because some methods of statistical analysis are appropriate only for a specific type of variable.

Samples

A sample is a collection of units on which we have
measured one or more variables. The number of observations in a sample is called the sample size. Common notation for the sample size is n.

The textbook adopts the convention of using uppercase letters for variables and lowercase letters for observed values.

Summaries of Categorical Variables

A frequency distribution is a list of the observed categories and a count of the number of observations in each. A frequency distribution may be displayed with a table or with a bar chart.

For ordinal categorical variables, it is conventional to order the categories in the display (table or bar chart) in the meaningful order. For non-ordinal variables, two conventional choices are alphabetical order and order by size of the counts.

The vertical axis of a bar chart may show frequency or relative frequency. It is conventional to leave space between the bars of a bar chart of a categorical variable.

Summary of Blood Type Data

Here is a frequency table.

> summary(BloodType)
   A   AB    B    O NA's
  25   12    3   22   37

Here is a bar chart.

> barplot(summary(BloodType))

[Bar chart of blood type counts for the categories A, AB, B, O, and NA's.]

Summary of Majors

> cbind(summary(Major))
                                    [,1]
Animal Science                         2
Bacteriology                           1
Biochemistry                           3
Biological Aspects of Conservation     2
Biology                               19
Biology Zoology                        1
Biomedical Engineering                 9
Botany                                 1
Dairy Science                          1
Forestry                               1
Genetics                              31
Graduate student in Bacteriology       1
Kinesiology                            4
Medical MicroBiology and Immunology    1
Molecular Biology                      1
Nursing                                1
Plant Pathology                        1
Possibly Kinesiology                   1
Undecided                              1
Wildlife Ecology                       5
Wildlife Ecology Natural Resources     1
Zoology                               11

Summary of Second Majors

> cbind(summary(Major2))
                                    [,1]
(blank)                               69
Anthropology                           1
Bacteriology                           4
Biochemistry                           2
Biological Aspects of Conservation     2
Biology                                2
Dietetics                              1
Economics                              1
Entomology                             1
French                                 1
Genetics                               1
Horticulture                           1
IES                                    2
Life Science Communication             1
Math                                   2
Physics                                1
Plant Pathology                        1
Political Science                      1
Psychology                             1
Scandinavian Studies                   1
Spanish                                1
Wildlife Ecology                       2

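The frequency and relative frequency summaries described for categorical variables can be sketched in R roughly as follows. This is only an illustration: the raw survey data are not shown, so the vector blood_type is rebuilt here from the counts in the blood type table, and the names blood_type, freq, and rel_freq are hypothetical. Computing relative frequencies among only the students who reported a blood type is one possible choice, not necessarily the one used in the course.

```r
# Rebuild a blood type vector from the counts in the frequency table
# (an assumption: the original survey responses are not available here).
blood_type <- factor(rep(c("A", "AB", "B", "O", NA),
                         times = c(25, 12, 3, 22, 37)))

# Frequency distribution, keeping missing values as their own category.
freq <- table(blood_type, useNA = "ifany")

# Relative frequencies among students who reported a blood type.
rel_freq <- prop.table(table(blood_type))

print(freq)
print(round(rel_freq, 3))

# Bar chart with relative frequency on the vertical axis.
# barplot() leaves space between bars by default.
barplot(rel_freq, xlab = "Blood type", ylab = "Relative frequency")
```

Passing freq instead of rel_freq to barplot() would put frequency on the vertical axis; the shape of the chart is the same either way when the same categories are shown.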
Summaries of Quantitative Variables

Quantitative variables from very small samples can be displayed with a dotplot. Histograms are a more general tool for displaying the distribution of quantitative variables.

A histogram is a bar graph of counts of observations in each class, but no space is drawn between classes. If classes are of different widths, the bars should be drawn so that areas are proportional to frequencies.

Selection of classes is arbitrary. Different choices can lead to different pictures. Too few classes over-summarize the data. Too many classes can cloud important features of the data with noise.

A Dotplot of Hours of Sleep

> source("R/dotplot.R")
> dotplot(Sleep)

[Dotplot of Sleep; the axis runs from 5 to 9.]

Summary of Miles from CSSC

> hist(MilesCSSC, col = 3)

[Histogram of MilesCSSC; horizontal axis MilesCSSC from 0 to 40, vertical axis Frequency from 0 to 80.]

Summary of Miles from Home

> hist(MilesHome, col = 3)

[Histogram of MilesHome; horizontal axis MilesHome from 0 to 20000, vertical axis Frequency from 0 to 80.]

Summary of Miles from Home for Students within 150 miles

> hist(MilesHome[MilesHome < 150], col = 3)

[Histogram of MilesHome[MilesHome < 150]; horizontal axis from 0 to 150, vertical axis Frequency from 0 to 14.]

Summary of Height

> par(mfrow = c(1, 2))
> hist(Height[keep], col = 2)
> hist(Height[keep], breaks = breaks, col = 3)

[Two side-by-side histograms, each titled "Histogram of Height[keep]"; horizontal axes Height[keep] from 60 to 75, vertical axes Frequency.]

Stem and Leaf Diagrams

Stem and leaf diagrams are useful for showing the shape of the distribution of small data sets while losing little or no information.

Begin by rounding all data to the same precision. The last digit is the leaf. Anything before the last digit is the stem.

In a stem and leaf diagram, each observation is represented by a single digit to the right of a line. Stems are shown only once. Show stems to fill gaps.

Combining or splitting
stems can lead to a better picture of the distribution.

Stem and Leaf Diagram of Brothers and Sisters

> stem(Brothers, scale = 0.5)

  The decimal point is at the |

  0 | 00000000000000000000000000000000
  1 | 000000000000000000000000000000000000000000
  2 | 0000000000000000
  3 | 0000000
  4 | 00

> stem(Sisters, scale = 0.5)

  The decimal point is at the |

  0 | 000000000000000000000000000000000000000000
  1 | 0000000000000000000000000000000000000000
  2 | 000000000
  3 | 00000
  4 | 000

Stem and Leaf of
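The stem and leaf construction described above (round to a common precision, then show stems and leaves, possibly splitting stems) can be tried out with a small sketch like the one below. The heights are made-up values for illustration and are not from the course survey, and the names height and height_rounded are hypothetical. The scale argument of stem() controls the length of the display, so a larger value generally spreads the stems over more lines.

```r
# Hypothetical heights in inches (illustrative values, not survey data).
height <- c(61.2, 63.5, 64.1, 64.8, 66.0, 66.3,
            67.4, 68.9, 70.2, 71.6, 72.5, 74.0)

# Round all data to the same precision before building the display.
height_rounded <- round(height)

# Default stem and leaf display.
stem(height_rounded)

# A larger scale value lengthens the display, which generally splits stems.
stem(height_rounded, scale = 2)
```

Comparing the two displays side by side is a quick way to judge whether combining or splitting stems gives the clearer picture for a given data set.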