Exploratory Data Analysis
Bret Larget
Departments of Botany and of Statistics
University of Wisconsin-Madison
Statistics 371
13th September 2005

Student Data
Data is often represented with a matrix.

Sex     Level   Brothers  Sisters  MilesHome
Female  Fifth   0         0         102.7
Female  Fourth  2         0          51.3
Female  Fourth  1         0        1023.5
Male    Second  0         2         130.1
Male    Fourth  0         0         280.4
Male    Second  0         1         152.4
Female  Fourth  0         1         162.7
Female  Third   0         1         123.4
Female  Second  1         2           5.2
Male    Fourth  1         0         210.8

Units
- A unit is an object that can be measured, such as a person.
- All data on a single unit appears in a row.

Variables
- A variable is a characteristic of a unit that can be assigned a number or a category.
- There is a column for each variable.
- Variables are either quantitative (Brothers, Sisters, MilesHome) or categorical (Sex, Level).

Categorical Variables
- In a categorical variable, measurements are categories.
- Examples include blood type and sex.
- The variable year in school is an example of an ordinal categorical variable, because the levels are ordered.

Quantitative Variables
- Quantitative variables record a number for each unit.
- Examples include height, which is continuous, and number of sisters, which is discrete.
- Often
continuous variables are rounded to a discrete set of values, such as heights to the nearest inch or half inch.
- We can also make a categorical variable from a continuous variable by dividing the range of the variable into classes. So, for example, height could be categorized as short, average, or tall.
- Identifying the types of variables can be important because some methods of statistical analysis are appropriate only for a specific type of variable.

Samples
- A sample is a collection of units on which we have measured one or more variables.
- The number of observations in a sample is called the sample size.
- Common notation for the sample size is n.
- We typically use uppercase letters for variables and lowercase letters for observed values.

Summaries of Categorical Variables
- A frequency distribution is a list of the observed categories and a count of the number of observations in each.
- A frequency distribution may be displayed with a table or with a bar chart.
- For ordinal categorical random variables, it is conventional to order the categories in the display (table or bar chart) in the meaningful order.
- For non-ordinal variables, two conventional choices are alphabetical and by size of the counts.
- The vertical axis of a bar chart may show frequency or relative frequency.
- It is conventional to leave space between bars of a bar chart of a categorical variable.

Summary of Blood Type Data
Frequency table:

Type  Count
A     22
AB     6
B      9
O     21
NA's  28

National averages (%): O 46, A 40, B 10, AB 4

[Figure: bar chart of the blood type counts; categories A, AB, B, O, NA's]

Summary of Majors
Majors: Anthropology, Bacteriology, Biochemistry, Biological Aspects of Conservation, Biology, Biomedical Engineering, Botany, Dairy Science, Genetics, Italian, Kinesiology, Medical Microbiology and Immunology, Nutritional Sciences, Political Science, Soil Science, Undecided, Wildlife Ecology, Wildlife Ecology Natural Resources, Zoology, sociology
Counts: 1 1 5 1 3 18 9 1 5 19 1 7 4 1 1 1 2 3 1 2 1

Dotplot of Hours of Sleep
- Quantitative variables from very small samples can be displayed
with a dotplot.

[Figure: dotplot of hours of sleep]

Histograms
- Histograms are a more general tool for displaying the distribution of quantitative variables.
- A histogram is a bar graph of counts of observations in each class, but no space is drawn between classes.
- If classes are of different widths, the bars should be drawn so that areas are proportional to frequencies.
- Selection of classes is arbitrary. Different choices can lead to different pictures.
- Too few classes is an over-summary of the data.
- Too many classes can cloud important features of the data with noise.

Miles from MSC
[Figure: Histogram of MilesClass; x-axis MilesClass (0 to 2000), y-axis Frequency]

Corrected Miles from MSC
[Figure: Histogram of MilesClass, restricted by a cutoff of 20; x-axis MilesClass (0 to 8), y-axis Frequency]

Miles from Home
[Figure: Histogram of MilesHome; x-axis MilesHome (0 to 8000), y-axis Frequency]

Miles from Home for Students within 250 miles
[Figure: Histogram of MilesHome for students within 250 miles; x-axis MilesHome (0 to 250), y-axis Frequency]

Summary of Height
[Figure: histogram of Height (inches); x-axis 55 to 75, y-axis Frequency]

Stem and Leaf Diagrams
- Stem and Leaf diagrams are useful for showing the shape of the distribution of small data sets without losing any (or much) information.
- Begin by rounding all data to the same precision.
- The last digit is the leaf.
- Anything before the last digit is the stem.
- In a stem and leaf diagram, each observation is represented by a single digit to the right of a line.
- Stems are shown only once.
- Show stems to fill gaps.
- Combining or splitting stems can lead to a better picture of the distribution.

Stem and Leaf Diagram of Brothers and Sisters

Skewness
- Histograms show several qualitative features of a quantitative variable, such as the number of modes and skewness.
- A distribution is approximately symmetric if the left and right halves are approximately mirror images of each other.
- A distribution is skewed to the right if
the right half of the data (the larger values) is more spread out than the left half of the data.
- A distribution is skewed to the left if the left half of the data (the smaller values) is more spread out than the right half of the …
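
As an illustration of the frequency distribution described under Summaries of Categorical Variables, here is a minimal Python sketch that tabulates the Level column of the student data in its meaningful order. The function name `frequency_table` and the list `LEVEL_ORDER` are hypothetical names chosen for this example, and including "First" among the levels is an assumption, since no student in the table reports it.

```python
from collections import Counter

# Level column from the ten-row student table in these notes.
levels = ["Fifth", "Fourth", "Fourth", "Second", "Fourth",
          "Second", "Fourth", "Third", "Second", "Fourth"]

# Assumed ordering for this ordinal variable; "First" is an assumption
# added for completeness and has a count of zero here.
LEVEL_ORDER = ["First", "Second", "Third", "Fourth", "Fifth"]


def frequency_table(values, order):
    """Return (category, count, relative frequency) rows in the given order."""
    counts = Counter(values)   # missing categories count as 0
    n = len(values)            # sample size
    return [(cat, counts[cat], counts[cat] / n) for cat in order]


table = frequency_table(levels, LEVEL_ORDER)
for cat, count, rel in table:
    # A row of '#' marks serves as a crude text bar chart.
    print(f"{cat:<7}{count:>3}  {rel:>5.2f}  {'#' * count}")
```

Listing the categories explicitly, rather than sorting the observed values, keeps the ordinal order and also shows categories that happen to have no observations.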
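
The stem and leaf rules listed earlier (round to a common precision, last digit is the leaf, everything before it is the stem, show each stem once, and show stems that fill gaps) can be sketched in Python as follows. This is only one possible reading of those rules: the function name `stem_and_leaf` is hypothetical, the data values in `sample` are made up for illustration and are not from the course data, and the sketch assumes a non-empty list of non-negative numbers rounded to whole units.

```python
from collections import defaultdict


def stem_and_leaf(data):
    """Return the lines of a stem and leaf diagram for non-negative numbers."""
    # Begin by rounding all data to the same precision (whole numbers here).
    rounded = sorted(int(round(x)) for x in data)

    leaves = defaultdict(list)
    for value in rounded:
        stem, leaf = divmod(value, 10)  # last digit is the leaf, the rest is the stem
        leaves[stem].append(leaf)

    lines = []
    # Each stem appears once; stems with no leaves are kept so gaps stay visible.
    for stem in range(min(leaves), max(leaves) + 1):
        digits = "".join(str(d) for d in leaves.get(stem, []))
        lines.append(f"{stem:>3} | {digits}")
    return lines


# Made-up illustrative values, not taken from the course data.
sample = [5.2, 12.7, 18.1, 23.4, 23.9, 31.0, 58.6]
diagram = stem_and_leaf(sample)
print("\n".join(diagram))
```

Combining or splitting stems, as the notes suggest, would amount to changing the divisor in `divmod` or dividing each stem's leaves into low and high halves.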