UW-Madison STAT 371 - Exploratory Data Analysis

Exploratory Data Analysis
Bret Larget
Department of Statistics
University of Wisconsin - Madison
September 8, 2003
Statistics 371, Fall 2003

Exploratory Data Analysis

• Exploratory Data Analysis involves both graphical displays of data and numerical summaries of data.
• A common situation is for a data set to be represented as a matrix.
• There is a row for each unit and a column for each variable.
• A unit is an object that can be measured, such as a person, or a thing.
• A variable is a characteristic of a unit that can be assigned a number or a category.
• In the data set I collected from the survey of students in the class, each student who responded to the survey is a unit. Variables include sex, major, year in school, miles from home, height, and blood type.

Variables

• Variables are either quantitative or categorical.
• In a categorical variable, the measurement for each unit is a category. For example, for the data I collected in the course survey, the variables blood type and sex are categorical.
• The variable year in school is an example of an ordinal categorical variable, because there is a meaningful ordering of the levels of the variable.
• Quantitative variables record a number for each unit.
• The variable height is an example of a continuous variable because it could take on a continuum of possible values.
• The variable number of sisters is discrete because we can list the possible values.
• Often, continuous variables are rounded to a discrete set of values. Most people round their height to the nearest inch (or half inch).
• We can also make a categorical variable from a continuous variable by dividing the range of the variable into classes (so, for example, height could be categorized as short, average, or tall).
• Identifying the types of variables can be important because some methods of statistical analysis are appropriate only for a specific type of variable.

Samples

• A sample is a collection of units on which we have measured one or more variables.
• The number of observations in a sample is called the sample size.
• Common notation for the sample size is n.
• The textbook adopts the convention of using uppercase letters for variables and lowercase letters for observed values.

Summaries of Categorical Variables

• A frequency distribution is a list of the observed categories and a count of the number of observations in each.
• A frequency distribution may be displayed with a table or with a bar chart.
• For ordinal categorical random variables, it is conventional to order the categories in the display (table or bar chart) in the meaningful order.
• For non-ordinal variables, two conventional choices are alphabetical and by size of the counts.
• The vertical axis of a bar chart may show frequency or relative frequency.
• It is conventional to leave space between bars of a bar chart of a categorical variable.

Summary of Blood Type Data

Here is a frequency table.

    > summary(BloodType)
       A   AB    B    O NA's
      25   12    3   22   37

Here is a bar chart.

    > barplot(summary(BloodType))

[Bar chart: one bar per category (A, AB, B, O, NA's); vertical axis shows frequency, 0 to 35]

Summary of Majors

    > cbind(summary(Major))
                                          [,1]
    Animal Science                           2
    Bacteriology                             1
    Biochemistry                             3
    Biological Aspects of Conservation       2
    Biology                                 19
    Biology/Zoology                          1
    Biomedical Engineering                   9
    Botany                                   1
    Dairy Science                            1
    Forestry                                 1
    Genetics                                31
    Graduate student in Bacteriology         1
    Kinesiology                              4
    Medical MicroBiology and Immunology      1
    Molecular Biology                        1
    Nursing                                  1
    Plant Pathology                          1
    Possibly Kinesiology                     1
    Undecided                                1
    Wildlife Ecology                         5
    Wildlife Ecology - Natural Resources     1
    Zoology                                 11

Summary of Second Majors

    > cbind(summary(Major2))
                                          [,1]
                                            69
    Anthropology                             1
    Bacteriology                             4
    Biochemistry                             2
    Biological Aspects of Conservation       2
    Biology                                  2
    Dietetics                                1
    Economics                                1
    Entomology                               1
    French                                   1
    Genetics                                 1
    Horticulture                             1
    IES                                      2
    Life Science Communication               1
    Math                                     2
    Physics                                  1
    Plant Pathology                          1
    Political Science                        1
    Psychology                               1
    Scandinavian Studies                     1
    Spanish                                  1
    Wildlife Ecology                         2

Summaries of Quantitative Variables

• Quantitative variables from very small samples can be displayed with a dotplot.
• Histograms are a more general tool for displaying the distribution of quantitative variables.
• A histogram is a bar graph of counts of observations in each class, but no space is drawn between classes.
• If classes are of different widths, the bars should be drawn so that areas are proportional to frequencies.
• Selection of classes is arbitrary. Different choices can lead to different pictures.
• Too few classes is an over-summary of the data.
• Too many classes can cloud important features of the data with noise.

A Dotplot of Hours of Sleep

    > source("../R/dotplot.R")
    > dotplot(Sleep)

[Dotplot of Sleep; horizontal axis from 5 to 9]

Summary of Miles from CSSC

    > hist(MilesCSSC, col = 3)

[Histogram of MilesCSSC; horizontal axis: MilesCSSC, 0 to 40; vertical axis: Frequency, 0 to 80]

Summary of Miles from Home

    > hist(MilesHome, col = 3)

[Histogram of MilesHome; horizontal axis: MilesHome, 0 to 20000; vertical axis: Frequency, 0 to 80]

Summary of Miles from Home for Students within 150 miles

    > hist(MilesHome[MilesHome < 150], col = 3)

[Histogram of MilesHome[MilesHome < 150]; horizontal axis: 0 to 150; vertical axis: Frequency, 0 to 14]

Summary of Height

    > par(mfrow = c(1, 2))
    > hist(Height[keep], col = 2)
    > hist(Height[keep], breaks = breaks, col = 3)

[Two histograms of Height[keep], side by side; horizontal axis: Height[keep], 60 to 75 in both panels; vertical axis: Frequency, 0 to 15 in the left panel and 0 to 10 in the right panel]

Stem-and-Leaf Diagrams

• Stem-and-leaf diagrams are useful for showing the shape of the distribution of small data sets without losing any (or much) information.
• Begin by rounding all data to the same precision.
• The last digit is the leaf.
• Anything before the last digit is the stem.
• In a stem-and-leaf diagram, each observation is represented by a single digit to the right of a line.
• Stems are shown only once.
• Show stems to fill gaps!
• Combining or splitting stems can lead to a better picture of the distribution.

Stem-and-Leaf Diagram of Brothers and Sisters

    > stem(Brothers, scale = 0.5)

      The decimal point is at the |

      0 | 00000000000000000000000000000000
      1 | 000000000000000000000000000000000000000000
      2 | 0000000000000000
      3 | 0000000
      4 | 00

    > stem(Sisters, scale = 0.5)

      The decimal point is at the |

      0 |
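
The construction rules listed under Stem-and-Leaf Diagrams (round to a common precision, take the last digit as the leaf and everything before it as the stem, and show every stem so that gaps stay visible) can be sketched by hand in R. This is only an illustration under stated assumptions: the function stem_leaf and the vector scores are hypothetical names and made-up values, not part of the course materials or the survey data, and the sketch assumes nonnegative values. It is not how R's built-in stem() is implemented.

```r
# Build a stem-and-leaf display by hand (hypothetical helper, not base R).
# x:      numeric vector of nonnegative values
# digits: number of decimal places to keep when rounding
stem_leaf <- function(x, digits = 0) {
  # Round all data to the same precision and rescale to whole numbers.
  scaled <- round(x * 10^digits)
  stems  <- scaled %/% 10   # everything before the last digit
  leaves <- scaled %% 10    # the last digit

  # List every stem from smallest to largest so that gaps are shown.
  all_stems <- seq(min(stems), max(stems))

  # One row per stem: the stem once, then its sorted leaves.
  vapply(all_stems, function(s) {
    paste0(format(s, width = 2), " | ",
           paste(sort(leaves[stems == s]), collapse = ""))
  }, character(1))
}

# Made-up values with no observations in the 80s.
scores <- c(52, 58, 61, 63, 63, 67, 70, 74, 91, 95)

display <- stem_leaf(scores)
cat(display, sep = "\n")

# Built-in display for comparison; its choice of stems may differ.
stem(scores)
```

In the hand-built display the stem 8 still gets its own row with no leaves, which is the point of showing stems to fill gaps. The built-in stem() chooses its own stems, and its scale argument controls how finely they are split.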