
PubHlth 540 Topic 1: Summarizing Data Computer Illustration: R Page 1 of 9

BE540 - Introduction to Biostatistics
Computer Illustration
Topic 1 – Summarizing Data
Software: R

A Visit to Yellowstone National Park, USA

Source: Chatterjee S, Handcock MS, and Simonoff JS. A Casebook for a First Course in Statistics and Data Analysis. New York: John Wiley, 1995.

Setting: Upon completion of BE540, you decide to take a vacation to the United States. Of particular interest is seeing an eruption of the famous "Old Faithful" geyser at Yellowstone National Park. Unfortunately, your time is limited and you do not wish to miss seeing an eruption. This worked example illustrates descriptive analysis of a data set of 222 interval times between eruptions of the Old Faithful Geyser, measured during August 1978 and 1979.

Data File: GEYSER1.DAT - This is a data set in ASCII format.

Description of Data: There are three variables, in the following order:
INDEX - An index of the date of the eruption. We will not be using this variable.
DURATION - The duration of the eruption in minutes.
INTERVAL - The length of the interval, in minutes, between the current eruption and the next eruption.

Objective: Describe the pattern of eruptions and predict the interval of time to the next eruption.

Before you begin: About R software:

R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered as a different implementation of S. There are some important differences, but much code written for S runs unaltered under R. R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible.
The S language is often the vehicle of choice for research in statistical methodology, and R provides an Open Source route to participation in that activity. One of R's strengths is the ease with which well-designed publication-quality plots can be produced, including mathematical symbols and formulae where needed. Great care has been taken over the defaults for the minor design choices in graphics, but the user retains full control.

R is available as Free Software under the terms of the Free Software Foundation's GNU General Public License in source code form. It compiles and runs on a wide variety of UNIX platforms and similar systems (including FreeBSD and Linux), Windows and MacOS. You can download R for free at http://www.r-project.org/

Preliminaries:
(1) To do this illustration yourself, launch R in a new window. Then, as you proceed, use “copy and paste” from this illustration into the R command line. Press “Enter” to execute.
(2) R commands, and instructions to press the “Enter” key, are in blue.
(3) Note: Where you see green, you need to replace what is written in this illustration with your own input.
(4) Lines that begin with the number symbol, #, are comments.

1. Read in the data.

Download and save the data file into a folder.

    # To read in the data file use the following command. Note that the path to the
    # actual file has forward slashes (/) and not backslashes (\)
    geyser.data <- read.table('C:/Users/YourPathToDataFile/geyser1.dat', sep=' ', col.names=c('index', 'duration', 'interval'))

To view the data set that you read into R, type

    geyser.data

then press “Enter”.

2. Obtain a Histogram of Interval Times.

    # Obtain a histogram of interval times
    hist(geyser.data$interval)

You should see a histogram of the interval times.

Remarks: The interval times are in the range of 40 to 100 minutes, approximately. There appear to be two groupings of interval times. They are centered at 55 and 80 minutes, approximately.
Interestingly, there is a gap in the middle.

3. Save this histogram as a picture that you can print directly or that you can insert into a document such as this one.

TIP: To save a copy of the histogram, right-click on the histogram and press “Ctrl+C”. Then paste it into any word processing document.

4. Instead of a histogram, we might have constructed a stem-leaf diagram.

    stem(geyser.data$interval)

You should see:

      The decimal point is 1 digit(s) to the right of the |

      4 | 234
      4 | 55788999
      5 | 0011111111111111222333334444
      5 | 555566677778889
      6 | 0000111112223
      6 | 66677788999
      7 | 00000111112222233333333344444
      7 | 55555555555555566666666667777777788888889999
      8 | 0000000000000111111111222222222233333333444444444
      8 | 5666666788899
      9 | 00011134
      9 | 5

Remarks: You can see that a stem and leaf diagram is very similar to a histogram. However, we can also see that the minimum and maximum interval times are 42 and 95 minutes, respectively, and that the median time is 75 minutes.

5. In this example, a Box and Whisker plot is not very informative.

    # Obtain a box and whisker plot of interval times
    boxplot(geyser.data$interval)

Remarks: Both the histogram and stem and leaf summaries suggested that there are two groups of interval times. This cannot be seen in a Box and Whisker plot. Box and Whisker plots are excellent for summarizing the distribution of ONE population. They are not informative when the sample being summarized actually represents MORE THAN ONE population.

6. We have information on duration of eruption also. One possibility is that the duration of the current eruption is a predictor of the interval time to the next eruption. To investigate this possibility, construct a scatter plot of interval time versus duration. Plot the
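
A scatter plot of interval time versus duration, as described in step 6, might look like the following minimal sketch. It is not the illustration's own code. So that it runs without the data file, it uses R's built-in faithful data set as a stand-in, which is an assumption: faithful is a different Old Faithful sample from GEYSER1.DAT, so its counts and values will not match the output shown earlier. The names geyser.sketch and duration.interval.cor are hypothetical and introduced only for this example.

```r
# Stand-in data (assumption): R's built-in `faithful` data set, NOT GEYSER1.DAT.
# `eruptions` is the eruption duration and `waiting` is the time to the next
# eruption, both in minutes; they are renamed to match this illustration.
geyser.sketch <- data.frame(duration = faithful$eruptions,
                            interval = faithful$waiting)

# Scatter plot of interval time (vertical axis) versus duration (horizontal axis)
plot(geyser.sketch$duration, geyser.sketch$interval,
     xlab = "Duration of current eruption (minutes)",
     ylab = "Interval to next eruption (minutes)",
     main = "Interval versus duration")

# Numeric companion to the plot: Pearson correlation between the two variables
duration.interval.cor <- cor(geyser.sketch$duration, geyser.sketch$interval)
print(duration.interval.cor)
```

To try this on the file read in step 1, replace geyser.sketch with geyser.data; the column names duration and interval are the same.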


UMass Amherst PUBHLTH 540 - Summarizing Data
