# UW-Madison STAT 849 - Part 1 - Introduction to R

Part 1: Introduction to R

Douglas Bates, University of Wisconsin - Madison, and R Development Core Team

Sept 8, 2010

## Outline

- What is R?
- Organizing data
- Accessing and modifying variables
- Subsets of data frames
- Missing Data

## R

- R is an Open Source (and freely available) environment for statistical computing and graphics.
- The CRAN links on the course web site provide binary downloads for Windows, for Mac OS X and for several flavors of Linux. Source code is also available.
- R is under active development, with typically two major releases per year.
- R provides data manipulation and display facilities and most statistical procedures. It can be extended with “packages” containing data, code and documentation. Currently there are more than 2400 contributed packages in the Comprehensive R Archive Network (CRAN).

## Simple calculator usage

- The R application is started by clicking on an icon or a menu item. The main window is called the console window.
- Arithmetic expressions can be typed in the console window. If the expression on a line is complete, it is evaluated and the result is printed.

      > 5 - 1 + 10
      [1] 14
      > 7 * 10/2
      [1] 35
      > exp(-2.19)
      [1] 0.1119167
      > pi
      [1] 3.141593
      > sin(2 * pi/3)
      [1] 0.8660254

## Comments on the calculator usage

- The `>` symbol at the beginning of the input line is the prompt from the application, not something that is typed by the user.
- If the expression typed is incomplete, say because it contains a `(` without the corresponding `)`, then the prompt changes to a `+`, indicating that more input is required.
- The expression `[1]` at the beginning of the response is an index indicating that what follows is the first (and in these cases the only) element of a numeric vector.

## Assignment of values to names

- During a session, data objects can be assigned to names.
- The assignment operator is the two-character sequence `<-`. (The `=` sign can also be used, except in a few cases.)
- The function `ls` lists the names of objects; `rm` removes objects.
  An alternative to `ls` is `ls.str()`, which lists objects in the workspace and provides a brief description of their structure.

      > x <- 5
      > ls()
      [1] "x"
      > ls.str()
      x :  num 5
      > rm(x)
      > ls()
      character(0)

## Vectors

- Numeric objects are always stored as vectors (as opposed to scalars).
- An easy way to create a non-trivial vector is a sequence, generated by the `:` operator or the `seq` function.
- When results are printed, the number in square brackets at the beginning of the line is the index of the element at the start of the line.
- Square brackets are used to specify indices (or, in general, subsets).

      > (x <- 0:19)
       [1]  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
      > x[5]
      [1] 4
      > str(y <- x + runif(20, min = 10, max = 20))
       num [1:20] 17 18.9 17.3 19.2 20.6 ...

## Following the operations on the slides

- The lines of R code shown on these slides are available in files on the course web site. The file for this section is called `1Intro.R`.
- If you open this file in the R application (the File→Open menu item or `<ctrl>-O`) and position the cursor at a particular line, then `<ctrl>-R` will send the line to the console window for execution and step to the next line.
- Any part of a line following a `#` symbol is a comment.
- The code is divided into named “chunks”, typically one chunk per slide that contains code.
- In the system called Sweave used to generate the slides, the result of a call to a graphics function must be printed. In interactive use this is not necessary but neither is it harmful.
- Note that R provides name completion with the `<tab>` key. After typing part of a name you can use tab to request completion.

## Organizing data in R

- Standard rectangular data sets (columns are variables, rows are observations) are stored in R as data frames.
- The columns can be numeric variables (e.g. measurements or counts) or factor variables (categorical data) or ordered factor variables. These types are called the class of the variable.
- The `str` function provides a concise description of the structure of a data set (or any other class of object in R).
  The `summary` function summarizes each variable according to its class. Both are highly recommended for routine use.
- Entering just the name of the data frame causes it to be printed. For large data frames use the `head` and `tail` functions to view the first few or last few rows.

## Data input

- The simplest way to input a rectangular data set is to save it as a comma-separated value (csv) file and read it with `read.csv`.
- The first argument to `read.csv` is the name of the file. On Windows it can be tricky to get the file path correct (backslashes need to be doubled). The best approach is to use the function `file.choose`, which brings up a “chooser” panel through which you can select a particular file. The idiom to remember is

      > mydata <- read.csv(file.choose())

  for comma-separated value files or

      > mydata <- read.delim(file.choose())

  for files with tab-delimited data fields.
- With an Internet connection you can use a URL (within quotes) as the first argument to `read.csv` or `read.delim`. (See question 1 in the first set of exercises.)

## In-built data sets

- One of the packages attached by default to an R session is the `datasets` package, which contains several data sets culled primarily from introductory statistics texts.
- We will use some of these data sets for illustration.
- The `Formaldehyde` data are from a calibration experiment; the `InsectSprays` data are from an experiment on the effectiveness of insecticides.
- Use `?` followed by the name of a function or data set to view its documentation. If the documentation contains an example section, you can execute it with the `example` function.

## The Formaldehyde data

    > str(Formaldehyde)
    'data.frame':   6 obs. of  2 variables:
     $ carb  : num  0.1 0.3 0.5 0.6 0.7 0.9
     $ optden: num  0.086 0.269 0.446 0.538 0.626 0.782
    > summary(Formaldehyde)
          carb            optden
     Min.   :0.1000   Min.   :0.0860
     1st Qu.:0.3500   1st Qu.:0.3132
     Median :0.5500   Median :0.4920
     Mean   :0.5167   Mean   :0.4578
     3rd Qu.:0.6750   3rd Qu.:0.6040
     Max.   :0.9000   Max.   :0.7820
    > Formaldehyde
      carb optden
    1  0.1  0.086
    2  0.3  0.269
    3  0.5  0.446
    4  0.6  0.538
    5  0.7  0.626
    6  0.9  0.782

## The InsectSprays data

    > str(InsectSprays)
    'data.frame':   72 obs. of  2 variables:
     $ count: num  10 7 20 14 14 12 10 23 17 20 ...
     $ spray: Factor w/ 6 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
    > summary(InsectSprays)
         count       spray
     Min.   : 0.00   A:12
     1st Qu.: 3.00   B:12
     Median : 7.00   C:12
     Mean   : 9.50   D:12
     3rd Qu.:14.25   E:12
     Max.   :26.00   F:12
    > head(InsectSprays)
      count spray
    1    10     A
    2     7     A
    3    20     A
    4    14     A
    5    14     A
    6    12     A

## Copying, saving and restoring data objects

- Assigning a data object to a new name creates a copy.
- You can save a data object to a file, typically
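
The statement above that assigning a data object to a new name creates a copy can be checked directly. The following is a minimal sketch, not part of the original slides; the name `fm` is a hypothetical variable introduced only for this illustration. It copies the built-in `Formaldehyde` data frame, changes one value in the copy, and compares the copy with the original.

```r
# Minimal sketch (not from the slides): assignment creates an independent copy.
# `fm` is a hypothetical name chosen for this illustration.
fm <- Formaldehyde          # assign the built-in data frame to a new name
fm$carb[1] <- 0.2           # modify the first carb value in the copy only

fm$carb[1]                  # value stored in the copy
Formaldehyde$carb[1]        # value stored in the original data set

# TRUE when the copy and the original no longer hold the same data
!identical(fm, Formaldehyde)
```

Because the change is made through the new name, only `fm` is affected; the data set provided by the `datasets` package keeps its original values.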
