
University of California, Los Angeles
Department of Statistics

Statistics 12
Instructor: Nicolas Christou

Data analysis with R - Some simple commands

When you are in R, the command line begins with

    >

To read data from a website, use the following command:

    > a <- read.table("http://www.stat.ucla.edu/~nchristo/statistics12/body_fat.txt", header=TRUE)

The result of the command read.table is a "data frame" (it looks like a table). In our example we give the name a to our data frame. The columns of a data frame are variables. This file contains data on percentage of body fat determined by underwater weighing and various body circumference measurements for 251 men. Here is the variable description:

    Variable   Description
    x1         Density determined from underwater weighing
    y          Percent body fat from Siri's (1956) equation
    x3         Age (years)
    x4         Weight (lbs)
    x5         Height (inches)
    x6         Neck circumference (cm)
    x7         Chest circumference (cm)
    x8         Abdomen 2 circumference (cm)
    x9         Hip circumference (cm)
    x10        Thigh circumference (cm)
    x11        Knee circumference (cm)
    x12        Ankle circumference (cm)
    x13        Biceps (extended) circumference (cm)
    x14        Forearm circumference (cm)
    x15        Wrist circumference (cm)

If the data file is on your computer (e.g. on your desktop), first you need to change the working directory by clicking on "Misc" at the top of your screen, and then read the data as follows:

    > a <- read.table("filename.txt", header=TRUE)

Note: the expression <- is an assignment operator.

Once we read the data we can display them by simply typing at the command line:

    > a

Or, if we want, we can display the first 6 rows of the data by typing:

    > head(a)

Here is the output:

    > head(a)
          x1    y x3     x4    x5   x6   x7   x8    x9  x10  x11  x12  x13  x14  x15
    1 1.0853  6.1 22 173.25 72.25 38.5 93.6 83.0  98.7 58.7 37.3 23.4 30.5 28.9 18.2
    2 1.0414 25.3 22 154.00 66.25 34.0 95.8 87.9  99.2 59.6 38.9 24.0 28.8 25.2 16.6
    3 1.0754 10.3 23 188.15 77.50 38.0 96.6 85.3 102.5 59.1 37.6 23.2 31.8 29.7 18.3
    4 1.0722 11.7 23 198.25 73.50 42.1 99.6 88.6 104.1 63.1 41.7 25.0 35.6 30.0 19.2
    5 1.0708 12.3 23 154.25 67.75 36.2 93.1 85.2  94.5 59.0 37.3 21.9 32.0 27.4 17.1
    6 1.0775  9.4 23 159.75 72.25 35.5 92.1 77.1  93.9 56.1 36.1 22.7 30.5 27.2 18.2

Useful commands

Extracting one variable from the data frame (e.g. the second variable):

    > a[,2]

Another way to extract a variable:

    > a$y

Similarly, if we want to access a particular row in our data (e.g. the first row):

    > a[1,]

To list all the data, simply type:

    > a

To compute the mean of all the variables in the data set (in current versions of R, mean() applied to a whole data frame returns NA with a warning, so colMeans() is used here):

    > colMeans(a)

To compute the mean of just one variable:

    > mean(a$y)

To compute the mean of variables 2 and 3:

    > colMeans(a[,c(2,3)])

To compute the variance of one variable:

    > var(a$y)

To compute the variance-covariance matrix of all the variables:

    > cov(a)

To compute the variance-covariance matrix of all the variables except the first variable:

    > cov(a[,-1])

To compute the variance-covariance matrix of variables 1, 2, and 3:

    > cov(a[,c(1,2,3)])

or

    > cov(a[,1:3])

To compute the variance-covariance matrix of variables 1, 2, and 5:

    > cov(a[,c(1,2,5)])

To compute the correlation matrix: as above, replace cov with cor. For example:

    > cor(a[,c(1,2,3)])

To compute summary statistics for all the variables:

    > summary(a)

To construct a stem-and-leaf plot, histogram, and boxplot:

    > stem(a$y)
    > hist(a$y)
    > boxplot(a$y)

To plot variable y against variable x10:

    > plot(a$x10, a$y)

And you can give names to the axes and to your plot:

    > plot(a$x10, a$y, main="Scatterplot of percent body fat against thigh circumference",
           ylab="Percent body fat", xlab="Thigh circumference")

And here is the plot:

    [Figure: scatterplot of percent body fat (vertical axis) against thigh circumference (horizontal axis).]

To save a plot as a pdf file under the working directory (e.g. your desktop):

    > pdf("box.pdf")
    > boxplot(a$y)
    > dev.off()

A box plot of the variable y can now be found in your current working directory under the name box.pdf.

If you want to read more about a specific command, for example about histograms and boxplots, type the following at the command line:

    > ?hist
    > ?boxplot

Exercise: Construct the same plots with different variables and save them on your desktop.
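The summary commands in this section can be tried without downloading anything. The sketch below is only an illustration: it uses a small, made-up data frame called toy (a hypothetical stand-in for the body fat data, with invented values) and applies the same column selections to it. It uses colMeans() for the column means, on the assumption that a current version of R is installed, where mean() applied to a whole data frame returns NA.

```r
# Hypothetical toy data frame standing in for the body fat data frame `a`.
# The values are made up for illustration only.
toy <- data.frame(
  x1 = c(1.08, 1.04, 1.07, 1.07),
  y  = c(6, 25, 10, 12),
  x3 = c(22, 22, 23, 23)
)

# Column-wise means of every variable in the data frame.
all_means <- colMeans(toy)

# Means of variables 2 and 3 only.
some_means <- colMeans(toy[, c(2, 3)])

# Variance of a single variable.
var_y <- var(toy$y)

# Variance-covariance matrix of all variables except the first.
cov_no_first <- cov(toy[, -1])

# Correlation matrix of variables 1, 2 and 3.
cor_123 <- cor(toy[, 1:3])

print(all_means)
print(cov_no_first)
print(cor_123)
```

The diagonal of the variance-covariance matrix holds the variances, so the entry for y in cov_no_first agrees with var(toy$y). Replacing toy with a gives the corresponding results for the body fat data.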
Create multiple graphs on one page

Suppose we want 9 graphs (3 x 3):

    > pdf("plot9.pdf")
    > par(mfrow=c(3,3))
    > hist(a$y)
    > boxplot(a$x10)
    > plot(a$x10, a$y)
    > boxplot(a$y)
    > boxplot(a$x9)
    > hist(a$x1)
    > plot(a$x13, a$y)
    > hist(a$x9)
    > boxplot(a$x13)
    > dev.off()

And here is the plot:

    [Figure: a 3 x 3 page of plots saved as plot9.pdf; panel titles include "Histogram of a$y", "Histogram of a$x1", and "Histogram of a$x9".]

Create subsets

The following simple commands will create subsets of the original data frame a:

    > a1 <- a[,1:3]             # A new data frame with only the first three columns.
    > a2 <- a[,c(1:3,8,10)]     # A new data frame with columns 1, 2, 3, 8, 10.

Another data set

The following data were collected in the area west of the town of Stein in the Netherlands, near the river Meuse (Dutch: Maas; see map below). The actual data set contains many variables, but here we will use the x, y coordinates and the concentrations of lead and zinc (in ppm) at each data point. The motivation for this study was to predict the concentration of heavy metals around the banks of the Maas river in this area. These heavy metals accumulated over the years because of the pollution of the river. Here is the area of study:

    [Figure: map of the study area.]

Exercise

a. You can access these data using:

    > b <- read.table("http://www.stat.ucla.edu/~nchristo/statistics12/soil.txt", header=TRUE)

b. Construct the stem-and-leaf plot, histogram, and boxplot for each one of the two variables (lead and zinc), and compute the summary statistics. What do you observe?

c. Transform the data in order to produce a symmetrical histogram. Here is what you can do:

    > log_lead <- log(b$lead)
    > log_zinc <- log(b$zinc)

Construct the stem-and-leaf plot, histogram, and boxplot for each one of the new variables (log_lead and log_zinc), and compute the summary statistics. What do you observe now?

Here is
