UCLA STATS 10 - Lab 2: Data Cleaning/Preparation and Visualization

Lab 2: Data Cleaning/Preparation and Visualization
Stats 10: Introduction to Statistical Reasoning
Summer 2020

All rights reserved, Adam Chaffee and Michael Tsiang, 2017-2020. Do not post, share, or distribute anywhere or with anyone without explicit permission. Some exercises based on labs by Nicolas Christou.

Objectives
1. Understand logical statements and subsetting
2. Reinforce knowledge on visualization techniques

Collaboration Policy
In Lab you are encouraged to work in pairs or small groups to discuss the concepts on the assignments. However, DO NOT copy each other's work, as this constitutes cheating. The work you submit must be entirely your own. If you have a question in lab, feel free to reach out to other groups, or talk to your TA if you get stuck.

Intro: Logical Statements

Relational Operators (Logical Expressions)
Type ?Comparison to see the R documentation on the list of all relational operators you can apply. Many logical expressions in R use these relational operators. Try running the lines of code below that use the relational operators.

    4 > 3              # Is 4 greater than 3?
    c(3, 8) >= 3       # Is 3 or 8 greater than or equal to 3?
    c(3, 8) <= 3       # Is 3 or 8 less than or equal to 3?
    c(1, 4, 9) == 9    # Is 1, 4, or 9 exactly equal to 9?
    c(1, 4, 9) != 9    # Is 1, 4, or 9 not exactly equal to 9?

Notice that the output is a logical vector (i.e., uses TRUE and FALSE) that has the length of the vector on the left of the relational statement.

Applications of logical statements: calculations
We can perform certain calculations on logical vectors because R reads TRUE as 1 and FALSE as 0. Create the NCbirths object from last lab and try these examples:

    sum(NCbirths$weight > 100)           # the number of babies that weighed more than 100 ounces
    mean(NCbirths$weight > 100)          # the proportion of babies that weighed more than 100 ounces
    mean(NCbirths$gender == "Female")    # the proportion of female babies
    mean(NCbirths$gender != "Male")      # gives the proportion of babies not assigned male

Applications of logical statements: subsets
We can combine logical statements with square brackets to subset data based on conditions. Examples with NCbirths:

    fem_weights <- NCbirths$weight[NCbirths$gender == "Female"]

With the line above we created a vector called fem_weights that contains the weights of all the female babies. We can combine multiple conditions using & and |, but these will be discussed in future labs.

Good coding practices
Please consider implementing the following in your code:
1. Use the pound symbol (#) often to comment on different code sections. Consider using comments to label your exercise numbers and question parts, and to help describe what your code does.
2. Use good spacing. Adding a space between arguments and inside of functions makes your code easier to read. You can also skip lines for clarity.
3. Create as many objects as you like to make your code easier to follow. For example, consider my line above creating the fem_weights object. An alternative way to code this using best practices is below:

    # Create an object with the baby weights from NCbirths
    baby_weight <- NCbirths$weight

    # Create an object with the baby genders from NCbirths
    baby_gender <- NCbirths$gender

    # Create a logical vector to describe if the gender is female
    is_female <- baby_gender == "Female"

    # Create the vector of weights containing only females
    fem_weights <- baby_weight[is_female]

Exercise 1
We will be working with lead and copper data obtained from the residents of Flint, Michigan from January-February 2017. Data are reported in PPB (parts per billion, or µg/L) from each residential testing kit. Remember that "Pb" denotes lead and "Cu" denotes copper. You can learn more about the Flint water crisis at https://en.wikipedia.org/wiki/Flint_water_crisis
a. Download the data from CCLE and read it into R. When you read in the data, name your object "flint".
b. The EPA states a water source is especially dangerous if the lead level is 15 PPB or greater. What proportion of the locations tested were found to have dangerous lead levels?
c. Report the mean copper level for only test sites in the North region.
d. Report the mean copper level for only test sites with dangerous lead levels (at least 15 PPB).
e. Report the mean lead and copper levels.
f. Create a box plot with a good title for the lead levels.
g. Based on what you see in part (f), does the mean seem to be a good measure of center for the data? Report a more useful statistic for this data.

Exercise 2
The data here represent life expectancies (Life) and per capita income (Income) in 1974 dollars for 101 countries in the early 1970s. The source of these data is: Leinhardt and Wasserman (1979), New York Times (September 28, 1975, p. E-3). They also appear in Regression Analysis by Ashish Sen and Muni Srivastava. You can access these data in R using:

    life <- read.table("http://www.stat.ucla.edu/~nchristo/statistics12/countries_life.txt", header=TRUE)

a. Construct a scatterplot of Life against Income. Note: Income should be on the horizontal axis. How does income appear to affect life expectancy?
b. Construct the boxplot and histogram of Income. Are there any outliers?
c. Split the data set into two parts: one for which the Income is strictly below $1000, and one for which the Income is at least $1000. Come up with your own names for these two objects.
d. Use the data for which the Income is below $1000. Plot Life against Income and compute the correlation coefficient. Hint: use the function cor().

Exercise 3
Use R to access the Maas river data. These data contain the concentration of lead and zinc in ppm at 155 locations at the banks of the Maas river in the Netherlands. You can read the data in R as follows:

    maas <- read.table("http://www.stat.ucla.edu/~nchristo/statistics12/soil.txt", header=TRUE)

a. Compute the summary statistics for lead and zinc using the summary() function.
b. Plot two histograms: one of lead and one of log(lead).
c. Plot log(lead) against log(zinc). What do you observe?
d. The level of risk for surface soil based on lead concentration in ppm is given in the table below:

    Mean concentration (ppm)    Level of risk
    Below 150                   Lead-free
    Between 150-400             Lead-safe
    Above 400                   Significant environmental lead hazard

Use techniques similar to last lab to give different colors and sizes to the lead concentration at these 155 locations. You do not need to use the maps package to create a map of the area. Just plot the points without a map.

Exercise 4
The data for this exercise represent approximately the centers (given by longitude and latitude) of each one of the City of Los Angeles neighborhoods. See also the Los Angeles Times project on the City of Los Angeles neighborhoods at http://projects.latimes.com/mapping-la
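
Supplementary sketch: the logical-statement, calculation, and subsetting ideas from the Intro can be tried out without any course data. The block below is only an illustration under stated assumptions: the data frame toy and every value in it are made up, and it stands in for a real data set such as NCbirths. It is not a solution to any exercise.

```r
# Hypothetical toy data standing in for a real data set such as NCbirths.
# All values are made up for illustration only.
toy <- data.frame(
  weight = c(95, 120, 101, 88, 130, 110),
  gender = c("Female", "Male", "Female", "Male", "Female", "Male"),
  stringsAsFactors = FALSE
)

# A relational operator applied to a vector returns a logical vector
# of the same length as the vector on the left.
is_heavy <- toy$weight > 100

# TRUE counts as 1 and FALSE as 0, so sum() gives a count
# and mean() gives a proportion.
n_heavy <- sum(is_heavy)
prop_heavy <- mean(is_heavy)

# Square brackets with a logical vector keep only the TRUE positions.
is_female <- toy$gender == "Female"
fem_weights <- toy$weight[is_female]

# Splitting a data frame in two by a threshold: the logical vector before
# the comma selects rows; leaving the part after the comma empty keeps
# all columns.
below_100 <- toy[toy$weight < 100, ]
at_least_100 <- toy[toy$weight >= 100, ]

print(n_heavy)
print(prop_heavy)
print(fem_weights)
```

Storing each intermediate step (is_heavy, is_female) in its own object follows the "create as many objects as you like" advice from the good coding practices list, and makes each logical vector easy to inspect before it is used for subsetting.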
