Duke STA 101 - Exploratory Data Analysis

This preview shows page 1-2-3-4-5 out of 14 pages.

9/10/09
FPP 7-9

Exploratory Data Analysis: Two Variables

Exploratory data analysis: two variables
• 2 qualitative/categorical variables
  • Contingency tables (we will cover these later in the semester)
• 1 qualitative/categorical variable, 1 quantitative variable
  • Side-by-side box plots
• 2 quantitative variables
  • Scatter plots, correlations, regressions

Box plots
• A box plot is a graph of five numbers:
  • Minimum
  • Maximum
  • Median
  • 1st quartile
  • 3rd quartile
• We know how to compute three of the numbers (min, max, median)
• To compute the 1st quartile, find the median of the 50% of observations that are smaller than the median
• To compute the 3rd quartile, find the median of the 50% of observations that are bigger than the median

Side-by-side box plots
• Box plots are very useful for comparing the distribution of a quantitative variable across the levels of a qualitative variable

Pets and stress
• Are there any differences in stress levels when doing tasks with your pet, a good friend, or alone?
• Allen et al. (1988) asked 45 people to count backwards by 13s and 17s.
• People were randomly assigned to one of three groups: pet, friend, alone.
• The response is the subject’s average heart rate during the task.

Pets and stress
• It looks like the task is most stressful around friends and least stressful around pets

Vietnam draft lottery
• In 1970, the US government drafted young men for military service in the Vietnam War. These men were drafted by means of a random lottery. Basically, paper slips containing all dates in January were placed in a wooden box and then mixed. Next, all dates in February (including 2/29) were added to the box and mixed. This procedure was repeated until all 366 dates were mixed in the box. Finally, dates were successively drawn without replacement. The first date drawn (Sept. 14) was assigned rank 1, the second date drawn (April 24) was assigned rank 2, and so on. Those eligible for the draft who were born on Sept. 14 were called first to service, then those born on April 24 were called, and so on.
• Soon after the lottery, people began to complain that the randomization system was not completely fair. They believed that birth dates later in the year had lower lottery numbers than those earlier in the year (Fienberg, 1971).
• What do the data say? Was the draft lottery fair? Let’s do a statistical analysis of the data to find out.

[Figure: Draft rank by month in the Vietnam draft lottery: raw data]

[Figure: Draft rank by month in the Vietnam draft lottery: box plots]

Exploratory data analysis: two quantitative variables
• Scatter plots
  • A scatter plot shows one variable vs. the other in a 2-dimensional graph
  • Always plot the explanatory variable, if there is one, on the horizontal axis
  • We usually call the explanatory variable x and the response variable y
  • If there is no explanatory-response distinction, either variable can go on the horizontal axis

Example

Gross Sales   Items
890.5         115
197           17
231           26
170           21
202.5         30
225.5         35
489.7         84
234.8         42
161.5         21
284           44
422           65
300.7         59
412.4         69
346.8         59
92.3          19
255.8         42
118.5         16
286.5         39
594           72
263.29        43
244.08        45
394.28        64
241.31        36
299.97        40
649.04        103

Describing scatter plots
• Form
  • Linear, quadratic, exponential
• Direction
  • Positive association: an increase in one variable is accompanied by an increase in the other
  • Negative association: a decrease in one variable is accompanied by an increase in the other
• Strength
  • How closely the points follow a clear form

Describing scatter plots
• Form: linear
• Direction: positive
• Strength: moderately strong?

Correlation coefficient
• We need something more than an arbitrary visual guess at how strong the association between two variables is.
• We need a value that can summarize the strength of a relationship
  • That doesn’t change when units change
  • That makes no distinction between the response and explanatory variables

Correlation coefficient
• Computing the correlation coefficient
  • Let x, y be any two quantitative variables for n individuals

$$ r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right) $$

where $\bar{x}$ and $\bar{y}$ are the means and $s_x$ and $s_y$ are the standard deviations of the variables x and y, respectively.

Correlation coefficient
• Remember that $(x_i - \bar{x})/s_x$ and $(y_i - \bar{y})/s_y$ are the standardized values of variables x and y, respectively
• The correlation r is an average of the products of the standardized values of the two variables x and y for the n observations

Properties of r
• Makes no distinction between explanatory and response variables
• Both variables must be quantitative
  • No ordering with qualitative variables
• Is invariant to change of units
• Is between -1 and 1
• Is affected by outliers
• Measures strength of association for only linear relationships!

True or False
• Let X be GNP for the U.S. in dollars and Y be GNP for Mexico in pesos. Changing Y to U.S. dollars changes the value of the correlation.
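
The formula for r and several of the properties listed above can be checked numerically. The following is a minimal sketch in plain Python, not part of the original slides. The toy data, the conversion factor 0.08, and the helper names `mean`, `sample_sd`, and `correlation` are all assumptions made for illustration. It uses the n - 1 divisor for both the standard deviations and the average of products.

```python
import math

def mean(values):
    return sum(values) / len(values)

def sample_sd(values):
    # Standard deviation with the n - 1 divisor.
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def correlation(x, y):
    # r = 1/(n-1) * sum of products of the standardized values of x and y.
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = sample_sd(x), sample_sd(y)
    products = (((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
                for xi, yi in zip(x, y))
    return sum(products) / (n - 1)

# Hypothetical toy data, chosen only for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

r_xy = correlation(x, y)
r_yx = correlation(y, x)                 # swap explanatory and response roles

# Change of units: rescale y by an assumed (made-up) conversion factor.
y_rescaled = [0.08 * v for v in y]
r_rescaled = correlation(x, y_rescaled)

# A perfect but non-linear (quadratic) relationship, symmetric about x = 0.
x_sym = [-2.0, -1.0, 0.0, 1.0, 2.0]
y_sq = [v ** 2 for v in x_sym]
r_quadratic = correlation(x_sym, y_sq)

print(f"r(x, y)          = {r_xy:.4f}")
print(f"r(y, x)          = {r_yx:.4f}")
print(f"r(x, rescaled y) = {r_rescaled:.4f}")
print(f"r(x, x^2)        = {r_quadratic:.4f}")
```

The first three printed values agree up to floating-point rounding, which matches two properties: r makes no distinction between explanatory and response variables, and it does not change when the units change. The last value is essentially zero even though y is completely determined by x, because r measures only linear association.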
[Figure: four scatter plots, each labeled "Correlation coefficient is ____"]

Correlation coefficient
• Correlation is not an appropriate measure of association for non-linear relationships
• What would r be for this scatter plot?

[Figure: scatter plot]

Correlation coefficient
• CORRELATION IS NOT CAUSATION
• A substantial correlation between two variables might indicate the influence of other variables on both
• Or, a lack of substantial correlation might mask the effect of the other variables

Correlation coefficient
• CORRELATION IS NOT CAUSATION
• Plot of life expectancy of population and number of people per TV for 22 countries (1991 data)

Correlation coefficient
• CORRELATION IS NOT CAUSATION
• A study showed that there was a strong correlation between the number of firefighters at a fire and the property damage that the fire causes.
• We should send fewer firefighters to fight fires, right??
• This is an example of a lurking variable. What might it be?

Interpreting correlations
• A newspaper article contains a quote from a psychologist, who says, “The