Contingency Coefficient: The contingency coefficient is a measure of association derived from the Pearson chi-square. The contingency coefficient is computed as […]. It has the range […], where m = min(r, c).

Topic (14) – Bivariate Populations

Topic (14) – INFERENCES ABOUT POPULATIONS OF TWO VARIABLES

Example 1: In the study of cave species richness, for each county in the SE USA we collected Y = species richness (number of distinct species) and X = number of caves.

Question: are X and Y positively related, i.e. does a larger number of caves in an area predict that we would see a higher number of distinct species in that area? How strong is that relationship? Note that the relationship does look positive but also possibly non-linear.

[Figure: plot of n = 1622 pairs of Y = species richness (“total”) against X = number of caves (“cavenum”)]

The same plot but with the outlier removed allows us to see the relationship a bit better:

[Figure: same plot but with the outlier (X = 1526, Y = 66) removed]

Example 2: Is a college student’s drug use related to the use of alcohol or psychoactive drugs by his/her parents? Here we would let X be the categorical variable student use (yes or no) and let Y be the categorical variable parent use of alcohol or drugs (neither parent, one parent, or both parents).

When studying bivariate populations we have four possible combinations of pairs of random variables:

                     Y
X                Categorical    Quantitative
Categorical      Example 2
Quantitative                    Example 1

We are going to look at the two types exemplified by the two examples. In both cases we are interested in testing whether the two variables are INDEPENDENT.
In addition, if they are not independent, we would like a measure of how strong the relationship is between X and Y. These measures are called correlation coefficients or measures of association.

Throughout we consider the data to be a random sample of bivariate pairs (x, y) measured on each sampling unit. The population then is bivariate, i.e. it is the set of all possible pairs of (X, Y) values.

INDEPENDENCE OF TWO QUANTITATIVE VARIABLES

A) Numerical summaries of the strength of the relationship between X and Y

1) Pearson’s Correlation Coefficient

[Figure: four scatterplots of Y against X]

Which graphic shows the strongest relationship between Y and X? The weakest? Order them from weak to strong. Which show a positive relationship? A negative relationship? Are any of the relationships non-linear?

Estimation

The Sample Pearson’s Correlation Coefficient, r, is a quantitative assessment of the strength and direction of a linear relationship between 2 variables, i.e. we assume that, if a relationship exists, it is linear.

Rules:
1) if the relationship is positive (slope > 0), r > 0; if the relationship is negative (slope < 0), r < 0
2) if there is no relationship (slope = 0), r = 0
3) the size of r does not depend on the size of the slope
4) if the relationship is perfect (every point falls exactly on a straight line), r = ±1 depending on the sign of the slope
⇒ the stronger the relationship, the closer r is to ±1. The weaker the relationship, the closer r is to 0.

To calculate r:

$$ r = \frac{1}{n-1} \sum_{i=1}^{n} \frac{(x_i - \bar{x})(y_i - \bar{y})}{s_x s_y} = \frac{1}{n-1} \sum_{i=1}^{n} z_{x_i} z_{y_i} $$

where $s_x$ is the sample standard deviation of X, $s_y$ is the sample standard deviation of Y, and $z_{x_i} = (x_i - \bar{x})/s_x$ and $z_{y_i} = (y_i - \bar{y})/s_y$ are the standardized values.
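The rules for r can be checked numerically. The following is a minimal sketch, not part of the original notes: the function name `pearson_r` and the three made-up straight-line data sets are assumptions chosen only for illustration. It computes r as the sum of products of z-scores divided by n - 1, and shows that perfectly linear data give r = ±1 regardless of how steep the line is.

```python
import math


def pearson_r(xs, ys):
    """Sample Pearson correlation: sum of z-score products divided by (n - 1)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length with at least 2 points")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sample standard deviations use the divisor n - 1.
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    if s_x == 0 or s_y == 0:
        raise ValueError("r is undefined when either variable is constant")
    z_products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)]
    return sum(z_products) / (n - 1)


# Hypothetical data: three exact straight lines with very different slopes.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
steep = [10.0 * x + 3.0 for x in xs]    # slope 10
shallow = [0.1 * x + 3.0 for x in xs]   # slope 0.1
falling = [-2.0 * x + 7.0 for x in xs]  # slope -2

print(pearson_r(xs, steep), pearson_r(xs, shallow), pearson_r(xs, falling))
```

Up to floating-point rounding, the steep and shallow lines both give r = 1 and the falling line gives r = -1, so r reflects the sign of the slope but not its size.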
The sample correlation coefficient r is an estimator (approximately unbiased in large samples) of the population Pearson’s correlation coefficient, which is denoted with the Greek letter rho, ρ.

EXAMPLE

Intolerant peas. Peas are known to be a self-intolerant crop because repeated planting in the same field makes them susceptible to root rot diseases. In a study to examine how the crop interval (number of years the field does not have a pea or other legume crop) is related to the severity of root rot disease in pea crops, soil samples from ten fields currently planted in peas were obtained. For each field, the researchers recorded the number of years since the last pea crop (“Interval”) and an index of the level of disease (“Index”; higher value = more disease). The data are:

Interval    Index    Field Block (ID)
0           4.5      B
4           3.7      L
6           4        M
6           3        A
6           3.1      W
8           2.8      QQ
9           1.9      R
9           3        D
9           2.3      DD
14          1.6      T

A scatterplot of Disease index (Y) and Interval (X) is:

[Figure: scatterplot of Disease index (Y) against Interval (X)]

Note that:
1) there does appear to be a relationship: as the interval between plantings increases, the disease level in the subsequent crop decreases
2) the relationship looks more or less linear (Y = a + bX)
3) the points do NOT fall exactly on a straight line (the relationship is not purely deterministic)
4) observations with the same X-value have differing Y-values (not a perfect relationship; other variables may provide additional explanation)

From the graph the relationship looks linear, so it is appropriate to use Pearson’s correlation. Further, we expect the value of r to be negative.

If we were going to do the calculations by hand:

$\bar{x} = 7.10$, $s_x = 3.6953$, $\bar{y} = 2.99$, $s_y = 0.9097$

Field    X     Y      ZX         ZY        ZXZY
B        0     4.5    -1.921     1.659     -3.189
L        4     3.7    -0.838     0.780     -0.654
M        6     4      -0.297     1.110     -0.330
A        6     3      -0.297     0.010     -0.003
W        6     3.1    -0.297     0.120     -0.03
QQ       8     2.8    0.2435     -0.20     -0.05
R        9     1.9    0.5141     -1.19     -0.61
D        9     3      0.5141     0.010     0.01
DD       9     2.3    0.5141     -0.758    -0.39
T        14    1.6    1.8672     -1.527    -2.85
Sum      71
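The hand calculation for the pea data can be reproduced in a few lines. This is a minimal sketch, not the notes' own code: the variable names (`interval`, `index`, `fields`, `zx`, `zy`) are assumptions, and it relies on Python's standard `statistics.mean` and `statistics.stdev` (the latter uses the n - 1 divisor, matching the sample standard deviation used for r).

```python
import statistics

# Pea data from the table: years since last pea crop (X) and disease index (Y).
fields = ["B", "L", "M", "A", "W", "QQ", "R", "D", "DD", "T"]
interval = [0, 4, 6, 6, 6, 8, 9, 9, 9, 14]
index = [4.5, 3.7, 4.0, 3.0, 3.1, 2.8, 1.9, 3.0, 2.3, 1.6]

n = len(interval)
x_bar = statistics.mean(interval)
y_bar = statistics.mean(index)
s_x = statistics.stdev(interval)  # sample standard deviation (divisor n - 1)
s_y = statistics.stdev(index)

# Standardize each observation, then multiply the paired z-scores.
zx = [(x - x_bar) / s_x for x in interval]
zy = [(y - y_bar) / s_y for y in index]
products = [a * b for a, b in zip(zx, zy)]

for name, x, y, a, b, p in zip(fields, interval, index, zx, zy, products):
    print(f"{name:<3} {x:>3} {y:>4} {a:>8.3f} {b:>8.3f} {p:>8.3f}")

# r is the sum of the z-score products divided by n - 1.
r = sum(products) / (n - 1)
print(f"x_bar={x_bar:.2f} s_x={s_x:.4f} y_bar={y_bar:.2f} s_y={s_y:.4f} r={r:.3f}")
```

With these ten fields the result is r of about -0.90, a strong negative linear relationship, in line with the expectation from the scatterplot that r should be negative.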