Chapter 19
Comparing Two Numerical Response Populations: Independent Samples

Chapter 19 is very much like Chapter 15. The major and obvious difference is that in the earlier chapter the response was a dichotomy, but in this chapter the response is a number. If you revisit the material on the four types of studies in Section 15.2, you can see that the fact that the response was a dichotomy is irrelevant. In other words, everything you learned earlier about how the interpretation of an analysis depends on the type of study remains true in this chapter. In particular, for an observational study, if you conclude that the numerical populations differ, then you don't know, based on the statistical analysis, why they differ. On the other hand, for an experimental study, if you conclude that the numerical populations differ, then you may conclude that there is a causal link between the treatment and the response. In addition, the meaning of independent random samples for the different types of studies remains the same in the current chapter. There is even an extension of Simpson's Paradox for a numerical response, but time limitations will prevent me from covering this topic.

It is also true that Chapter 19 builds on the work of Chapters 17 and 18. In particular, recall that in Chapter 17 you learned that the population for a numerical response is a picture, and the kind of picture depends on whether the response is a count or a measurement.

19.1 Notation and Assumptions

The researcher has two populations of interest. The methods of Chapters 17 and 18 may be used to study the populations separately; in this chapter you will learn how to compare the populations. Population 1 has mean $\mu_1$, variance $\sigma_1^2$, and standard deviation $\sigma_1$. Population 2 has mean $\mu_2$, variance $\sigma_2^2$, and standard deviation $\sigma_2$. I realize that specifying both the variance and the standard deviation is redundant, but it will prove useful to have both for some of the formulas we develop. We will consider procedures that compare the populations by comparing their means.

We assume that we will observe $n_1$ i.i.d. random variables from population 1, denoted by $X_1, X_2, X_3, \ldots, X_{n_1}$. These will be summarized by their mean $\bar{X}$, variance $S_1^2$, and standard deviation $S_1$. The observed values of these various random variables are denoted by $x_1, x_2, x_3, \ldots, x_{n_1}$, $\bar{x}$, $s_1^2$, and $s_1$, respectively.

We assume that we will observe $n_2$ i.i.d. random variables from population 2, denoted by $Y_1, Y_2, Y_3, \ldots, Y_{n_2}$. These will be summarized by their mean $\bar{Y}$, variance $S_2^2$, and standard deviation $S_2$. The observed values of these various random variables are denoted by $y_1, y_2, y_3, \ldots, y_{n_2}$, $\bar{y}$, $s_2^2$, and $s_2$, respectively. We assume that the two samples are independent.

I apologize for the cumbersome and confusing notation. In particular, in my $\mu$'s, $\sigma^2$'s, $n$'s, $S^2$'s, and so on, I use a subscript to denote the population, either 1 or 2; this is very user friendly. You need to remember, however, that the random variables, data, and some summaries from population 1 are denoted by $X$'s and $x$'s, and the corresponding notions from population 2 are denoted by $Y$'s and $y$'s. There is a long tradition of doing things this way in introductory Statistics. While it is confusing, its one virtue is that it allows you to avoid double subscripts until you take a more advanced Statistics class.

Enrichment: Here is the problem with double subscripts (well, other than the obvious problem that they sound, and are, complicated). If I write $x_{123}$, does it mean observation number 123 from one source of data, observation 23 from population 1, or observation 3 from population 12? This could be made clear with commas: use $x_{1,23}$ for the second answer above and $x_{12,3}$ for the third answer. The only problem is that, in my experience, statisticians and mathematicians don't want to be bothered with commas.
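To make the notation of this section concrete, here is a minimal sketch in Python that computes the sample summaries defined above for two small samples. The data values are invented for illustration and are not from the chapter.

# Sample summaries for two independent samples, as defined in Section 19.1.
# The numbers below are hypothetical.
import statistics

x = [12.1, 9.8, 11.4, 10.7, 13.0]   # observed values x_1, ..., x_{n_1} from population 1
y = [8.9, 10.2, 9.5, 11.1, 10.8]    # observed values y_1, ..., y_{n_2} from population 2

n1, n2 = len(x), len(y)
x_bar, y_bar = statistics.mean(x), statistics.mean(y)          # x-bar and y-bar
s1_sq, s2_sq = statistics.variance(x), statistics.variance(y)  # sample variances s_1^2 and s_2^2
s1, s2 = s1_sq ** 0.5, s2_sq ** 0.5                            # sample standard deviations s_1 and s_2

print(n1, round(x_bar, 2), round(s1_sq, 2), round(s1, 2))
print(n2, round(y_bar, 2), round(s2_sq, 2), round(s2, 2))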
The methods introduced in this chapter involve comparing the populations by comparing their means. For tests of hypotheses, this translates to the null hypothesis being $H_0: \mu_1 = \mu_2$, or equivalently, $H_0: \mu_1 - \mu_2 = 0$. For estimation, $\mu_1 - \mu_2$ is the feature that will be estimated with confidence. Our point estimator of $\mu_1 - \mu_2$ is $\bar{X} - \bar{Y}$.

There is a Central Limit Theorem for this problem, just as there was in Chapter 17. First, it shows us how to standardize our estimator:

$$W = \frac{(\bar{X} - \bar{Y}) - (\mu_1 - \mu_2)}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}} \qquad (19.1)$$

Second, it states that we can approximate probabilities for $W$ by using the N(0,1) curve, and that in the limit, as both sample sizes become larger and larger, the approximations are accurate.

In order to obtain formulas for estimation and testing, we need to eliminate the unknown parameters in the denominator of $W$, namely $\sigma_1^2$ and $\sigma_2^2$. We also will need to decide what to use for our reference curve: the N(0,1) curve of the Central Limit Theorem and Slutsky, or one of the t curves of Gosset. Statisticians suggest three methods for handling these two issues, which I refer to as Cases 1, 2, and 3. I won't actually show you Case 3 because I believe that it is nearly worthless to a scientist; I will explain why I feel this way. We will begin with Case 1. I will follow the popular terminology and call this the large sample approximate method.

19.2 Case 1: The Slutsky (Large Sample Approximate) Method

This method comes from Slutsky's Theorem. In Equation 19.1 for $W$, replace each population variance by its sample variance. The resultant random variable is

$$W_1 = \frac{(\bar{X} - \bar{Y}) - (\mu_1 - \mu_2)}{\sqrt{S_1^2/n_1 + S_2^2/n_2}} \qquad (19.2)$$

Note that I have placed the subscript 1 on $W$ to remind you that this is for Case 1. It can be shown that, in the limit as both sample sizes grow without bound, the N(0,1) pdf provides accurate probabilities for $W_1$. Thus, for finite values of $n_1$ and $n_2$, the N(0,1) pdf will be used to obtain approximate probabilities for $W_1$. As a general guideline, I recommend using Case 1 only if $n_1 \geq 30$ and $n_2 \geq 30$.

The usual algebraic manipulation of the ratio, that is, of $W_1$, yields the following result.

Result 19.1 (Slutsky's approximate confidence interval estimate of $\mu_1 - \mu_2$). With the notation and assumptions given in Section 19.1, Slutsky's approximate confidence interval estimate of $\mu_1 - \mu_2$ is

$$(\bar{x} - \bar{y}) \pm z \sqrt{s_1^2/n_1 + s_2^2/n_2} \qquad (19.3)$$

As always in these intervals, the value of $z$ is determined by the desired confidence level and can be found in Table 12.1 on page 296.

Before I give you an example of the use of Formula 19.3, I will tell you about the test of hypotheses for this section. As I stated earlier in this chapter, the null hypothesis is $H_0: \mu_1 = \mu_2$, or equivalently, $H_0: \mu_1 - \mu_2 = 0$. There are three options for the alternative: $H_1: \mu_1 > \mu_2$, $H_1: \mu_1 < \mu_2$, or $H_1: \mu_1 \neq \mu_2$. I will abbreviate these as $>$, $<$, and $\neq$; no confusion should result. …
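Here is a minimal sketch, with made-up summary values, of how Formula 19.3 and the observed value of $W_1$ might be computed in Python. Nothing below comes from the chapter itself; the approximate P-value for the two-sided alternative uses the N(0,1) curve, as described above for Case 1.

# Slutsky's approximate 95% confidence interval (Formula 19.3) and the
# Case 1 test statistic; the summary values below are hypothetical.
from math import sqrt
from statistics import NormalDist

n1, x_bar, s1 = 40, 23.6, 5.1   # sample size, mean, and standard deviation for population 1
n2, y_bar, s2 = 36, 21.2, 4.7   # sample size, mean, and standard deviation for population 2

se = sqrt(s1**2 / n1 + s2**2 / n2)   # estimated standard error of (X-bar minus Y-bar)
z = 1.96                             # multiplier for 95% confidence

lower = (x_bar - y_bar) - z * se
upper = (x_bar - y_bar) + z * se
print(f"Approximate 95% CI for mu_1 - mu_2: [{lower:.2f}, {upper:.2f}]")

# Observed value of W_1 under the null hypothesis mu_1 - mu_2 = 0, with an
# approximate P-value for the two-sided alternative from the N(0,1) curve.
w1 = (x_bar - y_bar) / se
p_value = 2 * (1 - NormalDist().cdf(abs(w1)))
print(f"w_1 = {w1:.2f}, approximate two-sided P-value = {p_value:.4f}")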