Two Independent Samples

Bret Larget
Departments of Botany and of Statistics
University of Wisconsin-Madison
Statistics 371
25th October 2005

Comparing Two Groups

- Chapter 7 describes two ways to compare two populations on the basis of independent samples: a confidence interval for the difference in population means and a hypothesis test.
- The basic structure of the confidence interval is the same as in the previous chapter: an estimate plus or minus a multiple of a standard error.
- Hypothesis testing will introduce several new concepts.

Setting

- Model two populations as buckets of numbered balls.
- The population means are $\mu_1$ and $\mu_2$, respectively.
- The population standard deviations are $\sigma_1$ and $\sigma_2$, respectively.
- We are interested in estimating $\mu_1 - \mu_2$ and in testing the hypothesis that $\mu_1 = \mu_2$.

[Diagram: population 1 (mean $\mu_1$, sd $\sigma_1$) yields the sample $y_1^{(1)}, \ldots, y_{n_1}^{(1)}$ with sample mean $\bar{y}_1$ and sample sd $s_1$; population 2 (mean $\mu_2$, sd $\sigma_2$) yields the sample $y_1^{(2)}, \ldots, y_{n_2}^{(2)}$ with sample mean $\bar{y}_2$ and sample sd $s_2$.]

Standard Error of $\bar{y}_1 - \bar{y}_2$

- The standard error of the difference in two sample means is an empirical measure of how far the difference in sample means will typically be from the difference in the respective population means.

$$\mathrm{SE}(\bar{y}_1 - \bar{y}_2) = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$

- An alternative formula is

$$\mathrm{SE}(\bar{y}_1 - \bar{y}_2) = \sqrt{\mathrm{SE}(\bar{y}_1)^2 + \mathrm{SE}(\bar{y}_2)^2}$$

- This formula reminds us of how to find the length of the hypotenuse of a triangle.
- Variances add, but standard deviations don't.

Pooled Standard Error

- If we wish to assume that the two population standard deviations are equal, $\sigma_1 = \sigma_2$, then it makes sense to use data from both samples to estimate the common population standard deviation.
- We estimate the common population variance with a weighted average of the sample variances, weighted by the degrees of freedom.

$$s_{\text{pooled}}^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$

- The pooled standard error is then as below.

$$\mathrm{SE}_{\text{pooled}} = s_{\text{pooled}} \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$

Sampling Distributions

- The sampling distribution of the difference in sample means has these characteristics.
- Mean: $\mu_1 - \mu_2$
- SD: $\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}$
- Shape: Exactly normal if both populations are normal; approximately normal if the populations are not normal but both sample sizes are sufficiently large.

Theory for Confidence Interval

- The recipe for constructing a confidence interval for a single population mean is based on facts about the sampling distribution of the statistic

$$T = \frac{\bar{Y} - \mu}{\mathrm{SE}(\bar{Y})}$$

- Similarly, the theory for confidence intervals for $\mu_1 - \mu_2$ is based on the sampling distribution of the statistic

$$T = \frac{(\bar{Y}_1 - \bar{Y}_2) - (\mu_1 - \mu_2)}{\mathrm{SE}(\bar{Y}_1 - \bar{Y}_2)}$$

  where we standardize by subtracting the mean and dividing by the standard deviation of the sampling distribution.

Theory (cont.)

- If both populations are normal and if we know the population standard deviations, then

$$P\left(-1.96 \le \frac{(\bar{Y}_1 - \bar{Y}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} \le 1.96\right) = 0.95$$

  where we can choose $z$ other than 1.96 for different confidence levels.
- This statement is true because the expression in the middle has a standard normal distribution.

Theory (cont.)

- But in practice, we don't know the population standard deviations. If we substitute in sample estimates instead, we get this.

$$P\left(-t \le \frac{(\bar{Y}_1 - \bar{Y}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} \le t\right) = 0.95$$

- We need to choose different end points to account for the additional randomness in the denominator.
- It turns out that the sampling distribution of the statistic above is approximately a t distribution, where the degrees of freedom should be estimated from the data as well.

Theory (cont.)

- Algebraic manipulation leads to the following expression.

$$P\left((\bar{Y}_1 - \bar{Y}_2) - t\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \le \mu_1 - \mu_2 \le (\bar{Y}_1 - \bar{Y}_2) + t\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}\right) = 0.95$$

- We use a t multiplier so that the area between $-t$ and $t$ under a t distribution with the estimated degrees of freedom will be 0.95.

Confidence Interval for $\mu_1 - \mu_2$

- The confidence interval for differences in population means has the same structure as that for a single population mean.

$$(\text{Estimate}) \pm (t\ \text{Multiplier}) \times \mathrm{SE}$$

- The only difference is that for this more complicated setting, we have more complicated formulas for the standard error and the degrees of freedom.
- Here is the df formula.

$$\mathrm{df} = \frac{(\mathrm{SE}_1^2 + \mathrm{SE}_2^2)^2}{\mathrm{SE}_1^4/(n_1 - 1) + \mathrm{SE}_2^4/(n_2 - 1)}$$

  where $\mathrm{SE}_i = s_i/\sqrt{n_i}$ for $i = 1, 2$.
- As a check, the value is often close to $n_1 + n_2 - 2$.
- This will be exact if $s_1 = s_2$ and if $n_1 = n_2$.
- The value from the messy formula will always be between the smaller of $n_1 - 1$ and $n_2 - 1$ and $n_1 + n_2 - 2$.

Example

Exercise 7.12

- In this example, subjects with high blood pressure are randomly allocated to two treatments.
- The biofeedback group receives relaxation training aided by biofeedback and meditation over eight weeks.
- The control group does not.
- Reduction in systolic blood pressure is tabulated here.

                Biofeedback   Control
    n                    99        93
    mean (ȳ)           13.8       4.0
    SE                 1.34      1.30

Example (cont.)

- For 190 degrees of freedom (which come from both the simple and messy formulas), the table says to use 1.977 (190 is rounded down to the tabled value of 140), whereas with R you find 1.973.
- A calculator or R can compute the margin of error.

    > se = sqrt(1.34^2 + 1.3^2)
    > tmult = qt(0.975, 190)
    > me = round(tmult * se, 1)
    > se
    [1] 1.866976
    > tmult
    [1] 1.972528
    > me
    [1] 3.7

- We are 95% confident that the mean reduction in systolic blood pressure due to the biofeedback treatment in a population of similar individuals to those in this study would be between 6.1 and 13.5 mm more than the mean reduction in the same population undergoing the control treatment.

Example Using R

Exercise 7.21

- This exercise examines the growth of bean plants under red and green light.
- Read in the data.

    > ex7.21 = read.table("lights.txt", header = T)

- Examine the structure of the data.

    > str(ex7.21)
    'data.frame':   42 obs. of  2 variables:
     $ height: num  8.4 8.4 10 8.8 7.1 9.4 8.8 4.3 9 8.4 ...
     $ color : Factor w/ 2 levels "green","red": 2 2 2 2 2 2 2 2 2 2 ...

Example (cont.)

- Examine side-by-side boxplots.

    > attach(ex7.21)
    > boxplot(split(height, color))

[Figure: side-by-side boxplots of height for the green and red groups.]

Example (cont.)

- Carry out the t test.

    > t.test(height ~ color)

            Welch Two Sample t-test

    data:  height by color
    t = 1.1432, df = 38.019, p-value = 0.2601
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval:
     -0.4479687  1.6103216
    sample estimates:
    mean in group green   mean in group red
               8.940000            8.358824

Example: Assuming Equal Variances

- For the same data, were we to assume that the population variances were equal, the degrees of freedom, the standard error, and the confidence interval are all …
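
As a cross-check on the degrees-of-freedom formula and the Exercise 7.12 interval, the calculation can be reproduced from the summary statistics alone. The following is a minimal sketch in base R. The helper name `welch_ci_from_se` and its argument names are hypothetical, introduced only for illustration, and the inputs are the values from the Exercise 7.12 table.

```r
# Sketch: confidence interval for mu1 - mu2 from summary statistics,
# using the unpooled standard error and the estimated degrees of freedom.
# welch_ci_from_se() is a hypothetical helper, not a function from any package.
welch_ci_from_se <- function(ybar1, ybar2, se1, se2, n1, n2, conf = 0.95) {
  # Standard error of the difference: squared SEs add, like a hypotenuse
  se <- sqrt(se1^2 + se2^2)
  # Estimated degrees of freedom (the "messy" formula)
  df <- (se1^2 + se2^2)^2 / (se1^4 / (n1 - 1) + se2^4 / (n2 - 1))
  # t multiplier leaving area (1 - conf) / 2 in each tail
  tmult <- qt(1 - (1 - conf) / 2, df)
  est <- ybar1 - ybar2
  list(estimate = est, se = se, df = df, tmult = tmult,
       lower = est - tmult * se, upper = est + tmult * se)
}

# Exercise 7.12 summary values (biofeedback vs. control)
res <- welch_ci_from_se(ybar1 = 13.8, ybar2 = 4.0,
                        se1 = 1.34, se2 = 1.30, n1 = 99, n2 = 93)
print(round(unlist(res), 3))
```

The estimated degrees of freedom come out very close to the simple value $n_1 + n_2 - 2 = 190$, which is why the simple and messy formulas agree in this example.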
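
To see how the pooled standard error compares with the unpooled one, here is a small sketch in base R. It does not use the Exercise 7.21 data, whose equal-variance results are cut off above. Instead it assumes the Exercise 7.12 summary values, recovering each sample standard deviation from $s_i = \mathrm{SE}_i \sqrt{n_i}$. The helper names `pooled_se` and `unpooled_se` are hypothetical.

```r
# Sketch: pooled vs. unpooled standard error of the difference in sample means.
# pooled_se() and unpooled_se() are hypothetical helpers for illustration only.
pooled_se <- function(s1, s2, n1, n2) {
  # Weighted average of sample variances, weighted by degrees of freedom
  var_pooled <- ((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2)
  sqrt(var_pooled) * sqrt(1 / n1 + 1 / n2)
}

unpooled_se <- function(s1, s2, n1, n2) {
  sqrt(s1^2 / n1 + s2^2 / n2)
}

# Assumption: sample SDs recovered from the Exercise 7.12 table
n1 <- 99
n2 <- 93
s1 <- 1.34 * sqrt(n1)
s2 <- 1.30 * sqrt(n2)

se_pooled <- pooled_se(s1, s2, n1, n2)
se_unpooled <- unpooled_se(s1, s2, n1, n2)
print(c(pooled = se_pooled, unpooled = se_unpooled))

# With equal sample sizes the two formulas give the same value
equal_n_gap <- pooled_se(3, 5, 10, 10) - unpooled_se(3, 5, 10, 10)
print(equal_n_gap)
```

With sample sizes and sample standard deviations this similar, the two standard errors differ only slightly, so the choice between them matters little here.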