UW-Madison STAT 411 - ex-regr-wt

Nordheim                                            Statistics 411, Spring 2015

Illustration of Possible Issues with Sampling Weights

Suppose that there are two groups of individuals, with 20% belonging to group A and 80% belonging to group B. In many cases there is a desire to "oversample" individuals in group A so that the sample size for this group is "large enough" to meet certain criteria for estimation precision. Thus, imagine taking a stratified sample with a desired sample size in the range of 50 to 60. Proportional allocation could lead to a sample of 10 from group A and 40 from group B. Suppose it is decided that group A will be "oversampled" and 20 individuals will be selected from group A. If the population totals are 1000 for group A and 4000 for group B, the "weight" for each sampled person in group A is 50 (= 1000/20) and in group B is 100 (= 4000/40). Suppose that values are recorded for both an "X" and a "Y" variable.

Group A
xa: 22.8 27.7 23.6 18.7 22.1 25.1 21.5 10.0 25.8 17.9 13.6 19.8 25.0 16.3 17.6 15.4 25.7 19.1 22.8 15.3
ya: 110 130 119 94 108 117 106 59 125 94 74 98 122 83 94 84 125 94 112 80

Group B
xb: 13.2 16.9 23.8 26.7 26.8 23.2 20.1 23.2 12.1 21.2 23.1 18.4 8.8 16.2 24.5 16.0 18.7 21.9 23.0 19.8
yb: 27 24 33 34 37 34 28 33 22 35 34 26 20 24 35 31 23 32 34 28
xb: 22.4 11.4 21.8 28.9 23.2 25.0 31.0 11.6 22.7 13.1 24.5 21.4 22.1 15.5 17.8 27.7 16.5 23.4 16.8 15.7
yb: 35 21 34 39 34 37 39 18 33 19 34 32 32 29 27 38 26 29 26 27

Suppose we wish to look at a regression line of y on x. Imagine placing these data into a single data file of length 60. Thus, using R notation, xx=c(xa,xb) and yy=c(ya,yb). Then,

> out1=lm(yy~xx)
> summary(out1)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  15.3368    18.9873   0.808   0.4225
xx            1.9005     0.9101   2.088   0.0412 *
---
Residual standard error: 34.94 on 58 degrees of freedom
Multiple R-squared: 0.06992,   Adjusted R-squared: 0.05389
F-statistic: 4.36 on 1 and 58 DF,  p-value: 0.04118

However, this does not take the weights into account. One way to take the weights into account is to use each observation from group B twice. Thus, xxx=c(xa,xb,xb) and yyy=c(ya,yb,yb), and we regress yyy on xxx. (This will not give us the correct standard errors, but it will give us the correct parameter estimates.)

> out2=lm(yyy~xxx)
> summary(out2)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  13.3675    12.2334   1.093   0.2772
xxx           1.5288     0.5861   2.609   0.0105 *
---
Residual standard error: 29.46 on 98 degrees of freedom
Multiple R-squared: 0.06493,   Adjusted R-squared: 0.05539
F-statistic: 6.805 on 1 and 98 DF,  p-value: 0.01052

We note that the slopes are quite different. Thus, "weights" can make a big difference in the fitted model. In this case, neither of these fits is a good one.
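The same parameter estimates as in "out2" can also be obtained from the original 60 observations by passing the sampling weights to lm() through its weights argument: weighted least squares with weights proportional to 1 for group A and 2 for group B solves the same normal equations as the duplicated data set. The following sketch is not part of the original notes; the vector name "ww" is introduced here for illustration, and the standard errors it reports are, as with the duplicated data, not the correct design-based ones.

# weights as described above: 50 for each group A observation, 100 for group B
ww = c(rep(50, 20), rep(100, 40))
out2w = lm(yy ~ xx, weights = ww)   # weighted least squares on the 60 rows
coef(out2w)                         # should match the estimates in out2
# Note: summary(out2w) still does not give design-based standard errors.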
Let us consider constructing an indicator variable, "ind", that takes on the value "1" if the observation comes from group A and "0" if it comes from group B. Thus, for the "original" data set with 60 observations, "ind1" will have 20 "1"s and 40 "0"s. For the "augmented" data set with 100 observations, "ind2" will have 20 "1"s and 80 "0"s. Then,

> out3=lm(yy~xx+ind1+xx*ind1)
> summary(out3)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  9.78630    1.47554   6.632  1.4e-08 ***
xx           1.00179    0.07063  14.184  < 2e-16 ***
ind1         9.11531    2.75345   3.311  0.00163 **
xx:ind1      3.06418    0.13220  23.178  < 2e-16 ***
---
Residual standard error: 2.292 on 56 degrees of freedom
Multiple R-squared: 0.9961,   Adjusted R-squared: 0.9959
F-statistic: 4812 on 3 and 56 DF,  p-value: < 2.2e-16

> out4=lm(yyy~xxx+ind2+xxx*ind2)
> summary(out4)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  9.78630    1.04295   9.383 3.15e-15 ***
xxx          1.00179    0.04992  20.068  < 2e-16 ***
ind2         9.11531    2.54710   3.579 0.000544 ***
xxx:ind2     3.06418    0.12236  25.043  < 2e-16 ***
---
Residual standard error: 2.291 on 96 degrees of freedom
Multiple R-squared: 0.9945,   Adjusted R-squared: 0.9943
F-statistic: 5744 on 3 and 96 DF,  p-value: < 2.2e-16

Now the regression estimates are exactly the same. (The standard errors in "out3" are the correct ones, since there really are only 60 observations.) The take-home message is that in analyzing data from a sampling study that uses different weights, one needs to be very careful in performing further analysis. In this case, since there are only two groups with different weights, it is fairly straightforward to perform "a reasonable analysis". However, in studies with multiple weights, proper analysis can be very challenging.
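When there are many strata and weights, a design-based analysis is typically carried out with software that carries the weights through both the estimates and the standard errors. As one possible illustration (not part of the original notes, and using hypothetical object names), the survey package in R could be used roughly as follows for the 60-observation data set:

# a sketch of a design-based fit with the survey package (an assumption;
# the original notes do not use this package)
library(survey)
ind1 = c(rep(1, 20), rep(0, 40))       # indicator variable described above
ww   = c(rep(50, 20), rep(100, 40))    # sampling weights from above
dat  = data.frame(xx, yy, ind1, ww)
des  = svydesign(ids = ~1, strata = ~ind1, weights = ~ww, data = dat)
out5 = svyglm(yy ~ xx + ind1 + xx:ind1, design = des)
summary(out5)   # weighted estimates with design-based standard errors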

