22S:166
Introduction to the Bootstrap
Lecture 8
September 19, 2011
Kate Cowles
374 SH, [email protected]

• Efron, B. (1982) The Jackknife, the Bootstrap, and Other Resampling Plans. Number 38 in CBMS-NSF Regional Conference Series in Applied Mathematics. Philadelphia: SIAM.
• Efron, B. and Tibshirani, R.J. (1993) An Introduction to the Bootstrap. New York: Chapman & Hall.
• Davison, A.C. and Hinkley, D.V. (1997) Bootstrap Methods and their Application. New York: Cambridge University Press.
• materials listed under Web Resources

Review concepts

• suppose we have one sample of n data values: $y_1, \ldots, y_n$
• sample values are considered outcomes of i.i.d. random variables $Y_1, \ldots, Y_n$
• probability density function (pdf) or probability mass function (pmf) $f$
• cumulative distribution function (cdf) $F$
• sample will be used to make inference
  – about population characteristic $\theta$
  – using statistic $T$ whose value in the sample is $t$
• questions of interest regarding $T$
  – bias?
  – standard error?
  – quantiles?
  – how to compute confidence limits for $\theta$?
  – likely values under a null hypothesis of interest?

Two classes of statistical methods

• parametric
  – particular mathematical model for the behavior of the random variables $Y_j$
  – pdf or pmf $f$ is completely determined by the values of unknown parameters $\psi$
  – the quantity of interest in the statistical analysis, $\theta$, is a component or function of $\psi$
• nonparametric
  – uses only the fact that the $Y_j$ are i.i.d.
  – no mathematical model for their distribution
  – (a nonparametric analysis may be useful even if a reasonable parametric model exists)
    ∗ to assess sensitivity of conclusions to the assumptions of the parametric model

The empirical distribution

• puts probability mass $1/n$ at each sample value $y_j$
• empirical distribution function (edf) or $\hat{F}$
  – nonparametric mle of $F$
  – sample proportion $\hat{F}(y) = \dfrac{\#\{y_j \le y\}}{n}$
    ∗ where $\#$ denotes the number of items in a set
• the edf plays the role of the fitted model when no mathematical form is assumed for $F$

Example of edf

> library(QRMlib)
> help(edf)
> data <- sort(rnorm(100))
> plot(data, edf(data), type = "s")
> qs <- seq(-2.5, 2.5, by = 0.005)
> lines(qs, pnorm(qs), lty = 2)

Example for the nonparametric bootstrap: City population data

• for each of n = 49 U.S. cities, two data values
  – $u_j$ = population in 1920 (in 1000s)
  – $x_j$ = population in 1930 (in 1000s)
• population of interest is all U.S. cities
• the 49 cities are assumed to be a simple random sample from this population
• define $(U, X)$ as the pair of population values for a randomly selected city
• then if we knew $\theta = \dfrac{E(X)}{E(U)}$ and the total 1920 population for the U.S., we could estimate the total 1930 population of the U.S.
• want to estimate $\theta$ without assuming any parametric model for $X$ and $U$
• sample-based statistic is $T = \bar{X} / \bar{U}$
• observations 1 to 10 of this dataset are included with the boot package for R

> library(boot)
> data(city)
> city
     u   x
1  138 143
2   93 104
3   61  69
4  179 260
5   48  75
6   37  63
7   29  50
8   23  48
9   30 111
10   2  50

The non-parametric bootstrap

• goal: to get an idea of the sampling distribution of the statistic $T$ under repeated sampling from the population of interest
• basic idea: our sample data gives us all the information we have about the whole population
• steps:
  1. calculate the statistic of interest (call it $\hat{\theta}$) from the dataset as a whole
  2. fit the edf $\hat{F}$
  3. draw a "bootstrap sample" from $\hat{F}$ and calculate the statistic of interest on the bootstrap sample
     – i.e., draw a sample of size n from the original dataset with replacement
     – $Y^*_1, Y^*_2, \ldots, Y^*_n \sim \hat{F}$
     – $\hat{\theta}^* = \hat{\theta}(Y^*_1, Y^*_2, \ldots, Y^*_n)$
  4. repeat step 3 independently a large number $B$ of times, obtaining bootstrap replications $\hat{\theta}^*_1, \hat{\theta}^*_2, \ldots, \hat{\theta}^*_B$
  5. use the bootstrap replications to:
     – estimate the standard error of $\hat{\theta}$
     – estimate bias
     – obtain a confidence interval

Using the R sample function to draw bootstrap samples

sample                  package:base                  R Documentation

Random Samples and Permutations

Description:
     'sample' takes a sample of the specified size from the
     elements of 'x' using either with or without replacement.

Usage:
     sample(x, size, replace = FALSE, prob = NULL)

Arguments:
       x: Either a (numeric, complex, character or logical)
          vector of more than one element from which to choose,
          or a positive integer.
    size: non-negative integer giving the number of items to
          choose.
 replace: Should sampling be with replacement?
    prob: A vector of probability weights for obtaining the
          elements of the vector being sampled.

> x <- seq(1:25)
> x
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
> sample(x, 25)
[1] 2 20 3 9 6 8 15 10 23 1 19 25 12 21 14 4 13 24 17 5 11 18 7 22 16
> sample(x, 25, replace = TRUE)
[1] 4 6 16 11 21 17 6 12 5 8 15 19 23 16 15 20 18 19 21 5 25 7 8 20 3

> mindex <- sample(1:10, replace = TRUE)
> mindex
[1] 4 9 1 9 10 9 6 5 3 3
> city[mindex, ]
      u   x
4   179 260
9    30 111
1   138 143
9.1  30 111
10    2  50
9.2  30 111
6    37  63
5    48  75
3    61  69
3.1  61  69

Bias correction using the bootstrap

• notation
  – $\theta$: true and unknown population quantity value
  – $\hat{\theta}$: estimate of $\theta$ based on the sample data
  – $\hat{\theta}^*_b$: estimate of $\theta$ from the b-th bootstrap sample

Bias correction continued

• So in a sense:
  – the $\hat{\theta}^*$s are to $\hat{\theta}$ as $\hat{\theta}$ is to $\theta$
• bootstrap estimate of bias
  – Note: bias $= E_F(\hat{\theta} - \theta)$

  $\widehat{\mathrm{bias}}_{\mathrm{boot}} = \dfrac{1}{B} \sum_{b=1}^{B} \hat{\theta}^*_b - \hat{\theta} = \hat{\theta}^*_{\cdot} - \hat{\theta}$

• So the bias-corrected point estimate is

  $\tilde{\theta} = \hat{\theta} - \left(\hat{\theta}^*_{\cdot} - \hat{\theta}\right) = 2\hat{\theta} - \hat{\theta}^*_{\cdot}$

R code for the City Data

> library(boot)
> help(boot, package="boot")
------------------------------------------------------------------------------
boot                    package:boot                  R Documentation

Bootstrap Resampling

Description:
     Generate 'R' bootstrap replicates of a statistic applied to data.
     Both parametric and nonparametric resampling are possible. For
     the nonparametric bootstrap, possible resampling methods are the
     ordinary bootstrap, the balanced bootstrap, antithetic
     resampling, and permutation. For nonparametric multi-sample
     problems stratified resampling is used. This is specified by
     including a vector of strata in the call to boot. Importance
     resampling weights may be specified.

Usage:
     boot(data, statistic, R, sim="ordinary", stype="i",
          strata=rep(1,n), L=NULL, m=0, weights=NULL,
          ran.gen=function(d, p) d, mle=NULL, ...)

Arguments:
     data: The data as a vector, matrix or data frame. If it is a
           matrix or data frame then each row is considered as one
           multivariate observation.
statistic: A function which when applied to data returns a vector
           containing the
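
As a minimal sketch of steps 1 to 5 of the non-parametric bootstrap and of the bias-correction formula, the base R code below bootstraps the ratio of means for the ten city rows listed earlier. It uses only `sample`, not the `boot` function. The helper name `ratio`, the seed, the choice B = 2000, and the percentile interval are illustrative assumptions and do not come from the lecture.

```r
# First 10 cities, copied from the listing above
# (u = 1920 population, x = 1930 population, both in 1000s)
city <- data.frame(
  u = c(138, 93, 61, 179, 48, 37, 29, 23, 30, 2),
  x = c(143, 104, 69, 260, 75, 63, 50, 48, 111, 50)
)

# Statistic of interest: ratio of sample means, T = mean(X) / mean(U)
# ('ratio' is a hypothetical helper name used only in this sketch)
ratio <- function(d) mean(d$x) / mean(d$u)

# Step 1: estimate from the dataset as a whole
theta_hat <- ratio(city)

set.seed(166)   # arbitrary seed, only so the run is reproducible
B <- 2000       # number of bootstrap replications (illustrative choice)
n <- nrow(city)

# Steps 2-4: drawing rows with replacement is sampling from the edf;
# recompute the statistic on each bootstrap sample, B times
theta_star <- replicate(B, {
  idx <- sample(1:n, size = n, replace = TRUE)
  ratio(city[idx, ])
})

# Step 5: summaries of the bootstrap replications
se_boot   <- sd(theta_star)                    # bootstrap standard error
bias_boot <- mean(theta_star) - theta_hat      # bootstrap estimate of bias
theta_bc  <- 2 * theta_hat - mean(theta_star)  # bias-corrected estimate

# One simple interval (percentile method); an assumption of this sketch,
# the lecture text above does not say which interval it uses
ci_pct <- quantile(theta_star, c(0.025, 0.975))

print(c(theta_hat = theta_hat, se = se_boot,
        bias = bias_boot, corrected = theta_bc))
print(ci_pct)
```

With only ten observations the replications are quite variable, so the standard error and bias estimates change noticeably from seed to seed; a larger B reduces the Monte Carlo noise but not the sampling variability of the original data.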