MIT OpenCourseWare
http://ocw.mit.edu

18.443 Statistics for Applications, Spring 2009

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

18.443 MEAN-SQUARE ERRORS OF ESTIMATORS: BIAS, VARIANCE, AND INFORMATION INEQUALITIES

Suppose we have a parametric family of probability distributions with a likelihood function f(x, θ) for one observation, where f(x, θ) is a probability mass function for a discrete distribution or a probability density function for a continuous distribution. Let Eθ denote expectation, and Pθ probability, when θ is the true value of the parameter. Let X = (X1, ..., Xn) be a vector of i.i.d. observations with distribution Pθ. Suppose g = g(θ) is a real-valued function of the parameter θ. One criterion for choosing an estimator T = T(X) of g(θ) is to minimize the mean-squared error (MSE) Eθ((T(X) − g(θ))²).

Recall that T is called an unbiased estimator of g(θ) if EθT(X) = g(θ) for all θ. More generally, the bias of T as an estimator of g(θ) is defined by bT(θ) := bT,g(θ) := EθT − g(θ) for all θ. Thus T is unbiased as an estimator of g(θ) if and only if bT(θ) = 0 for all θ. If Eθ(T²) < +∞ for all θ, let Varθ(T) be the variance of T for the given θ, which equals Eθ(T²) − (EθT)². The MSE equals the variance plus the bias squared, as follows:

Theorem. For any statistic T(X) such that Eθ(T²) < ∞ for all θ and any real-valued function g(θ), the mean-square error of T as an estimator of g is given by

    Eθ((T(X) − g(θ))²) = Varθ(T) + bT(θ)².

Proof. Let h(θ) := EθT. Then we have

    Eθ((T(X) − g(θ))²) = Eθ((T(X) − h(θ) + h(θ) − g(θ))²)
                       = Varθ(T) + 2 bT(θ) Eθ(T(X) − h(θ)) + bT(θ)²
                       = Varθ(T) + bT(θ)²,

where the cross term is 0 because, for given θ, bT(θ) is a constant and Eθ(T(X) − h(θ)) = 0, so the proof is complete. Q.E.D. (A small numerical illustration of this decomposition appears below, after the discussion of Bayes estimators.)

In a classical approach, say in research from the 1930's through the mid-1950's and still in many textbooks, one looked at unbiased estimators, so that b(θ) ≡ 0, and then tried to minimize the variance. A lower bound for the variance of unbiased estimators, the so-called information inequality, or Cramér-Rao inequality (Rice, Section 8.7, Theorem A, and later in this handout), proved in the 1940's, was considered one of the main theorems of mathematical statistics.

An estimator T(X) is called inadmissible as an estimator of g(θ), for mean-squared error, if there is another estimator U(X) such that Eθ[(U(X) − g(θ))²] ≤ Eθ[(T(X) − g(θ))²] for all θ, where the inequality becomes strict, with ≤ replaced by <, for some θ. If there is no such U then T is called admissible. Let's call T(X) strongly inadmissible if we add to the definition that Eθ[(U(X) − g(θ))²] < Eθ[(T(X) − g(θ))²] for all θ in a non-empty open set V, namely, a set such that for some θ0 in V and r > 0, also θ is in V for all θ such that |θ − θ0| < r. In one dimension this would just say that V includes a non-degenerate interval.

If π is a prior density with π(θ) > 0 for all θ, and T is a Bayes estimator for g(θ), namely the integral of g(θ) times the posterior density πX(θ), then T cannot be strongly inadmissible, or there would be an estimator with smaller overall risk (integrating mean-square error times π(θ)), contradicting the Bayes property of T, as shown in lecture Monday 3/10 (I looked but so far have not found this fact in Rice).
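Before turning to the multidimensional examples, here is a minimal simulation sketch of the theorem above (MSE = variance + bias²), assuming numpy is available. It uses the variance estimator with divisor n, discussed further below, as T(X); the sample size, true variance, and replication count are arbitrary illustrative choices, not anything prescribed in these notes.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 10           # sample size (arbitrary illustrative choice)
sigma2 = 4.0     # true variance, the g(theta) being estimated
reps = 200_000   # Monte Carlo replications

# T(X): variance estimator with divisor n (biased), applied to normal samples
samples = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(reps, n))
T = samples.var(axis=1, ddof=0)

bias = T.mean() - sigma2            # Monte Carlo estimate of bT(theta)
var = T.var()                       # Monte Carlo estimate of Var_theta(T)
mse = np.mean((T - sigma2) ** 2)    # Monte Carlo estimate of the MSE

print(f"bias         = {bias: .4f}   (exact value -sigma2/n = {-sigma2 / n: .4f})")
print(f"variance     = {var: .4f}")
print(f"var + bias^2 = {var + bias**2: .4f}")
print(f"MSE          = {mse: .4f}   (agrees with var + bias^2)")
```

The last two printed numbers agree up to rounding, as the theorem guarantees for any estimator with Eθ(T²) < ∞.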
Suppose we have a normal distribution on d-dimensional space where the coordinates xj are normal and independent with means µj and variance 1. The analogue of squared difference is squared Euclidean distance

    |x − y|² = Σ_{j=1}^d (xj − yj)²,

so that for the mean vector µ = (µ1, ..., µd), and an estimator T(X) of it, also with d-dimensional values, we're aiming to minimize Eµ(|T(X) − µ|²). A surprising discovery by Charles Stein in 1956 was that although the sample mean X̄ is an admissible estimator of the mean vector µ for d = 1 or 2, it is not for d = 3 or larger; biased estimators can do better. Details are given in the 18.466 OCW notes, Section 2.7 (a small simulation sketch illustrating this follows the Cauchy-Schwarz discussion below).

Yatracos (2005) considered the sample variance for 1-dimensional data. Let the sample variance be defined as

    cn Σ_{j=1}^n (Xj − X̄)²,

where we know that cn = 1/(n − 1) gives an unbiased estimator of the variance whenever it is finite, whereas cn = 1/n gives the maximum likelihood estimate for normal distributions and the statistic used in method-of-moments estimation. Yatracos proved the following fact: let X1, ..., Xn be i.i.d. with any distribution such that E(X1⁴) < ∞ and the Xj are not constant, and in a family such that for any c with 0 < c < ∞, the distribution of cX1 is also in the family. Then the classical sample variance with cn = 1/(n − 1) is inadmissible as an estimator of the true variance. An estimator with smaller mean-squared error is obtained by taking

    cn = (n + 2)/(n(n + 1)).

Of course, the resulting estimator has a non-zero bias, but the bias becomes very small as n becomes large and the reduction in variance is enough to make the total MSE smaller (a second sketch below compares these choices of cn numerically). The Stein and Yatracos examples are part of the reason that the information inequality is not emphasized in this course. Still, the rest of this handout will present it, ending with a form that applies when there is a bias.

The Cauchy-Schwarz inequality as applied to random variables is as follows. It was given in the course in showing that correlations have absolute value at most 1.

Fact. Let X and Y be two random variables with E(X²) < ∞ and E(Y²) < ∞. Then (E(XY))² ≤ E(X²)E(Y²). Equality holds if and only if X and Y are linearly dependent (one is a constant times the other).

Proof. For any real t, E((X + tY)²) ≥ 0. Expanding gives E(X²) + 2tE(XY) + t²E(Y²) ≥ 0. If E(Y²) = 0, then Y = 0 with probability 1, so E(XY) = 0 and the inequality holds trivially.
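To see Stein's phenomenon numerically, here is a minimal simulation sketch comparing the usual estimator X with the classic James-Stein shrinkage estimator (1 − (d − 2)/|X|²)X for a single observation X ~ N(µ, I_d). The shrinkage formula is not derived in this handout (see the 18.466 notes cited above), and the dimension, mean vector, and replication count below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)

d = 5                  # dimension (Stein's phenomenon requires d >= 3)
mu = np.full(d, 2.0)   # true mean vector (arbitrary choice)
reps = 200_000         # Monte Carlo replications

# One observation X ~ N(mu, I_d) per replication.
X = rng.normal(loc=mu, scale=1.0, size=(reps, d))

# Usual estimator: X itself.  Its risk E|X - mu|^2 equals d.
mse_usual = np.mean(np.sum((X - mu) ** 2, axis=1))

# James-Stein estimator: shrink X toward the origin.
shrink = 1.0 - (d - 2) / np.sum(X ** 2, axis=1)
js = shrink[:, None] * X
mse_js = np.mean(np.sum((js - mu) ** 2, axis=1))

print(f"risk of X           : {mse_usual:.3f}   (exact value d = {d})")
print(f"risk of James-Stein : {mse_js:.3f}   (smaller whenever d >= 3)")
```

The positive-part variant, which replaces the shrinkage factor by max(0, 1 − (d − 2)/|X|²), does better still, but even the plain form above has smaller risk than X for every µ once d ≥ 3.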

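In the same spirit, here is a sketch comparing by simulation the mean-squared errors of cn Σ_{j=1}^n (Xj − X̄)² for the three coefficients mentioned above. It uses normal data, which is just one member of a scale family with finite fourth moments; the sample size, true variance, and replication count are again arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)

n = 10            # sample size (arbitrary)
sigma2 = 1.0      # true variance
reps = 500_000    # Monte Carlo replications

samples = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(reps, n))
# Centered sum of squares, Sum_j (X_j - Xbar)^2, for each replication.
ss = np.sum((samples - samples.mean(axis=1, keepdims=True)) ** 2, axis=1)

coeffs = {
    "1/(n-1)        (unbiased)     ": 1.0 / (n - 1),
    "1/n            (MLE / moments)": 1.0 / n,
    "(n+2)/(n(n+1)) (Yatracos)     ": (n + 2) / (n * (n + 1)),
}

for label, cn in coeffs.items():
    mse = np.mean((cn * ss - sigma2) ** 2)
    print(f"MSE with cn = {label}: {mse:.4f}")
```

For normal data with n = 10 the three MSEs come out to roughly 0.22, 0.19, and 0.21. Yatracos's theorem asserts only that the third coefficient beats cn = 1/(n − 1) for every distribution in such a scale family; it does not claim to beat cn = 1/n, which for normal data happens to do better still.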
