MIT OpenCourseWare
http://ocw.mit.edu

18.443 Statistics for Applications, Spring 2009

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

18.443 MEAN-SQUARE ERRORS OF ESTIMATORS: BIAS, VARIANCE, AND INFORMATION INEQUALITIES

Suppose we have a parametric family of probability distributions with a likelihood function f(x, θ) for one observation, where f(x, θ) is a probability mass function for a discrete distribution or a probability density function for a continuous distribution. Let Eθ denote expectation, and Pθ probability, when θ is the true value of the parameter. Let X = (X1, ..., Xn) be a vector of i.i.d. observations with distribution Pθ. Suppose g = g(θ) is a real-valued function of the parameter θ. One criterion for choosing an estimator T = T(X) of g(θ) is to minimize the mean-squared error (MSE) Eθ((T(X) − g(θ))²).

Recall that T is called an unbiased estimator of g(θ) if EθT(X) = g(θ) for all θ. More generally, the bias of T as an estimator of g(θ) is defined by bT(θ) := bT,g(θ) := EθT − g(θ) for all θ. Thus T is unbiased as an estimator of g(θ) if and only if bT(θ) = 0 for all θ. If Eθ(T²) < +∞ for all θ, let Varθ(T) be the variance of T for the given θ, which equals Eθ(T²) − (EθT)². The MSE equals the variance plus the bias squared, as follows:

Theorem. For any statistic T(X) such that Eθ(T²) < ∞ for all θ and any real-valued function g(θ), the mean-square error of T as an estimator of g is given by

    Eθ((T(X) − g(θ))²) = Varθ(T) + bT(θ)².

Proof. Let h(θ) := EθT. Then we have

    Eθ((T(X) − g(θ))²) = Eθ((T(X) − h(θ) + h(θ) − g(θ))²)
                       = Varθ(T) + 2 bT(θ) Eθ(T(X) − h(θ)) + bT(θ)²
                       = Varθ(T) + bT(θ)²,

where the cross term is 0 because, for given θ, bT(θ) is a constant and Eθ(T(X) − h(θ)) = 0, so the proof is complete. Q.E.D. (A small numerical illustration of this decomposition appears below, after the discussion of Bayes estimators.)

In a classical approach, say in research from the 1930's through the mid-1950's and still in many textbooks, one looked at unbiased estimators, so that b(θ) ≡ 0, and then tried to minimize the variance. A lower bound for the variance of unbiased estimators, the so-called information inequality, or Cramér-Rao inequality (Rice, Section 8.7, Theorem A, and later in this handout), proved in the 1940's, was considered one of the main theorems of mathematical statistics.

An estimator T(X) is called inadmissible as an estimator of g(θ), for mean-squared error, if there is another estimator U(X) such that Eθ[(U(X) − g(θ))²] ≤ Eθ[(T(X) − g(θ))²] for all θ, where the inequality becomes strict, with ≤ replaced by <, for some θ. If there is no such U then T is called admissible. Let's call T(X) strongly inadmissible if we add to the definition that Eθ[(U(X) − g(θ))²] < Eθ[(T(X) − g(θ))²] for all θ in a non-empty open set V, namely, a set such that for some θ0 in V and r > 0, also θ is in V for all θ such that |θ − θ0| < r. In one dimension this would just say that V includes a non-degenerate interval.

If π is a prior density with π(θ) > 0 for all θ, and T is a Bayes estimator for g(θ), namely the integral of g(θ) times the posterior density πX(θ), then T cannot be strongly inadmissible, or there would be an estimator with smaller overall risk (integrating mean-square error times π(θ)), contradicting the Bayes property of T, as shown in lecture Monday 3/10 (I looked but so far have not found this fact in Rice).
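Before turning to the multidimensional examples, here is a minimal simulation sketch of the theorem above (MSE = variance + bias²), assuming numpy is available. It uses the variance estimator with divisor n, discussed further below, as T(X); the sample size, true variance, and replication count are arbitrary illustrative choices, not anything prescribed in these notes.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 10           # sample size (arbitrary illustrative choice)
sigma2 = 4.0     # true variance, the g(theta) being estimated
reps = 200_000   # Monte Carlo replications

# T(X): variance estimator with divisor n (biased), applied to normal samples
samples = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(reps, n))
T = samples.var(axis=1, ddof=0)

bias = T.mean() - sigma2            # Monte Carlo estimate of bT(theta)
var = T.var()                       # Monte Carlo estimate of Var_theta(T)
mse = np.mean((T - sigma2) ** 2)    # Monte Carlo estimate of the MSE

print(f"bias         = {bias: .4f}   (exact value -sigma2/n = {-sigma2 / n: .4f})")
print(f"variance     = {var: .4f}")
print(f"var + bias^2 = {var + bias**2: .4f}")
print(f"MSE          = {mse: .4f}   (agrees with var + bias^2)")
```

The last two printed numbers agree up to rounding, as the theorem guarantees for any estimator with Eθ(T²) < ∞.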
Suppose we have a normal distribution on d-dimensional space where the coordinates xj are normal and independent with means µj and variance 1. The analogue of squared difference is squared Euclidean distance

    |x − y|² = Σ_{j=1}^d (xj − yj)²,

so that for the mean vector µ = (µ1, ..., µd), and an estimator T(X) of it, also with d-dimensional values, we're aiming to minimize Eµ(|T(X) − µ|²). A surprising discovery by Charles Stein in 1956 was that although the sample mean X̄ is an admissible estimator of the mean vector µ for d = 1 or 2, it is not for d = 3 or larger; biased estimators can do better. Details are given in the 18.466 OCW notes, Section 2.7 (a small simulation sketch illustrating this follows the Cauchy-Schwarz discussion below).

Yatracos (2005) considered the sample variance for 1-dimensional data. Let the sample variance be defined as

    cn Σ_{j=1}^n (Xj − X̄)²,

where we know that cn = 1/(n − 1) gives an unbiased estimator of the variance whenever it is finite, whereas cn = 1/n gives the maximum likelihood estimate for normal distributions and the statistic used in method-of-moments estimation. Yatracos proved the following fact: let X1, ..., Xn be i.i.d. with any distribution such that E(X1⁴) < ∞ and the Xj are not constant, and in a family such that for any c with 0 < c < ∞, the distribution of cX1 is also in the family. Then the classical sample variance with cn = 1/(n − 1) is inadmissible as an estimator of the true variance. An estimator with smaller mean-squared error is obtained by taking

    cn = (n + 2)/(n(n + 1)).

Of course, the resulting estimator has a non-zero bias, but the bias becomes very small as n becomes large and the reduction in variance is enough to make the total MSE smaller (a second sketch below compares these choices of cn numerically). The Stein and Yatracos examples are part of the reason that the information inequality is not emphasized in this course. Still, the rest of this handout will present it, ending with a form that applies when there is a bias.

The Cauchy-Schwarz inequality as applied to random variables is as follows. It was given in the course in showing that correlations have absolute value at most 1.

Fact. Let X and Y be two random variables with E(X²) < ∞ and E(Y²) < ∞. Then (E(XY))² ≤ E(X²)E(Y²). Equality holds if and only if X and Y are linearly dependent (one is a constant times the other).

Proof. For any real t, E((X + tY)²) ≥ 0. Expanding gives E(X²) + 2tE(XY) + t²E(Y²) ≥ 0. If E(Y²) = 0, then Y = 0 with probability 1, so E(XY) = 0 and the inequality holds trivially.
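To see Stein's phenomenon numerically, here is a minimal simulation sketch comparing the usual estimator X with the classic James-Stein shrinkage estimator (1 − (d − 2)/|X|²)X for a single observation X ~ N(µ, I_d). The shrinkage formula is not derived in this handout (see the 18.466 notes cited above), and the dimension, mean vector, and replication count below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)

d = 5                  # dimension (Stein's phenomenon requires d >= 3)
mu = np.full(d, 2.0)   # true mean vector (arbitrary choice)
reps = 200_000         # Monte Carlo replications

# One observation X ~ N(mu, I_d) per replication.
X = rng.normal(loc=mu, scale=1.0, size=(reps, d))

# Usual estimator: X itself.  Its risk E|X - mu|^2 equals d.
mse_usual = np.mean(np.sum((X - mu) ** 2, axis=1))

# James-Stein estimator: shrink X toward the origin.
shrink = 1.0 - (d - 2) / np.sum(X ** 2, axis=1)
js = shrink[:, None] * X
mse_js = np.mean(np.sum((js - mu) ** 2, axis=1))

print(f"risk of X           : {mse_usual:.3f}   (exact value d = {d})")
print(f"risk of James-Stein : {mse_js:.3f}   (smaller whenever d >= 3)")
```

The positive-part variant, which replaces the shrinkage factor by max(0, 1 − (d − 2)/|X|²), does better still, but even the plain form above has smaller risk than X for every µ once d ≥ 3.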

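In the same spirit, here is a sketch comparing by simulation the mean-squared errors of cn Σ_{j=1}^n (Xj − X̄)² for the three coefficients mentioned above. It uses normal data, which is just one member of a scale family with finite fourth moments; the sample size, true variance, and replication count are again arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)

n = 10            # sample size (arbitrary)
sigma2 = 1.0      # true variance
reps = 500_000    # Monte Carlo replications

samples = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(reps, n))
# Centered sum of squares, Sum_j (X_j - Xbar)^2, for each replication.
ss = np.sum((samples - samples.mean(axis=1, keepdims=True)) ** 2, axis=1)

coeffs = {
    "1/(n-1)        (unbiased)     ": 1.0 / (n - 1),
    "1/n            (MLE / moments)": 1.0 / n,
    "(n+2)/(n(n+1)) (Yatracos)     ": (n + 2) / (n * (n + 1)),
}

for label, cn in coeffs.items():
    mse = np.mean((cn * ss - sigma2) ** 2)
    print(f"MSE with cn = {label}: {mse:.4f}")
```

For normal data with n = 10 the three MSEs come out to roughly 0.22, 0.19, and 0.21. Yatracos's theorem asserts only that the third coefficient beats cn = 1/(n − 1) for every distribution in such a scale family; it does not claim to beat cn = 1/n, which for normal data happens to do better still.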
