EECS 70 Discrete Mathematics and Probability Theory
Spring 2014        Anant Sahai        Note 19

Deviations: small and large

Often, our goal is to understand how collections of a large number of independent random variables behave. This is what the laws of large numbers reveal. In general, the idea is that the average of a large number of i.i.d. random variables will approach the expectation. Sometimes that is enough, but usually we also need to understand how fast the average converges to the expectation. We saw this already in the context of polling: we needed to understand how many people to poll in order to get a trustworthy high-precision estimate of what the population is like.

So far, our main tools have been the Markov and Chebyshev inequalities:

(Markov) If X ≥ 0, then Pr(X ≥ a) ≤ E[X]/a.

(Chebyshev) If the X_i are i.i.d., then Pr(|(1/n)∑_{i=1}^n X_i − E[X_1]| ≥ ε) ≤ Var(X_1)/(nε²).

Taking the limit as n goes to infinity, Chebyshev's inequality implies that if the X_i are i.i.d., then

    Pr(|(1/n)∑_{i=1}^n X_i − E[X_1]| ≥ ε) → 0 as n → ∞,

so the average of the X_i converges to the expectation E[X_1]. This is called the weak law of large numbers.

The role of precision is played by the tolerance ε above. The measure of trustworthiness is how small the probability of exceeding the tolerance gets. For the sake of simplicity, let us use the reciprocal of the probability of being outside our desired precision as our measure of confidence. Looking at Chebyshev's inequality, we are tempted to draw two conclusions about what happens as n gets large:

• If we fix the desired confidence (pick a suitably low probability of our estimate being incorrect), then the precision ε seems to improve like √Var(X)/√n.

• If we fix the precision ε, then the confidence improves linearly in n, with a slope that depends on the precision.

It turns out that the first of these is essentially correct, while the second is often overly pessimistic.
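These two conclusions are easy to probe empirically. The following short simulation (my own sketch, not part of the note; the Uniform(0,1) choice and the function name `deviation_prob` are illustrative assumptions) compares the empirical probability of an ε-deviation of the average with the Var(X_1)/(nε²) Chebyshev bound:

```python
import random

random.seed(0)

def deviation_prob(n, eps, trials=20000):
    """Empirical Pr(|(1/n) * sum of n Uniform(0,1) draws - 1/2| >= eps)."""
    count = 0
    for _ in range(trials):
        avg = sum(random.random() for _ in range(n)) / n
        if abs(avg - 0.5) >= eps:
            count += 1
    return count / trials

eps = 0.05
for n in [10, 100, 1000]:
    emp = deviation_prob(n, eps)
    cheb = (1 / 12) / (n * eps ** 2)  # Var(Uniform(0,1)) = 1/12
    print(f"n={n:5d}  empirical={emp:.4f}  Chebyshev bound={min(cheb, 1.0):.4f}")
```

For fixed ε the empirical probability falls off much faster than the 1/n decay of the Chebyshev bound, which is exactly the pessimism that the large-deviations discussion later in the note quantifies.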
The first is covered by the central limit theorem and is often referred to as the study of "small deviations." Later in this lecture, we will see that the confidence actually improves exponentially in n. That is explored using what are called Chernoff bounds, and the general area is referred to as the study of "large deviations." The difference between "small" and "large" is that the small deviations are shrinking with n.

The Central Limit Theorem: studying "small" deviations

The Central Limit Theorem has a long and illustrious history. At an intuitive level, it says that the appropriately-scaled¹ sum of a bunch of independent random variables behaves like a Gaussian (AKA normal) random variable.

Theorem 19.1: (Central Limit Theorem) Let X_i be i.i.d. random variables with E[X_i] = µ and Var(X_i) = σ², and define

    Z_n = (∑_{i=1}^n X_i − nµ) / √(n·σ²).

Then for all z we have

    (CLT)  lim_{n→∞} Pr(Z_n ≤ z) = ∫_{−∞}^z (1/√(2π)) e^{−x²/2} dx.

We say that Z_n → N(0,1): Z_n converges in distribution to the standard Gaussian with mean 0 and variance 1.

Notice that the convergence above is in exactly the same sense (the CDF of Z_n converges) as the convergence of the appropriately scaled geometric random variables to a continuous-valued exponential random variable.

The Berry-Esséen inequality gives a more precise quantification of the speed of this convergence:

    (Berry-Esséen)  |Pr(Z_n ≤ z) − ∫_{−∞}^z (1/√(2π)) e^{−x²/2} dx| ≤ 0.77 · E[|X_1 − E[X_1]|³] / ((Var(X_1))^{3/2} √n).

The convergence can be much faster than that², but if all we know is that there exists a third moment E[|X_1 − E[X_1]|³], then that is pretty much the best we can have. As a rule of thumb, you should be wary about using the CLT in the context of probabilities that are smaller than O(1/√n).

The CLT construction of Z_n is done specifically so that its mean is always 0 and its variance is always 1. The fact that the CLT works validates the √(Var(X)/n) scaling of precision that even Chebyshev's coarser inequality predicted.

¹ Another way of interpreting the issue of appropriate scaling is to say that the sum of a bunch of appropriately small independent random variables behaves like a Gaussian. This is used to justify Gaussian distributions for thermal noise that results from the combined random motion of many molecules or electrons.

² You are strongly encouraged to use a computer to plot for yourself how fast the CDF converges for the following example: X_i uniform. The case of Bernoulli random variables is illustrated at the end of this note. Both of these converge quite rapidly.
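Following the suggestion in footnote 2, here is a minimal simulation (an illustrative sketch of mine; the function names and the Uniform(0,1) choice are assumptions, not from the note) of how quickly the CDF of Z_n approaches the standard Gaussian CDF:

```python
import math
import random

random.seed(1)

def normal_cdf(z):
    """Standard Gaussian CDF, computed via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def zn_cdf(n, z, trials=20000):
    """Empirical Pr(Z_n <= z) for X_i ~ Uniform(0,1), where mu = 1/2, sigma^2 = 1/12."""
    mu, var = 0.5, 1 / 12
    hits = 0
    for _ in range(trials):
        s = sum(random.random() for _ in range(n))
        if (s - n * mu) / math.sqrt(n * var) <= z:
            hits += 1
    return hits / trials

for n in [1, 5, 30]:
    print(f"n={n:2d}  Pr(Z_n <= 1) = {zn_cdf(n, 1.0):.3f}  vs Gaussian {normal_cdf(1.0):.3f}")
```

Even at n = 5 the empirical CDF is already close to the Gaussian value, consistent with footnote 2's remark that the uniform case converges quite rapidly.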
Example 1. (From Bertsekas and Tsitsiklis) We have 100 bags, with weight distributed uniformly in [5,50], so that the mean weight is 27.5 lbs and the variance is 168.75 lbs². Can we upper bound the probability that the total weight is larger than 3000 lbs?

For comparison, let's see what happens if we use Chebyshev's inequality:

    Pr(|(1/100)∑_{i=1}^{100} X_i − 2750/100| ≥ 2.50) ≤ 168.75 / (100·(2.5)²) = 0.27.

That is a valid bound, but intuitively it overestimates the probability by a factor of two, since it also includes the case of the average being significantly smaller than the mean.

If we use the CLT instead,

    Pr(∑_{i=1}^{100} X_i ≥ 3000) = 1 − Pr(∑_{i=1}^{100} X_i < 3000)
                                 = 1 − Pr((∑_{i=1}^{100} X_i − 2750)/√(100·168.75) < 250/√(100·168.75))
                                 ≈ 1 − Pr(Z < 250/√(100·168.75)) ≈ 0.027,

where Z is a random variable distributed as N(0,1). So the CLT predicts that the probability of having a total weight over 3000 lbs is about 3%, which is much lower than the 27% bound (or about 13% if we divide by two) that we got from Chebyshev's inequality!

It turns out that the normal approximation here is very good, even though the CLT is not a bound and 3% is not an appropriately low number according to the Berry-Esséen inequality. This is because the CLT is very good for sums of bounded random variables like the uniform.
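The numbers in Example 1 are easy to reproduce. The following sketch (my own check, not from the note) computes the Chebyshev bound, the CLT approximation via the error function, and a Monte Carlo estimate of the true probability:

```python
import math
import random

random.seed(2)

def normal_cdf(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Example 1: 100 bags, weights i.i.d. Uniform[5, 50], threshold 3000 lbs.
n, lo, hi, threshold = 100, 5.0, 50.0, 3000.0
mu = (lo + hi) / 2            # 27.5
var = (hi - lo) ** 2 / 12     # 168.75

# Chebyshev bound on Pr(|average - mu| >= 2.5)
cheb = var / (n * 2.5 ** 2)

# CLT approximation of Pr(total >= 3000)
z = (threshold - n * mu) / math.sqrt(n * var)
clt = 1 - normal_cdf(z)

# Monte Carlo check of the true probability
trials = 100_000
hits = sum(sum(random.uniform(lo, hi) for _ in range(n)) >= threshold
           for _ in range(trials))
print(f"Chebyshev: {cheb:.3f}   CLT: {clt:.4f}   Monte Carlo: {hits / trials:.4f}")
```

The Monte Carlo estimate lands very close to the roughly 2.7% CLT figure, confirming how loose the 27% Chebyshev bound is here.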
Remark. In order to use the CLT to get easily calculated bounds, the following approximations will often prove useful: for any z > 0,

    (1 − 1/z²) · e^{−z²/2}/(z√(2π)) ≤ ∫_z^∞ (1/√(2π)) e^{−x²/2} dx ≤ e^{−z²/2}/(z√(2π)).

This way, you can approximate the tail of a Gaussian even if you don't have a calculator capable of doing numeric integration handy. It also reveals, in closed form, how the tail scales. We will use this later.

Example 2. Imagine a polling situation in which we assume that people have a probability p of supporting Obama. Suppose we poll n people, and we want