Math Camp 2: Probability Theory
Sasha Rakhlin

σ-algebra

A σ-algebra Σ over a set Ω is a collection of subsets of Ω that is closed under countable set operations:

1. ∅ ∈ Σ.
2. If E ∈ Σ, then so is the complement of E.
3. If F is any countable collection of sets in Σ, then the union of all the sets E in F is also in Σ.

Measure

A measure µ is a function defined on a σ-algebra Σ over a set Ω, with values in [0, ∞], such that

1. The empty set has measure zero: µ(∅) = 0.
2. Countable additivity: if $E_1, E_2, E_3, \ldots$ is a countable sequence of pairwise disjoint sets in Σ, then
\[ \mu\left(\bigcup_{i=1}^{\infty} E_i\right) = \sum_{i=1}^{\infty} \mu(E_i). \]

The triple (Ω, Σ, µ) is called a measure space.

Lebesgue measure

The Lebesgue measure λ is the unique complete translation-invariant measure on a σ-algebra containing the intervals in $\mathbb{R}$ such that λ([0, 1]) = 1.

Probability measure

A probability measure is a positive measure µ on the measurable space (Ω, Σ) such that µ(Ω) = 1. The triple (Ω, Σ, µ) is then called a probability space.

A random variable is a measurable function $X : \Omega \to \mathbb{R}$.

We can now define the probability of an event A:
\[ P(A) = \mu\left(\{x : I_A(x) = 1\}\right), \]
where $I_A$ is the indicator function of A.

Expectation and variance

Given a random variable X ∼ µ, the expectation is
\[ \mathbb{E}X \equiv \int X \, d\mu. \]
Similarly, the variance of the random variable, $\sigma^2(X)$, is
\[ \mathrm{var}(X) \equiv \mathbb{E}(X - \mathbb{E}X)^2. \]

Convergence

Recall that a sequence $x_n$ converges to the limit $x$, written $x_n \to x$, if for any $\varepsilon > 0$ there exists an N such that $|x_n - x| < \varepsilon$ for all $n > N$.

We say that the sequence of random variables $X_n$ converges to X in probability, written $X_n \xrightarrow{P} X$, if
\[ P(|X_n - X| \geq \varepsilon) \to 0 \]
for every $\varepsilon > 0$.

Convergence in probability and almost surely

Any event with probability 1 is said to happen almost surely. A sequence of real random variables $X_n$ converges almost surely to a random variable X iff
\[ P\left(\lim_{n \to \infty} X_n = X\right) = 1. \]
Convergence almost surely implies convergence in probability.

Law of Large Numbers. Central Limit Theorem

Weak LLN: if $X_1, X_2, X_3, \ldots$ is an infinite sequence of i.i.d. random variables with finite variance $\sigma^2$, then
\[ \bar{X}_n = \frac{X_1 + \cdots + X_n}{n} \xrightarrow{P} \mathbb{E}X_1. \]
In other words, for any positive number $\varepsilon$, we have
\[ \lim_{n \to \infty} P\left(\left|\bar{X}_n - \mathbb{E}X_1\right| \geq \varepsilon\right) = 0. \]

CLT: with $\mu = \mathbb{E}X_1$,
\[ \lim_{n \to \infty} \Pr\left(\frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}} \leq z\right) = \Phi(z), \]
where Φ is the cdf of N(0, 1).

Useful Probability Inequalities

Jensen's inequality: if φ is a convex function, then
\[ \varphi(\mathbb{E}(X)) \leq \mathbb{E}(\varphi(X)). \]

For X ≥ 0,
\[ \mathbb{E}(X) = \int_0^{\infty} \Pr(X \geq t) \, dt. \]

Markov's inequality: if X ≥ 0, then for any t > 0,
\[ \Pr(X \geq t) \leq \frac{\mathbb{E}(X)}{t}. \]

Chebyshev's inequality (second moment): if X is an arbitrary random variable and t > 0, then
\[ \Pr(|X - \mathbb{E}(X)| \geq t) \leq \frac{\mathrm{var}(X)}{t^2}. \]

Cauchy-Schwarz inequality: if $\mathbb{E}(X^2)$ and $\mathbb{E}(Y^2)$ are finite, then
\[ |\mathbb{E}(XY)| \leq \sqrt{\mathbb{E}(X^2)\,\mathbb{E}(Y^2)}. \]

If X is a sum of independent variables, then X is better approximated by $\mathbb{E}(X)$ than predicted by Chebyshev's inequality. In fact, it is exponentially close!

Hoeffding's inequality: let $X_1, \ldots, X_n$ be independent bounded random variables with $a_i \leq X_i \leq b_i$ for each $i \in \{1, \ldots, n\}$, and let $S_n = \sum_{i=1}^n X_i$. Then for any t > 0,
\[ \Pr(|S_n - \mathbb{E}(S_n)| \geq t) \leq 2\exp\left(\frac{-2t^2}{\sum_{i=1}^n (b_i - a_i)^2}\right). \]

Remark about sup

Note that the statement

with prob. at least 1 − δ: $\forall f \in F, \ \left|\mathbb{E}f - \frac{1}{n}\sum_{i=1}^n f(z_i)\right| \leq \varepsilon$

is different from the statement

$\forall f \in F$, with prob. at least 1 − δ: $\left|\mathbb{E}f - \frac{1}{n}\sum_{i=1}^n f(z_i)\right| \leq \varepsilon$.

The second statement is an instance of the CLT, while the first statement is more complicated to prove and only holds for certain function classes.

Playing with Expectations

Fix a function f, a loss V, and a dataset $S = \{z_1, \ldots, z_n\}$. The empirical loss of f on this data is $I_S[f] = \frac{1}{n}\sum_{i=1}^n V(f, z_i)$. The expected error of f is $I[f] = \mathbb{E}_z V(f, z)$. What is the expected empirical error with respect to a draw of a set S of size n? Since the $z_i$ are drawn i.i.d.,
\[ \mathbb{E}_S I_S[f] = \frac{1}{n}\sum_{i=1}^n \mathbb{E}_S V(f, z_i) = \mathbb{E}_z V(f, z) = I[f]. \]
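The sketches below (not part of the original slides) illustrate several of the results above numerically. First, the weak LLN and CLT: a minimal simulation, where the choice of Exp(1) for the common distribution and the sample sizes are arbitrary assumptions for illustration.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
mu, sigma = 1.0, 1.0  # Exp(1) has mean 1 and standard deviation 1

# Weak LLN: the sample mean concentrates around E[X_1] as n grows.
for n in [10, 100, 10_000]:
    xbar = rng.exponential(scale=1.0, size=n).mean()
    print(f"n={n:>6}: sample mean = {xbar:.4f}  (E[X_1] = {mu})")

# CLT: (Xbar_n - mu) / (sigma / sqrt(n)) is approximately N(0, 1).
n, reps = 1_000, 20_000
means = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)
z = (means - mu) / (sigma / np.sqrt(n))
for t in [-1.0, 0.0, 1.0]:
    print(f"P(Z <= {t:+.1f}): empirical {np.mean(z <= t):.4f} vs Phi = {norm.cdf(t):.4f}")
```

The sample means drift toward 1 as n grows, and the empirical cdf of the normalized means matches Φ at every checkpoint.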
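The tail-integral formula for $\mathbb{E}(X)$ is stated without proof; here is the standard one-line derivation (my addition) for X ≥ 0:
\[ \mathbb{E}(X) = \mathbb{E}\int_0^{\infty} \mathbf{1}\{X \geq t\}\, dt = \int_0^{\infty} \mathbb{E}\,\mathbf{1}\{X \geq t\}\, dt = \int_0^{\infty} \Pr(X \geq t)\, dt, \]
where the first equality uses $X = \int_0^{\infty} \mathbf{1}\{X \geq t\}\, dt$ for X ≥ 0, and swapping the expectation and the integral is justified by Tonelli's theorem since the integrand is nonnegative.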
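Next, a hedged sketch comparing the Markov and Chebyshev bounds against Monte Carlo tail estimates; the Exp(1) test distribution and the thresholds are again assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=1.0, size=1_000_000)  # X >= 0 with E[X] = 1, var(X) = 1
EX, varX = x.mean(), x.var()

for t in [2.0, 3.0, 5.0]:
    markov = EX / t          # Markov: Pr(X >= t) <= E[X] / t
    chebyshev = varX / t**2  # Chebyshev: Pr(|X - E[X]| >= t) <= var(X) / t^2
    print(f"t={t}: Pr(X >= t)        = {np.mean(x >= t):.4f} <= {markov:.4f} (Markov)")
    print(f"      Pr(|X - EX| >= t) = {np.mean(np.abs(x - EX) >= t):.4f} <= {chebyshev:.4f} (Chebyshev)")
```

Both bounds hold but are loose: the true tails of Exp(1) decay exponentially, while the bounds decay only polynomially in t.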
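A similar sketch for Hoeffding's inequality, using Bernoulli(1/2) summands as an assumed test case (so $a_i = 0$, $b_i = 1$, and $\sum_i (b_i - a_i)^2 = n$); it exhibits the exponential improvement over Chebyshev claimed above.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 100, 200_000
# Bernoulli(1/2) summands: a_i = 0, b_i = 1, so sum_i (b_i - a_i)^2 = n.
S = rng.integers(0, 2, size=(reps, n)).sum(axis=1)
ES, varS = n * 0.5, n * 0.25

for t in [10, 15, 20]:
    empirical = np.mean(np.abs(S - ES) >= t)
    hoeffding = 2 * np.exp(-2 * t**2 / n)  # 2 exp(-2 t^2 / sum_i (b_i - a_i)^2)
    chebyshev = varS / t**2                # var(S_n) / t^2
    print(f"t={t}: empirical {empirical:.2e}  Hoeffding {hoeffding:.2e}  Chebyshev {chebyshev:.2e}")
```

As t grows, the Hoeffding bound tracks the exponentially small empirical tail, while the Chebyshev bound lags far behind.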
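Finally, a numerical check of the "Playing with Expectations" identity $\mathbb{E}_S I_S[f] = I[f]$ for a fixed f. The setup is hypothetical: squared loss $V(f, z) = (f(x) - y)^2$ with $z = (x, y)$ drawn from a simple synthetic distribution.

```python
import numpy as np

rng = np.random.default_rng(3)

def f(x):        # a fixed predictor (not chosen using the data)
    return 0.5 * x

def V(x, y):     # squared loss V(f, z) with z = (x, y)
    return (f(x) - y) ** 2

def draw(n):     # z = (x, y): x ~ N(0, 1), y = x + small noise
    x = rng.standard_normal(n)
    y = x + 0.1 * rng.standard_normal(n)
    return x, y

# Average I_S[f] over many draws of S (each of size n = 20)...
n, reps = 20, 50_000
emp = np.mean([V(*draw(n)).mean() for _ in range(reps)])

# ...and compare with I[f] = E_z V(f, z), estimated from one huge sample.
I_f = V(*draw(10_000_000)).mean()
print(f"E_S I_S[f] ~= {emp:.4f},  I[f] ~= {I_f:.4f}")
```

Both estimates agree (here, around 0.26), as the derivation predicts; note this holds only because f is fixed independently of S, which is exactly why the sup remark above matters for learned functions.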