CS731 Spring 2011 Advanced Artificial Intelligence

Statistical Decision Theory

Lecturer: Xiaojin Zhu [email protected]

Fix a parameter θ ∈ Θ. We observe data x sampled from the distribution parametrized by θ. Let θ̂ ≡ θ̂(x) be an estimator of θ based on the data x. We are going to compare different estimators.

Let a loss function L(θ, θ̂) : Θ × Θ → R+ be defined. For example,

    L(θ, θ̂) = (θ − θ̂)^2                                          (1)

    L(θ, θ̂) = 0 if θ = θ̂,  1 if θ ≠ θ̂                            (2)

    L(θ, θ̂) = ∫ p(x; θ) log ( p(x; θ) / p(x; θ̂) ) dx              (3)

that is, squared error loss, zero-one loss, and the KL divergence between the two models.

The risk R(θ, θ̂) is the average loss, averaged over training sets sampled from the true θ:

    R(θ, θ̂) = E_θ[L(θ, θ̂(x))] = ∫ p(x; θ) L(θ, θ̂(x)) dx          (4)

Recall that E_θ means the expectation over x drawn from the distribution with fixed parameter θ, not the expectation over different θ.

Example 1. Let X ∼ N(θ, 1). Let θ̂_1 = X and θ̂_2 = 3.14. Assume squared error loss. Then R(θ, θ̂_1) = 1 (hint: this is the variance of X), while R(θ, θ̂_2) = E_θ(θ − 3.14)^2 = (θ − 3.14)^2 (hint: there is no X here, so the expectation does nothing). Over the whole range of possible θ ∈ R, neither estimator consistently dominates the other: θ̂_2 has smaller risk when θ is near 3.14, and larger risk otherwise.

Example 2. Let X_1, ..., X_n ∼ Bernoulli(θ). Consider squared error loss. Let θ̂_1 = (Σ X_i)/n, the sample mean. Let θ̂_2 = (α + Σ X_i)/(α + β + n), the "smoothed" estimate, i.e., the posterior mean under a Beta(α, β) prior. Let θ̂_3 = X_1, the first sample. Then R(θ, θ̂_1) = V((Σ X_i)/n) = θ(1 − θ)/n and R(θ, θ̂_3) = V(X_1) = θ(1 − θ). So θ̂_3 is out as a learning algorithm.
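As a sanity check on the closed-form risks in Example 2, here is a minimal Monte Carlo sketch (not part of the original notes; the setting θ = 0.3, n = 10 and the function names are illustrative):

```python
import random

def empirical_risk(estimator, theta, n, trials=100_000, seed=0):
    """Monte Carlo estimate of R(theta, theta_hat) under squared error loss:
    average (theta - theta_hat(x))^2 over many training sets x of size n."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        x = [1 if rng.random() < theta else 0 for _ in range(n)]  # one Bernoulli training set
        total += (estimator(x) - theta) ** 2
    return total / trials

theta, n = 0.3, 10
r1 = empirical_risk(lambda x: sum(x) / len(x), theta, n)  # sample mean, risk theta(1-theta)/n
r3 = empirical_risk(lambda x: x[0], theta, n)             # first sample only, risk theta(1-theta)
print(r1)  # close to 0.3 * 0.7 / 10 = 0.021
print(r3)  # close to 0.3 * 0.7 = 0.21
```

The simulated risks match the formulas θ(1 − θ)/n and θ(1 − θ), confirming that θ̂_3 is worse by a factor of n.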
But what about θ̂_2?

    R(θ, θ̂_2) = E_θ(θ − θ̂_2)^2                                                (5)
              = V_θ(θ̂_2) + (bias(θ̂_2))^2                                      (6)
              = nθ(1 − θ)/(n + α + β)^2 + ( (nθ + α)/(n + α + β) − θ )^2        (7)

It is not difficult to show that one can make θ disappear from the risk (i.e., make the risk constant in θ) by setting α = β = √n/2, which gives

    R(θ, θ̂_2) = 1 / (4(√n + 1)^2).

It turns out this particular choice of α, β leads to a so-called minimax estimator θ̂_2, as we will show later. However, there is no dominance between θ̂_1 and θ̂_2, as the figure below shows:

[Figure: risk versus θ ∈ [0, 1] for n = 1, 10, 30. The risk of θ̂_1 is the curve θ(1 − θ)/n; the risk of θ̂_2 is a horizontal line. Each pair of curves crosses, so neither estimator dominates the other.]

The maximum risk is

    R_max(θ̂) = sup_θ R(θ, θ̂)                                                  (8)

The Bayes risk under a prior f(θ) is

    R_f^Bayes(θ̂) = ∫ R(θ, θ̂) f(θ) dθ.                                         (9)

Accordingly, we obtain two different criteria for defining "the best estimator" (or the best learning algorithm): the Bayes rule and the minimax rule, respectively. An estimator θ̂_Bayes is a Bayes rule with respect to the prior f if

    θ̂_Bayes = arg inf_θ̂ ∫ R(θ, θ̂) f(θ) dθ,                                   (10)

where the infimum is over all estimators θ̂. An estimator θ̂_minimax that minimizes the maximum risk is a minimax rule:

    θ̂_minimax = arg inf_θ̂ sup_θ R(θ, θ̂),                                     (11)

where again the infimum is over all estimators θ̂.

We list the following theorems without proof. For details see AoS p.197.

Theorem 1. Let f(θ) be a prior, x a sample, and f(θ | x) the corresponding posterior. If L(θ, θ̂) = (θ − θ̂)^2, then the Bayes rule is the posterior mean:

    θ̂_Bayes(x) = ∫ θ f(θ | x) dθ = E(θ | X = x).                               (12)

If L(θ, θ̂) = |θ − θ̂|, then the Bayes rule is the posterior median. If L(θ, θ̂) is zero-one loss, then the Bayes rule is the posterior mode.

Theorem 2. Suppose that θ̂ is the Bayes rule with respect to some prior f. Suppose further that θ̂ has constant risk: R(θ, θ̂) = c for all θ ∈ Θ. Then θ̂ is minimax.

Example 3. In Example 2 we made the choice α = β = √n/2 so that the risk R(θ, θ̂_2) = 1/(4(√n + 1)^2) is a constant. Also, θ̂_2 is the posterior mean and hence, by Theorem 1, is a Bayes rule under the prior Beta(√n/2, √n/2).
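The "not difficult to show" step above can be checked numerically: the following sketch (not from the notes; n = 10 is an arbitrary illustration) evaluates the closed-form risk (7) and confirms it is the same constant 1/(4(√n + 1)^2) for every θ once α = β = √n/2:

```python
import math

def risk_smoothed(theta, n, alpha, beta):
    """Closed-form risk (7) of the smoothed estimator:
    variance plus squared bias of the Beta(alpha, beta) posterior mean."""
    var = n * theta * (1 - theta) / (n + alpha + beta) ** 2
    bias = (n * theta + alpha) / (n + alpha + beta) - theta
    return var + bias ** 2

n = 10
a = b = math.sqrt(n) / 2                     # the constant-risk choice alpha = beta = sqrt(n)/2
const = 1 / (4 * (math.sqrt(n) + 1) ** 2)    # the claimed constant risk
for theta in [0.0, 0.1, 0.5, 0.9, 1.0]:
    assert abs(risk_smoothed(theta, n, a, b) - const) < 1e-12
print(const)  # the theta-independent risk for n = 10
```

Algebraically, this works because the θ(1 − θ) term from the variance and the (1/2 − θ)^2 term from the squared bias sum to the constant 1/4, up to the common factor 1/(√n + 1)^2.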
Putting them together, by Theorem 2, θ̂_2 is minimax.
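To see what minimaxity buys here, a quick check (not in the notes; the grid and n = 10 are illustrative) compares the maximum risks of the two estimators from Example 2. The sample mean's risk θ(1 − θ)/n peaks at θ = 1/2 with value 1/(4n), which exceeds the constant risk of θ̂_2:

```python
import math

n = 10
thetas = [i / 1000 for i in range(1001)]  # grid over [0, 1]

# sup over theta of the sample mean's risk theta(1-theta)/n
max_risk_mle = max(t * (1 - t) / n for t in thetas)
# constant risk of the smoothed estimator with alpha = beta = sqrt(n)/2
max_risk_minimax = 1 / (4 * (math.sqrt(n) + 1) ** 2)

print(max_risk_mle)      # 1/(4n) = 0.025, attained at theta = 1/2
print(max_risk_minimax)  # smaller, since (sqrt(n)+1)^2 > n
assert max_risk_minimax < max_risk_mle
```

So under the worst-case criterion (8), the smoothed estimator beats the sample mean, even though (as the risk curves showed) the sample mean has lower risk for θ near 0 or 1.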