UW-Madison CS 731 - Statistical Decision Theory
CS731 Spring 2011 Advanced Artificial Intelligence

Statistical Decision Theory

Lecturer: Xiaojin Zhu [email protected]

Consider a parameter θ ∈ Θ. We observe data x sampled from the distribution parametrized by θ. Let θ̂ ≡ θ̂(x) be an estimator of θ based on the data x. We are going to compare different estimators.

Let a loss function L(θ, θ̂) : Θ × Θ → R+ be defined. For example,

    L(θ, θ̂) = (θ − θ̂)²                                                  (1)

    L(θ, θ̂) = 0 if θ = θ̂,  1 if θ ≠ θ̂                                  (2)

    L(θ, θ̂) = ∫ p(x; θ) log ( p(x; θ) / p(x; θ̂) ) dx                    (3)

The risk R(θ, θ̂) is the average loss, averaged over training sets sampled from the true θ:

    R(θ, θ̂) = E_θ[L(θ, θ̂(x))] = ∫ p(x; θ) L(θ, θ̂(x)) dx                (4)

Recall that E_θ means the expectation over x drawn from the distribution with fixed parameter θ, not the expectation over different θ.

Example 1  Let X ∼ N(θ, 1). Let θ̂1 = X and θ̂2 = 3.14. Assume squared error loss. Then R(θ, θ̂1) = 1 (hint: variance), and R(θ, θ̂2) = E_θ(θ − 3.14)² = (θ − 3.14)² (hint: no X here). Over the whole range of possible θ ∈ R, neither estimator consistently dominates the other.

Example 2  Let X1, ..., Xn ∼ Bernoulli(θ). Consider squared error loss. Let θ̂1 = (Σ Xi)/n, the sample mean. Let θ̂2 = (α + Σ Xi)/(α + β + n), the "smoothed" estimate, i.e., the posterior mean under a Beta(α, β) prior. Let θ̂3 = X1, the first sample. Then R(θ, θ̂1) = V((Σ Xi)/n) = θ(1 − θ)/n, and R(θ, θ̂3) = V(X1) = θ(1 − θ). So θ̂3 is out as a learning algorithm.
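The risk in Example 1 is easy to check by simulation. Below is a minimal sketch (the helper `mc_risk` is mine, not from the notes) that Monte Carlo estimates the squared-error risk of θ̂1 = X and of the constant estimator θ̂2 = 3.14 for one observation x ∼ N(θ, 1):

```python
import random

def mc_risk(theta, estimator, n_trials=200_000, seed=0):
    """Monte Carlo estimate of R(theta, theta_hat) = E_theta[(theta - theta_hat(x))^2],
    drawing one observation x ~ N(theta, 1) per trial."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        x = rng.gauss(theta, 1.0)
        total += (theta - estimator(x)) ** 2
    return total / n_trials

theta = 2.0
r1 = mc_risk(theta, lambda x: x)     # theta_hat_1 = X: risk ~ 1 (the variance), for any theta
r2 = mc_risk(theta, lambda x: 3.14)  # theta_hat_2 = 3.14: risk is exactly (theta - 3.14)^2
```

Repeating this for several θ values shows r1 stays near 1 while r2 grows quadratically in the distance from 3.14, which is the "no consistent dominance" point.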
But what about θ̂2?

    R(θ, θ̂2) = E_θ(θ − θ̂2)²                                             (5)
             = V_θ(θ̂2) + (bias(θ̂2))²                                    (6)
             = nθ(1 − θ)/(n + α + β)² + ((nθ + α)/(n + α + β) − θ)²      (7)

It is not difficult to show that one can make θ disappear from the risk (i.e., make the risk the same for every θ) by setting α = β = √n/2, which gives

    R(θ, θ̂2) = 1/(4(√n + 1)²).

It turns out this particular choice of α, β leads to a so-called minimax estimator θ̂2, as we will show later. However, there is no dominance between θ̂1 and θ̂2, as the figure below shows:

[Figure: the risks R(θ, θ̂1) and R(θ, θ̂2) plotted over θ ∈ [0, 1] for n = 1, 10, 30; the curves cross, so neither estimator dominates.]

The maximum risk is

    R_max(θ̂) = sup_θ R(θ, θ̂)                                            (8)

The Bayes risk under prior f(θ) is

    R_Bayes^f(θ̂) = ∫ R(θ, θ̂) f(θ) dθ.                                   (9)

Accordingly, two different criteria to define "the best estimator" (or the best learning algorithm) are the Bayes rule and the minimax rule, respectively. An estimator θ̂_Bayes is a Bayes rule with respect to the prior f if

    θ̂_Bayes = arg inf_θ̂ ∫ R(θ, θ̂) f(θ) dθ,                             (10)

where the infimum is over all estimators θ̂. An estimator θ̂_minimax that minimizes the maximum risk is a minimax rule:

    θ̂_minimax = arg inf_θ̂ sup_θ R(θ, θ̂),                               (11)

where again the infimum is over all estimators θ̂.

We list the following theorems without proof. For details see AoS p.197.

Theorem 1  Let f(θ) be a prior, x a sample, and f(θ | x) the corresponding posterior. If L(θ, θ̂) = (θ − θ̂)², then the Bayes rule is the posterior mean:

    θ̂_Bayes(x) = ∫ θ f(θ | x) dθ = E(θ | X = x).                        (12)

If L(θ, θ̂) = |θ − θ̂|, then the Bayes rule is the posterior median. If L(θ, θ̂) is zero-one loss, then the Bayes rule is the posterior mode.

Theorem 2  Suppose that θ̂ is the Bayes rule with respect to some prior f. Suppose further that θ̂ has constant risk: R(θ, θ̂) = c for all θ ∈ Θ. Then θ̂ is minimax.

Example 3  In Example 2 we made the choice α = β = √n/2 so that the risk R(θ, θ̂2) = 1/(4(√n + 1)²) is a constant. Also, θ̂2 is the posterior mean and hence, by Theorem 1, is a Bayes rule under the prior Beta(√n/2, √n/2).
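The no-dominance picture in the figure can be reproduced from the closed-form risks of Example 2. A small sketch (the function names are mine) evaluating θ(1 − θ)/n against equation (7) with α = β = √n/2:

```python
import math

def risk_mle(theta, n):
    """Risk of the sample mean theta_hat_1: theta(1 - theta)/n."""
    return theta * (1 - theta) / n

def risk_smoothed(theta, n, alpha, beta):
    """Risk of the Beta(alpha, beta) posterior mean theta_hat_2:
    variance + squared bias, as in equation (7)."""
    s = n + alpha + beta
    variance = n * theta * (1 - theta) / s**2
    bias = (n * theta + alpha) / s - theta
    return variance + bias**2

n = 10
a = b = math.sqrt(n) / 2
# Near theta = 1/2 the smoothed estimator has lower risk; near the
# endpoints the sample mean wins -- neither estimator dominates.
mid_mle, mid_sm = risk_mle(0.5, n), risk_smoothed(0.5, n, a, b)
edge_mle, edge_sm = risk_mle(0.05, n), risk_smoothed(0.05, n, a, b)
```

Sweeping θ over [0, 1] for n = 1, 10, 30 reproduces the crossing curves in the figure.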
Putting them together, by Theorem 2, θ̂2 is minimax.
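The constant-risk premise of this conclusion can be verified numerically: with α = β = √n/2, equation (7) collapses to 1/(4(√n + 1)²) regardless of θ. A quick check (a sketch under those assumptions, with n = 25 so that √n is exact):

```python
import math

def risk_theta2(theta, n):
    """Risk (variance + squared bias) of theta_hat_2 with alpha = beta = sqrt(n)/2."""
    a = math.sqrt(n) / 2
    s = n + 2 * a
    variance = n * theta * (1 - theta) / s**2
    bias = (n * theta + a) / s - theta
    return variance + bias**2

n = 25
target = 1 / (4 * (math.sqrt(n) + 1) ** 2)   # = 1/144 for n = 25
flat = [risk_theta2(t, n) for t in (0.1, 0.3, 0.5, 0.7, 0.9)]
# every entry of `flat` equals `target`: the risk does not depend on theta
```

Since the risk is flat in θ, the maximum risk equals the Bayes risk for this estimator, which is the mechanism behind Theorem 2.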

