CS731 Spring 2011 Advanced Artificial Intelligence

Basics of Statistical Machine Learning

Lecturer: Xiaojin Zhu [email protected]

Modern machine learning is rooted in statistics. You will find many familiar concepts here with a different name.

1 Parametric vs. Nonparametric Statistical Models

A statistical model H is a set of distributions.

A parametric model is one that can be parametrized by a finite number of parameters. We write the PDF f(x) = f(x; θ) to emphasize the parameter θ ∈ R^d. In general,

    H = \{ f(x; \theta) : \theta \in \Theta \subset R^d \}    (1)

where Θ is the parameter space. We will often use the notation

    E_\theta(g) = \int_x g(x) f(x; \theta) \, dx    (2)

to denote the expectation of a function g with respect to f(x; θ). Note the subscript in E_θ does not mean integrating over all θ.

Example 1 Consider the parametric model H = {N(µ, 1) : µ ∈ R}. Given iid data x_1, ..., x_n, the optimal estimator of the mean is µ̂ = (1/n) Σ_i x_i.

A nonparametric model is one which cannot be parametrized by a fixed number of parameters.

Example 2 Consider the nonparametric model H = {P : Var_P(X) < ∞}. Given iid data x_1, ..., x_n, the optimal estimator of the mean is again µ̂ = (1/n) Σ_i x_i.

Example 3 In a naive Bayes classifier we are interested in computing the conditional p(y|x; θ) ∝ p(y; θ) Π_{i=1}^d p(x_i|y; θ). Is this a parametric or nonparametric model? The model is specified by H = {p(x, y; θ)}, where θ contains the parameters of the class-prior multinomial distribution p(y) (a finite number of parameters) and of the class-conditional distributions p(x_i|y) for each dimension. The latter can be parametric (such as a multinomial over the vocabulary, or a Gaussian) or nonparametric (such as 1D kernel density estimation). Therefore, naive Bayes can be either parametric or nonparametric, although in practice the former is more common.

In machine learning we are often interested in a function of the distribution, T(F), for example the mean. We call T the statistical functional, viewing F, the distribution itself, as a function of x. However, we will also abuse the notation and say θ = T(F) is a "parameter" even for nonparametric models.

2 Estimation

Given X_1, ..., X_n ∼ F ∈ H, an estimator θ̂_n is any function of X_1, ..., X_n that attempts to estimate a parameter θ. An estimator is consistent if

    \hat\theta_n \xrightarrow{P} \theta.    (3)

Because θ̂_n is a random variable, we can talk about its expectation

    E_\theta(\hat\theta_n)    (4)

where E_θ is w.r.t. the joint distribution f(x_1, ..., x_n; θ) = \prod_{i=1}^n f(x_i; θ). Then, the bias of the estimator is

    bias(\hat\theta_n) = E_\theta(\hat\theta_n) - \theta.    (5)

An estimator is unbiased if bias(θ̂_n) = 0. The standard error of an estimator is

    se(\hat\theta_n) = \sqrt{Var_\theta(\hat\theta_n)}.    (6)

The mean squared error of an estimator is

    mse(\hat\theta_n) = E_\theta(\hat\theta_n - \theta)^2.    (7)

Theorem 1 mse(θ̂_n) = bias²(θ̂_n) + se²(θ̂_n) = bias²(θ̂_n) + Var_θ(θ̂_n).

3 Maximum Likelihood

For parametric statistical models, a common estimator is the maximum likelihood estimator. Let x_1, ..., x_n be iid with PDF f(x; θ), where θ ∈ Θ. The likelihood function is

    L_n(\theta) = f(x_1, ..., x_n; \theta) = \prod_{i=1}^n f(x_i; \theta).    (8)

The log likelihood function is ℓ_n(θ) = log L_n(θ). The maximum likelihood estimator (MLE) is

    \hat\theta_n = \arg\max_{\theta \in \Theta} L_n(\theta) = \arg\max_{\theta \in \Theta} \ell_n(\theta).    (9)

Example 4 The MLE for p(head) from n coin flips is count(head)/n, sometimes called "estimating probability by the frequency." This is also true for multinomials. The MLE for X_1, ..., X_n ∼ N(µ, σ²) is µ̂ = (1/n) Σ_i X_i and σ̂² = (1/n) Σ_i (X_i − µ̂)². These agree with our intuition. However, the MLE does not always agree with intuition. For example, the MLE for X_1, ..., X_n ∼ uniform(0, θ) is θ̂ = max(X_1, ..., X_n). You would think θ is larger, no?
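The closed-form MLEs in Example 4 are easy to verify numerically. Below is a minimal Python/NumPy sketch (not part of the original notes; the true parameter values and sample size are illustrative assumptions) that simulates iid data and computes the frequency, sample-mean/variance, and sample-maximum estimators above.

```python
# A minimal sketch (not from the notes): numerically check the MLEs of Example 4.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000  # assumed sample size

# Coin flips: the MLE of p(head) is the frequency count(head)/n.
p_true = 0.3
flips = rng.binomial(1, p_true, size=n)
p_mle = flips.mean()

# Gaussian: the MLEs are the sample mean and the 1/n (biased) sample variance.
x = rng.normal(loc=2.0, scale=1.5, size=n)
mu_mle = x.mean()
sigma2_mle = np.mean((x - mu_mle) ** 2)   # divides by n, not n - 1

# uniform(0, theta): the MLE is the sample maximum, which never exceeds theta.
theta_true = 5.0
u = rng.uniform(0.0, theta_true, size=n)
theta_mle = u.max()

print(p_mle, mu_mle, sigma2_mle, theta_mle)
```

With large n each estimate lands close to the corresponding true value, and θ̂ = max(X_1, ..., X_n) illustrates the point above: it always sits slightly below the true θ.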
The MLE has several nice properties. The Kullback-Leibler divergence between two PDFs f and g is

    KL(f \| g) = \int f(x) \log \frac{f(x)}{g(x)} \, dx.    (10)

The model H is identifiable if ∀θ, ψ ∈ Θ, θ ≠ ψ implies KL(f(x; θ) ‖ f(x; ψ)) > 0. That is, different parameters correspond to different PDFs.

Theorem 2 When H is identifiable, under certain conditions (see Wasserman Theorem 9.13), the MLE θ̂_n →^P θ*, where θ* is the true value of the parameter θ. That is, the MLE is consistent.

Given n iid observations, the Fisher information is defined as

    I_n(\theta) = n E_\theta\!\left[ \left( \frac{\partial}{\partial\theta} \ln f(X; \theta) \right)^2 \right] = -n E_\theta\!\left[ \frac{\partial^2}{\partial\theta^2} \ln f(X; \theta) \right].    (11)

Example 5 Consider n iid observations x_i ∈ {0, 1} from a Bernoulli distribution with true parameter p, so that f(x; p) = p^x (1 − p)^{1−x}. It follows that ∂²/∂θ² ln f(X; θ), evaluated at p, is −x/p² − (1 − x)/(1 − p)². Taking the expectation over x under f(x; p) and multiplying by −n, we arrive at I_n(p) = n / (p(1 − p)).

Theorem 3 (Asymptotic Normality of the MLE) Let se = \sqrt{Var_θ(θ̂_n)}. Under appropriate regularity conditions, se ≈ \sqrt{1/I_n(θ)}, and

    \frac{\hat\theta_n - \theta}{se} \rightsquigarrow N(0, 1).    (12)

Furthermore, let ŝe = \sqrt{1/I_n(θ̂_n)}. Then

    \frac{\hat\theta_n - \theta}{\hat{se}} \rightsquigarrow N(0, 1).    (13)

Theorem 4 (Cramér-Rao Lower Bound) Let θ̂_n be any unbiased estimator (not necessarily the MLE) of θ. Then its variance is lower bounded by the inverse Fisher information:

    Var_\theta(\hat\theta_n) \geq \frac{1}{I_n(\theta)}.    (14)

The Fisher information can be generalized to the high-dimensional case. Let θ be a parameter vector. The Fisher information matrix has (i, j)-th element

    I_{ij}(\theta) = -E\!\left[ \frac{\partial^2 \ln f(X; \theta)}{\partial\theta_i \, \partial\theta_j} \right].    (15)

An unbiased estimator that achieves the Cramér-Rao lower bound is said to be efficient. It is asymptotically efficient if it achieves the bound as n → ∞.

Theorem 5 The MLE is asymptotically efficient.

4 Bayesian Inference

The statistical methods discussed so far are frequentist methods:

• Probability refers to limiting relative frequency.
• Data are random.
• Estimators are random because they are functions of data.
• Parameters are fixed, unknown constants not subject to probabilistic statements.
• Procedures are subject to probabilistic statements; for example, 95% confidence intervals trap the true parameter value 95% of the time.

An alternative is the Bayesian approach:

• Probability refers to degree of belief.
• Inference about a parameter θ is done by producing a probability distribution on it. Typically, one starts with a prior distribution p(θ). One also chooses a likelihood function p(x | θ); note this is a function of θ, not x. After observing data x, one applies Bayes' theorem to obtain the posterior distribution p(θ | x):

    p(\theta \mid x) = \frac{p(\theta) \, p(x \mid \theta)}{\int p(\theta') \, p(x \mid \theta') \, d\theta'} \propto p(\theta) \, p(x \mid \theta).    (16)
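When θ is one-dimensional, the posterior in Eq. (16) can be computed directly by evaluating the numerator p(θ)p(x | θ) on a grid and normalizing numerically. The sketch below is not from the notes; the Bernoulli likelihood, the uniform prior, and the data counts are assumptions chosen purely for illustration.

```python
# A minimal sketch (not from the notes): Eq. (16) by grid approximation,
# assuming a Bernoulli likelihood and a uniform prior on theta in [0, 1].
import numpy as np

theta = np.linspace(0.0, 1.0, 1001)   # grid over the parameter space
prior = np.ones_like(theta)           # p(theta): uniform prior (an assumption)

# Assumed data x: 7 heads out of 10 coin flips.
heads, n = 7, 10
likelihood = theta**heads * (1.0 - theta)**(n - heads)   # p(x | theta)

# Posterior: normalize p(theta) p(x | theta) by its integral over theta.
unnormalized = prior * likelihood
posterior = unnormalized / np.trapz(unnormalized, theta)

posterior_mean = np.trapz(theta * posterior, theta)   # approx. 0.667 here
theta_map = theta[np.argmax(posterior)]               # 0.7 under the flat prior
print(posterior_mean, theta_map)
```

Under the flat prior the posterior mode coincides with the MLE count(head)/n from Example 4, while the posterior mean is pulled slightly toward 1/2 by the prior.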