CS731 Spring 2011 Advanced Artificial Intelligence

Basics of Statistical Machine Learning

Lecturer: Xiaojin Zhu [email protected]

Modern machine learning is rooted in statistics. You will find many familiar concepts here with a different name.

1 Parametric vs. Nonparametric Statistical Models

A statistical model H is a set of distributions.

A parametric model is one that can be parametrized by a finite number of parameters. We write the PDF f(x) = f(x; θ) to emphasize the parameter θ ∈ R^d. In general,

    H = \{ f(x; \theta) : \theta \in \Theta \subset R^d \}    (1)

where Θ is the parameter space. We will often use the notation

    E_\theta(g) = \int_x g(x) f(x; \theta) \, dx    (2)

to denote the expectation of a function g with respect to f(x; θ). Note the subscript in E_θ does not mean integrating over all θ.

Example 1 Consider the parametric model H = {N(µ, 1) : µ ∈ R}. Given iid data x_1, ..., x_n, the optimal estimator of the mean is µ̂ = (1/n) Σ_i x_i.

A nonparametric model is one which cannot be parametrized by a fixed number of parameters.

Example 2 Consider the nonparametric model H = {P : Var_P(X) < ∞}. Given iid data x_1, ..., x_n, the optimal estimator of the mean is again µ̂ = (1/n) Σ_i x_i.

Example 3 In a naive Bayes classifier we are interested in computing the conditional p(y|x; θ) ∝ p(y; θ) Π_{i=1}^d p(x_i|y; θ). Is this a parametric or nonparametric model? The model is specified by H = {p(x, y; θ)}, where θ contains the parameters of the class-prior multinomial distribution p(y) (a finite number of parameters) and of the class-conditional distributions p(x_i|y) for each dimension. The latter can be parametric (such as a multinomial over the vocabulary, or a Gaussian) or nonparametric (such as 1D kernel density estimation). Therefore, naive Bayes can be either parametric or nonparametric, although in practice the former is more common.

In machine learning we are often interested in a function of the distribution, T(F), for example the mean. We call T the statistical functional, viewing F, the distribution itself, as a function of x. However, we will also abuse the notation and say θ = T(F) is a "parameter" even for nonparametric models.

2 Estimation

Given X_1, ..., X_n ∼ F ∈ H, an estimator θ̂_n is any function of X_1, ..., X_n that attempts to estimate a parameter θ. An estimator is consistent if

    \hat\theta_n \xrightarrow{P} \theta.    (3)

Because θ̂_n is a random variable, we can talk about its expectation

    E_\theta(\hat\theta_n)    (4)

where E_θ is w.r.t. the joint distribution f(x_1, ..., x_n; θ) = \prod_{i=1}^n f(x_i; θ). Then, the bias of the estimator is

    bias(\hat\theta_n) = E_\theta(\hat\theta_n) - \theta.    (5)

An estimator is unbiased if bias(θ̂_n) = 0. The standard error of an estimator is

    se(\hat\theta_n) = \sqrt{Var_\theta(\hat\theta_n)}.    (6)

The mean squared error of an estimator is

    mse(\hat\theta_n) = E_\theta(\hat\theta_n - \theta)^2.    (7)

Theorem 1 mse(θ̂_n) = bias²(θ̂_n) + se²(θ̂_n) = bias²(θ̂_n) + Var_θ(θ̂_n).

3 Maximum Likelihood

For parametric statistical models, a common estimator is the maximum likelihood estimator. Let x_1, ..., x_n be iid with PDF f(x; θ), where θ ∈ Θ. The likelihood function is

    L_n(\theta) = f(x_1, ..., x_n; \theta) = \prod_{i=1}^n f(x_i; \theta).    (8)

The log likelihood function is ℓ_n(θ) = log L_n(θ). The maximum likelihood estimator (MLE) is

    \hat\theta_n = \arg\max_{\theta \in \Theta} L_n(\theta) = \arg\max_{\theta \in \Theta} \ell_n(\theta).    (9)

Example 4 The MLE for p(head) from n coin flips is count(head)/n, sometimes called "estimating probability by the frequency." This is also true for multinomials. The MLE for X_1, ..., X_n ∼ N(µ, σ²) is µ̂ = (1/n) Σ_i X_i and σ̂² = (1/n) Σ_i (X_i − µ̂)². These agree with our intuition. However, the MLE does not always agree with intuition. For example, the MLE for X_1, ..., X_n ∼ uniform(0, θ) is θ̂ = max(X_1, ..., X_n). You would think θ is larger, no?
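The closed-form MLEs in Example 4 are easy to verify numerically. Below is a minimal Python/NumPy sketch (not part of the original notes; the true parameter values and sample size are illustrative assumptions) that simulates iid data and computes the frequency, sample-mean/variance, and sample-maximum estimators above.

```python
# A minimal sketch (not from the notes): numerically check the MLEs of Example 4.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000  # assumed sample size

# Coin flips: the MLE of p(head) is the frequency count(head)/n.
p_true = 0.3
flips = rng.binomial(1, p_true, size=n)
p_mle = flips.mean()

# Gaussian: the MLEs are the sample mean and the 1/n (biased) sample variance.
x = rng.normal(loc=2.0, scale=1.5, size=n)
mu_mle = x.mean()
sigma2_mle = np.mean((x - mu_mle) ** 2)   # divides by n, not n - 1

# uniform(0, theta): the MLE is the sample maximum, which never exceeds theta.
theta_true = 5.0
u = rng.uniform(0.0, theta_true, size=n)
theta_mle = u.max()

print(p_mle, mu_mle, sigma2_mle, theta_mle)
```

With large n each estimate lands close to the corresponding true value, and θ̂ = max(X_1, ..., X_n) illustrates the point above: it always sits slightly below the true θ.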
The MLE has several nice properties. The Kullback-Leibler divergence between two PDFs f and g is

    KL(f \| g) = \int f(x) \log \frac{f(x)}{g(x)} \, dx.    (10)

The model H is identifiable if ∀θ, ψ ∈ Θ, θ ≠ ψ implies KL(f(x; θ) ‖ f(x; ψ)) > 0. That is, different parameters correspond to different PDFs.

Theorem 2 When H is identifiable, under certain conditions (see Wasserman Theorem 9.13), the MLE θ̂_n →^P θ*, where θ* is the true value of the parameter θ. That is, the MLE is consistent.

Given n iid observations, the Fisher information is defined as

    I_n(\theta) = n E_\theta\!\left[ \left( \frac{\partial}{\partial\theta} \ln f(X; \theta) \right)^2 \right] = -n E_\theta\!\left[ \frac{\partial^2}{\partial\theta^2} \ln f(X; \theta) \right].    (11)

Example 5 Consider n iid observations x_i ∈ {0, 1} from a Bernoulli distribution with true parameter p, so that f(x; p) = p^x (1 − p)^{1−x}. It follows that ∂²/∂θ² ln f(X; θ), evaluated at p, is −x/p² − (1 − x)/(1 − p)². Taking the expectation over x under f(x; p) and multiplying by −n, we arrive at I_n(p) = n / (p(1 − p)).

Theorem 3 (Asymptotic Normality of the MLE) Let se = \sqrt{Var_θ(θ̂_n)}. Under appropriate regularity conditions, se ≈ \sqrt{1/I_n(θ)}, and

    \frac{\hat\theta_n - \theta}{se} \rightsquigarrow N(0, 1).    (12)

Furthermore, let ŝe = \sqrt{1/I_n(θ̂_n)}. Then

    \frac{\hat\theta_n - \theta}{\hat{se}} \rightsquigarrow N(0, 1).    (13)

Theorem 4 (Cramér-Rao Lower Bound) Let θ̂_n be any unbiased estimator (not necessarily the MLE) of θ. Then its variance is lower bounded by the inverse Fisher information:

    Var_\theta(\hat\theta_n) \geq \frac{1}{I_n(\theta)}.    (14)

The Fisher information can be generalized to the high-dimensional case. Let θ be a parameter vector. The Fisher information matrix has (i, j)-th element

    I_{ij}(\theta) = -E\!\left[ \frac{\partial^2 \ln f(X; \theta)}{\partial\theta_i \, \partial\theta_j} \right].    (15)

An unbiased estimator that achieves the Cramér-Rao lower bound is said to be efficient. It is asymptotically efficient if it achieves the bound as n → ∞.

Theorem 5 The MLE is asymptotically efficient.

4 Bayesian Inference

The statistical methods discussed so far are frequentist methods:

• Probability refers to limiting relative frequency.
• Data are random.
• Estimators are random because they are functions of data.
• Parameters are fixed, unknown constants not subject to probabilistic statements.
• Procedures are subject to probabilistic statements; for example, 95% confidence intervals trap the true parameter value 95% of the time.

An alternative is the Bayesian approach:

• Probability refers to degree of belief.
• Inference about a parameter θ is done by producing a probability distribution on it. Typically, one starts with a prior distribution p(θ). One also chooses a likelihood function p(x | θ); note this is a function of θ, not x. After observing data x, one applies Bayes' theorem to obtain the posterior distribution p(θ | x):

    p(\theta \mid x) = \frac{p(\theta) \, p(x \mid \theta)}{\int p(\theta') \, p(x \mid \theta') \, d\theta'} \propto p(\theta) \, p(x \mid \theta).    (16)
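When θ is one-dimensional, the posterior in Eq. (16) can be computed directly by evaluating the numerator p(θ)p(x | θ) on a grid and normalizing numerically. The sketch below is not from the notes; the Bernoulli likelihood, the uniform prior, and the data counts are assumptions chosen purely for illustration.

```python
# A minimal sketch (not from the notes): Eq. (16) by grid approximation,
# assuming a Bernoulli likelihood and a uniform prior on theta in [0, 1].
import numpy as np

theta = np.linspace(0.0, 1.0, 1001)   # grid over the parameter space
prior = np.ones_like(theta)           # p(theta): uniform prior (an assumption)

# Assumed data x: 7 heads out of 10 coin flips.
heads, n = 7, 10
likelihood = theta**heads * (1.0 - theta)**(n - heads)   # p(x | theta)

# Posterior: normalize p(theta) p(x | theta) by its integral over theta.
unnormalized = prior * likelihood
posterior = unnormalized / np.trapz(unnormalized, theta)

posterior_mean = np.trapz(theta * posterior, theta)   # approx. 0.667 here
theta_map = theta[np.argmax(posterior)]               # 0.7 under the flat prior
print(posterior_mean, theta_map)
```

Under the flat prior the posterior mode coincides with the MLE count(head)/n from Example 4, while the posterior mean is pulled slightly toward 1/2 by the prior.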