UCLA STAT 231 - MLE, Bayes Learning, and Maximum Entropy

Lecture note for Stat 231: Pattern Recognition and Machine Learning
Lecture 3: MLE, Bayes Learning, and Maximum Entropy

Objective: learning the prior and class models, both the parameters and the formulation (forms), from training data for classification.
1. Introduction to some general concepts
2. Maximum likelihood estimation (MLE)
3. Recursive Bayes learning
4. Maximum entropy principle

Learning by MLE

In Bayesian decision theory, we construct an optimal decision rule under the assumption that the prior and class-conditional probabilities are known. In this lecture, we move one step further and study how we may learn these probabilities from training data.

Given: a set of training data with labels D = { (x_1, c_1), (x_2, c_2), ..., (x_N, c_N) }.
Goal: estimate (learn) the prior p(w_i) and the conditional probabilities p(x | w_i), i = 1, 2, ..., k.

Basic assumptions:
1) There is an underlying frequency f(x, w) for the variables x and w jointly, and the training data are independent samples from f(x, w).
2) We know the probability family for each p(x | w_i), i = 1, 2, ..., k; each family is specified by a vector-valued parameter θ (a parametric method).
3) The models for different classes can be learned independently, e.g. there is no correlation between salmon and sea bass in the training data.

Terminology clarification

1. Supervised vs. unsupervised learning:
In supervised learning, the data are labeled manually. In unsupervised learning, the computer has to discover the number of classes, label the data, and estimate the class models in an iterative way.

2. Parametric vs. non-parametric methods:
In a parametric method, the probability model is specified by a set of parameters of more or less fixed size, for example a Gaussian distribution.
In a non-parametric method, the probability model is often specified by the samples themselves. If we treat the samples as parameters, the number of parameters typically grows linearly with the size of the training set |D|.

3. Frequency vs. probability (model):
For a learning problem, we always assume that there exists an underlying frequency f(x) that is objective and intrinsic to the problem domain, for example the fish-length distribution of salmon in Alaska. It is not directly observable; we can only draw a finite set of samples from it.
In contrast, what we have in practice is a probability p(x), an estimate of f(x) based on the finite data. This is called a "model". A model is subjective and only approximately true, as it depends on our experience (data), purpose, and choice of model family. "All models are wrong, but some are useful."

Problem formulation

For clarity of notation, we drop the class label and estimate each class model separately.

Given: a set of training data D = { x_1, x_2, ..., x_N }, independent samples from f(x) for a class w_i.
Objective: learn a model p(x) from D as an estimate of f(x).
Assumption: p(x) comes from a probability family specified by parameters θ. Denote p(x) by p(x; θ) and the family by Ωθ. The objective is thus to estimate θ.

Formulation: we choose θ to minimize a "distance measure" between f(x) and p(x; θ):

    θ* = argmin_{θ ∈ Ωθ} ∫ f(x) log [ f(x) / p(x; θ) ] dx

This is the Kullback-Leibler divergence from information theory. One may choose other distance measures, such as

    ∫ | f(x) − p(x; θ) |^2 dx,

but the KL divergence has many interpretations and is easy to compute, so people fall in love with it.
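As a concrete illustration of this formulation (not part of the original notes), the following minimal Python sketch discretizes a hypothetical true frequency f(x) on a grid and scans a one-dimensional Gaussian-mean parameter θ, picking the θ that minimizes the KL divergence to f. The grid, the parameter range, and all names are illustrative assumptions.

```python
import numpy as np

# Hypothetical "true" frequency f(x): a Gaussian with mean 2.0 and std 1.0,
# discretized on a grid so the integral becomes a finite sum.
x = np.linspace(-6.0, 10.0, 2001)
dx = x[1] - x[0]

def gaussian(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

f = gaussian(x, mu=2.0, sigma=1.0)

def kl_divergence(f, p, dx):
    """Discretized KL(f || p) = sum_x f(x) * log(f(x) / p(x)) * dx."""
    mask = f > 0                      # treat 0 * log(0 / p) as 0
    return np.sum(f[mask] * np.log(f[mask] / p[mask]) * dx)

# Model family p(x; theta): Gaussians with unknown mean theta and std fixed at 1.
thetas = np.linspace(-2.0, 6.0, 161)
kls = [kl_divergence(f, gaussian(x, mu=th, sigma=1.0), dx) for th in thetas]

theta_star = thetas[int(np.argmin(kls))]
print(f"theta* minimizing KL(f || p(.; theta)) = {theta_star:.2f}")  # close to 2.0
```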
Maximum Likelihood Estimate (MLE)

    θ* = argmin_{θ ∈ Ωθ} ∫ f(x) log [ f(x) / p(x; θ) ] dx
       = argmin_{θ ∈ Ωθ} { E_f[ log f(x) ] − E_f[ log p(x; θ) ] }
       = argmax_{θ ∈ Ωθ} E_f[ log p(x; θ) ]
       = argmax_{θ ∈ Ωθ} { (1/N) Σ_{i=1}^{N} log p(x_i; θ) + O(ε_N) }

The above derivation is basically an explanation for the popular MLE. In the last step, we replace the expectation (mean) under f by a sample mean. The MLE finds the "best" parameter that maximizes the likelihood of the data:

    θ* = argmax_{θ ∈ Ωθ} Σ_{i=1}^{N} log p(x_i; θ)

In fact, you should remember that nearly all learning problems start from this formulation!

MLE example

We denote the log-likelihood as a function of θ:

    l(θ) = Σ_{i=1}^{N} log p(x_i; θ)

θ* is computed by solving the equation

    dl(θ) / dθ = 0.

For example, the Gaussian family gives a closed-form solution (a code sketch appears at the end of these notes).

MLE Summary

The MLE computes one point in the probability family Ωθ.

[Figure: the family Ωθ drawn as a set containing the point p(x; θ*), with the true frequency f(x) lying outside the set.]

It treats θ as a deterministic quantity. The problems are:
1) It does not preserve the full uncertainty in estimating θ (see, for example, the figure on the preceding slide).
2) It is difficult to integrate new data incrementally. For example, if new data arrive after the MLE, how do we update θ?

Bayes Learning

The Bayes learning method takes a different view from MLE. It treats θ as a random variable and therefore estimates a probability distribution over θ. We now denote the class probability by p(x | θ), in contrast to p(x; θ). Instead of computing a single θ*, we compute the posterior probability from the data set D. As the samples in D are independent, we have

    p(θ | D) = p(D | θ) p(θ) / p(D) = [ ∏_{i=1}^{N} p(x_i | θ) ] p(θ) / p(D)

In the above equation, p(θ) is a prior distribution for θ. In the absence of a priori knowledge about θ, we can set it to a uniform distribution. It is then trivial to show that the MLE θ* is also a maximum posterior estimate.

Recursive Bayes Learning

Suppose we observe a new data set Dnew = { x_{N+1}, ..., x_{N+m} } after learning the posterior p(θ | D). We can treat p(θ | D) as our prior model and compute

    p(θ | D, Dnew) = p(Dnew | θ) p(θ | D) / p(Dnew)
                   = [ ∏_{i=N+1}^{N+m} p(x_i | θ) ] p(θ | D) / p(Dnew)
                   = [ ∏_{i=N+1}^{N+m} p(x_i | θ) ] p(D | θ) p(θ) / p(D, Dnew)

In the above equations, p(D) and p(Dnew) are treated as constants. Clearly, this is equivalent to learning from the two data sets D and Dnew pooled together. Therefore, when the data come in batches, we can apply the Bayes rule recursively to learn the posterior probability of θ. The posterior becomes sharper and sharper as the number of samples increases (see the sketch at the end of these notes).
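To make the MLE example above concrete, here is a minimal sketch (not part of the original notes): for the Gaussian family, solving dl(θ)/dθ = 0 yields the sample mean and the (biased) sample variance in closed form. The sample data and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative training set D: N independent samples from an (unknown) Gaussian.
D = rng.normal(loc=2.0, scale=1.5, size=200)

# For p(x; theta) Gaussian with theta = (mu, sigma^2), dl/dtheta = 0 gives:
mu_mle = D.mean()                      # closed-form MLE of the mean
var_mle = ((D - mu_mle) ** 2).mean()   # closed-form (biased) MLE of the variance

# Sanity check: the closed form does maximize the log-likelihood l(theta).
def log_likelihood(D, mu, var):
    return np.sum(-0.5 * np.log(2 * np.pi * var) - (D - mu) ** 2 / (2 * var))

print(f"mu_MLE = {mu_mle:.3f}, var_MLE = {var_mle:.3f}")
print("l(theta_MLE) >= l(nearby theta):",
      log_likelihood(D, mu_mle, var_mle) >= log_likelihood(D, mu_mle + 0.1, var_mle))
```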


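The recursive Bayes update can likewise be illustrated with a conjugate case that the preview does not spell out: assume a Gaussian likelihood p(x | θ) with known variance and a broad Gaussian prior on the mean θ. Updating on D and then on Dnew gives the same posterior as a single update on the pooled data, and the posterior variance shrinks as samples accumulate. A sketch under these assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2 = 1.0                                  # known observation variance

def update(prior_mean, prior_var, data, sigma2):
    """Conjugate Normal-Normal update of p(theta | data) for Gaussian data with known variance."""
    post_prec = 1.0 / prior_var + len(data) / sigma2
    post_var = 1.0 / post_prec
    post_mean = post_var * (prior_mean / prior_var + data.sum() / sigma2)
    return post_mean, post_var

D     = rng.normal(loc=2.0, scale=np.sqrt(sigma2), size=50)   # first batch
D_new = rng.normal(loc=2.0, scale=np.sqrt(sigma2), size=50)   # second batch

# Recursive: broad prior -> p(theta | D) -> p(theta | D, D_new)
m1, v1 = update(0.0, 100.0, D, sigma2)        # broad prior, nearly uniform
m2, v2 = update(m1, v1, D_new, sigma2)        # p(theta | D) acts as the new prior

# Pooled: one update on D and D_new together
m_pool, v_pool = update(0.0, 100.0, np.concatenate([D, D_new]), sigma2)

print(f"recursive: mean = {m2:.4f}, var = {v2:.5f}")
print(f"pooled   : mean = {m_pool:.4f}, var = {v_pool:.5f}")   # identical
print("posterior sharpens with more data:", v2 < v1)
```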