Pattern Classification

All materials in these slides were taken from Pattern Classification (2nd ed.) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley & Sons, 2000, with the permission of the authors and the publisher.

Chapter 3: Maximum-Likelihood & Bayesian Parameter Estimation
● Introduction
● Maximum-Likelihood Estimation
● Bayesian Estimation
● Curse of Dimensionality
● Component Analysis & Discriminants
● EM Algorithm

● Introduction
  ● Bayesian framework
  ● We could design an optimal classifier if we knew:
    ● P(ωi): the priors
    ● P(x | ωi): the class-conditional densities
    Unfortunately, we rarely have this complete information!
  ● Design a classifier from a set of labeled training samples (supervised learning)
    ● Assume the priors are known
    ● A sufficient number of training samples is needed to estimate the class-conditional densities, especially when the dimensionality of the feature space is large

● Assumption about the problem: a parametric model of P(x | ωi) is available
  ● Normality of P(x | ωi):
    P(x | ωi) ~ N(µi, Σi)
    ● Characterized by 2 parameters
  ● Estimation techniques
    ● Maximum-Likelihood (ML) estimation and Bayesian estimation
    ● The results of the two procedures are nearly identical, but the approaches are different

  ● Parameters in ML estimation are fixed but unknown! The Bayesian parameter estimation procedure, by its nature, utilizes whatever prior information is available about the unknown parameter
  ● MLE: the best parameters are obtained by maximizing the probability of obtaining the samples observed
  ● Bayesian methods view the parameters as random variables having some known prior distribution (how do we know the priors?)
  ● In either approach, we use P(ωi | x) for our classification rule!

● Maximum-Likelihood Estimation
  ● Has good convergence properties as the sample size increases: the estimated parameter value approaches the true value as n increases
  ● Simpler than alternative techniques
  ● General principle: assume we have c classes and
    P(x | ωj) ~ N(µj, Σj)
    P(x | ωj) ≡ P(x | ωj, θj), where
    $$\theta_j = (\mu_j, \Sigma_j) = \left(\mu_j^1, \mu_j^2, \ldots, \sigma_j^{11}, \sigma_j^{22}, \ldots, \operatorname{cov}(x_j^m, x_j^n), \ldots\right)$$
  ● Use the class ωj samples to estimate the class ωj parameters

  ● Use the information in the training samples to estimate θ = (θ1, θ2, …, θc); θi (i = 1, 2, …, c) is associated with the ith category
  ● Suppose the sample set D contains n i.i.d. samples x1, x2, …, xn; then
    $$P(D \mid \theta) = \prod_{k=1}^{n} P(x_k \mid \theta) = F(\theta)$$
    P(D | θ) is called the likelihood of θ w.r.t. the set of samples
  ● The ML estimate of θ is, by definition, the value θ̂ that maximizes P(D | θ): "it is the value of θ that best agrees with the actually observed training samples"
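Not part of the original slides: a minimal numerical sketch of the ML principle just stated, assuming univariate Gaussian data with known σ = 1 so that only µ is estimated. A crude grid search stands in for the analytic maximization derived on the following slides; the data, seed, and grid are arbitrary illustrations.

```python
# Illustrative sketch (assumed setup, not from the slides): the ML estimate is the
# theta that maximizes P(D | theta) = prod_k P(x_k | theta), equivalently the
# log-likelihood l(theta) = sum_k ln P(x_k | theta).
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=100)   # n i.i.d. samples, true mu = 2

def log_likelihood(mu, x, sigma=1.0):
    """l(mu) = sum_k ln N(x_k | mu, sigma^2), written out as on the slides."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (x - mu) ** 2 / (2 * sigma**2))

# Crude grid search for the maximizer; the slides derive it analytically below.
grid = np.linspace(0.0, 4.0, 4001)
mu_hat = grid[np.argmax([log_likelihood(m, data) for m in grid])]

print(mu_hat)       # ~= data.mean(): the ML estimate of the Gaussian mean
print(data.mean())
```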
● Optimal estimation
  ● Let θ = (θ1, θ2, …, θp)^t and let ∇θ denote the gradient operator
    $$\nabla_\theta = \left[\frac{\partial}{\partial\theta_1}, \frac{\partial}{\partial\theta_2}, \ldots, \frac{\partial}{\partial\theta_p}\right]^t$$
  ● Define l(θ) as the log-likelihood function: l(θ) = ln P(D | θ)
  ● New problem statement: determine the θ that maximizes the log-likelihood,
    $$\hat{\theta} = \arg\max_\theta\, l(\theta)$$
  ● The set of necessary conditions for an optimum is ∇θ l = 0, with
    $$\nabla_\theta l = \sum_{k=1}^{n} \nabla_\theta \ln P(x_k \mid \theta)$$

● Example of a specific case: unknown µ
  ● P(x | µ) ~ N(µ, Σ) (the samples are drawn from a multivariate normal population)
  ● θ = µ; from
    $$\ln P(x_k \mid \mu) = -\frac{1}{2}\ln\!\left[(2\pi)^d\,|\Sigma|\right] - \frac{1}{2}(x_k - \mu)^t\,\Sigma^{-1}(x_k - \mu)$$
    and
    $$\nabla_\mu \ln P(x_k \mid \mu) = \Sigma^{-1}(x_k - \mu),$$
    the ML estimate for µ must satisfy
    $$\sum_{k=1}^{n} \Sigma^{-1}(x_k - \hat{\mu}) = 0$$
  ● Multiplying by Σ and rearranging, we obtain
    $$\hat{\mu} = \frac{1}{n}\sum_{k=1}^{n} x_k,$$
    which is the arithmetic average, i.e., the mean of the training samples!
  ● Conclusion: given that P(x_k | ωj), j = 1, 2, …, c, is Gaussian in a d-dimensional feature space, estimate the vector θ = (θ1, θ2, …, θc)^t and perform the classification using the Bayes decision rule of Chapter 2!

● ML Estimation: the univariate Gaussian case with unknown µ and σ
  ● θ = (θ1, θ2) = (µ, σ²)
    $$l = \ln P(x_k \mid \theta) = -\frac{1}{2}\ln(2\pi\theta_2) - \frac{1}{2\theta_2}(x_k - \theta_1)^2$$
    $$\nabla_\theta l = \begin{bmatrix} \frac{\partial}{\partial\theta_1}\ln P(x_k \mid \theta) \\ \frac{\partial}{\partial\theta_2}\ln P(x_k \mid \theta) \end{bmatrix} = 0 \;\Rightarrow\; \begin{cases} \dfrac{1}{\theta_2}(x_k - \theta_1) = 0 \\ -\dfrac{1}{2\theta_2} + \dfrac{(x_k - \theta_1)^2}{2\theta_2^2} = 0 \end{cases}$$
  ● Summing over all n samples:
    $$\sum_{k=1}^{n} \frac{1}{\hat{\theta}_2}\,(x_k - \hat{\theta}_1) = 0 \qquad (1)$$
    $$-\sum_{k=1}^{n} \frac{1}{\hat{\theta}_2} + \sum_{k=1}^{n} \frac{(x_k - \hat{\theta}_1)^2}{\hat{\theta}_2^{\,2}} = 0 \qquad (2)$$
  ● Combining (1) and (2), one obtains:
    $$\hat{\mu} = \frac{1}{n}\sum_{k=1}^{n} x_k\,; \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{k=1}^{n}(x_k - \hat{\mu})^2$$

● Bias
  ● The ML estimate for σ² is biased:
    $$E\!\left[\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\right] = \frac{n-1}{n}\,\sigma^2 \neq \sigma^2$$
  ● An unbiased estimator for Σ is the sample covariance matrix
    $$C = \frac{1}{n-1}\sum_{k=1}^{n}(x_k - \hat{\mu})(x_k - \hat{\mu})^t$$
    (a numerical check of this bias appears in the sketch at the end of this section)

● Bayesian Estimation (the Bayesian learning approach for pattern classification problems)
  ● In MLE, θ was supposed to have a fixed value
  ● In BE, θ is a random variable
  ● The computation of the posterior probabilities P(ωi | x) lies at the heart of Bayesian classification
  ● Goal: compute P(ωi | x, D) given the training sample set D
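Not part of the original slides: a minimal Monte Carlo sketch checking the closed-form ML estimates derived above and the bias result, i.e., that dividing the sum of squared deviations by n underestimates σ² by a factor of (n−1)/n, while dividing by n−1 is unbiased. The sample size, seed, and number of trials are arbitrary illustrations.

```python
# Sketch (assumed setup) verifying: mu_hat = (1/n) sum x_k,
# sigma2_hat = (1/n) sum (x_k - mu_hat)^2 with E[sigma2_hat] = ((n-1)/n) sigma^2,
# while the n-1 denominator (the sample variance C in 1-D) is unbiased.
import numpy as np

rng = np.random.default_rng(1)
true_mu, true_var, n = 0.0, 4.0, 10

ml_vars, unbiased_vars = [], []
for _ in range(100_000):                       # Monte Carlo estimate of the expectations
    x = rng.normal(true_mu, np.sqrt(true_var), size=n)
    mu_hat = x.mean()                          # ML estimate of the mean
    ml_vars.append(np.sum((x - mu_hat) ** 2) / n)              # biased ML variance
    unbiased_vars.append(np.sum((x - mu_hat) ** 2) / (n - 1))  # unbiased estimator

print(np.mean(ml_vars))        # ~= (n-1)/n * sigma^2 = 3.6: biased low
print(np.mean(unbiased_vars))  # ~= sigma^2 = 4.0
```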