Linear Classification: Probabilistic Generative Models
Sargur N. Srihari
University at Buffalo, State University of New York, USA
Machine Learning

Topics
1. Overview (Generative vs Discriminative)
2. Bayes Classifier
   • using Logistic Sigmoid and Softmax
3. Continuous inputs
   • Gaussian-distributed class-conditionals
     – Parameter estimation
4. Discrete Features
5. Exponential Family

Overview of Methods for Classification
1. Generative Models (two-step)
   1. Infer class-conditional densities p(x|Ck) and priors p(Ck)
   2. Use Bayes theorem to determine posterior probabilities
2. Discriminative Models (one-step)
   – Directly infer posterior probabilities p(Ck|x)
• Decision Theory
   – In both cases, use decision theory to assign each new x to a class

Generative Model
• Model class-conditionals p(x|Ck) and priors p(Ck)
• Compute posteriors p(Ck|x) from Bayes theorem, where
  p(x) = Σi p(x, Ci) = Σi p(x|Ci) p(Ci)
• Two-class case: the posterior for class C1 is
  p(C1|x) = p(x|C1)p(C1) / [p(x|C1)p(C1) + p(x|C2)p(C2)] = σ(a)
  since a = ln[ p(x|C1)p(C1) / (p(x|C2)p(C2)) ],
  i.e., a is the log-likelihood ratio (LLR) with Bayes odds

Logistic Sigmoid Function
• Sigmoid: an "S"-shaped or squashing function that maps real a ∈ (−∞, +∞) to the finite interval (0, 1):
  σ(a) = 1 / (1 + exp(−a))
• Property: σ(−a) = 1 − σ(a)
• Inverse: a = ln[ σ / (1 − σ) ]
• If σ(a) = P(C1|x), the inverse represents ln[ p(C1|x) / p(C2|x) ],
  the log ratio of probabilities, called the logit or log odds
• Note: the dotted line in the figure is a scaled probit function, the cdf of a zero-mean, unit-variance Gaussian

Generalizations and Special Cases
• More than 2 classes
• Gaussian distribution of x
• Discrete features
• Exponential family

Softmax: Generalization of Logistic Sigmoid
• For K = 2 we have the logistic sigmoid
• For K > 2 we have
  p(Ck|x) = p(x|Ck) p(Ck) / Σj p(x|Cj) p(Cj) = exp(ak) / Σj exp(aj)
  – Quantities ak are defined by ak = ln[ p(x|Ck) p(Ck) ]
• Known as the softmax function
  – Since it is a smoothed max function:
    if ak ≫ aj for all j ≠ k, then p(Ck|x) ≈ 1 and ≈ 0 for the rest
  – A general technique for finding the max of several ak
• If K = 2 this reduces to the earlier form:
  p(C1|x) = exp(a1) / [exp(a1) + exp(a2)]
          = 1 / [1 + exp(a2 − a1)]
          = 1 / [1 + exp( ln p(x|C2)p(C2) − ln p(x|C1)p(C1) )]
          = 1 / [1 + p(x|C2)p(C2) / (p(x|C1)p(C1))]
          = 1 / [1 + exp(−a)], where a = ln[ p(x|C1)p(C1) / (p(x|C2)p(C2)) ]

Generative Model with Gaussian Input
• Gaussian class-conditional densities with the same covariance matrix:
  p(x|Ck) = 1 / [ (2π)^(D/2) |Σ|^(1/2) ] · exp{ −½ (x − μk)ᵀ Σ⁻¹ (x − μk) }
• Two-class case:
  p(C1|x) = σ( ln[ p(x|C1)p(C1) / (p(x|C2)p(C2)) ] ) = σ(wᵀx + w0)
  – where
    w = Σ⁻¹(μ1 − μ2)
    w0 = −½ μ1ᵀ Σ⁻¹ μ1 + ½ μ2ᵀ Σ⁻¹ μ2 + ln[ p(C1) / p(C2) ]
  – Quadratic terms in x cancel due to the common covariance!
  – A linear function of x appears in the argument of the logistic sigmoid!

Two Gaussian Classes
• Two-dimensional input space x = (x1, x2)
• Class-conditional densities p(x|Ck): values are positive (need not sum to 1)
• Posterior probability p(C1|x): a logistic sigmoid of a linear function of x, with a linear decision boundary
• In the figure, red ink is proportional to p(C1|x) and blue ink to p(C2|x) = 1 − p(C1|x); values range from 1 to 0

Continuous Case with K > 2
• With Gaussian class-conditionals,
  p(Ck|x) = p(x|Ck) p(Ck) / Σj p(x|Cj) p(Cj) = exp(ak) / Σj exp(aj)
  – where
    ak(x) = wkᵀx + wk0
    wk = Σ⁻¹ μk
    wk0 = −½ μkᵀ Σ⁻¹ μk + ln p(Ck)
  – Quadratic terms cancel, thereby leading to linearity
• If we do not assume a shared covariance matrix, we get a quadratic discriminant

Three-Class Case with Gaussian Models
• Class-conditional densities: C1 and C2 have the same covariance matrix
• Posterior probabilities: between C1 and C2 the boundary is linear; the others are quadratic
• Both linear and quadratic decision boundaries; RGB values in the figure correspond to the posterior probabilities
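The two-class Gaussian result above can be sketched in a few lines of NumPy. The means, covariance, and priors below are illustrative values (not from the slides); the code only demonstrates that w = Σ⁻¹(μ1 − μ2) and w0 plugged into the sigmoid reproduce the posterior:

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid: maps a in (-inf, inf) to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

# Illustrative parameters: two Gaussian classes sharing one covariance matrix
mu1 = np.array([1.0, 1.0])
mu2 = np.array([-1.0, -1.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 1.0]])
prior1, prior2 = 0.5, 0.5

Sigma_inv = np.linalg.inv(Sigma)

# Linear discriminant from the slides: w = Sigma^{-1} (mu1 - mu2)
w = Sigma_inv @ (mu1 - mu2)
# w0 = -1/2 mu1^T S^{-1} mu1 + 1/2 mu2^T S^{-1} mu2 + ln[p(C1)/p(C2)]
w0 = (-0.5 * mu1 @ Sigma_inv @ mu1
      + 0.5 * mu2 @ Sigma_inv @ mu2
      + np.log(prior1 / prior2))

# Posterior p(C1|x) = sigma(w^T x + w0): linear in x inside the sigmoid
x = np.array([0.5, 0.2])
p_c1 = sigmoid(w @ x + w0)
```

Because the quadratic terms cancel, this matches a direct Bayes computation with the full Gaussian densities.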
M.L.E. for Gaussian Parameters
• Assuming parametric forms for p(x|Ck), we can determine the values of the parameters and the priors p(Ck) using maximum likelihood, given targets t = (t1,..,tN)ᵀ
• It is convenient to maximize the log of the likelihood

Max Likelihood for Prior and Means
• Estimate for the prior probability: the MLE for π is the fraction of points assigned to class C1
• Estimates for the class means: the mean of all input vectors xn assigned to class Ck

Max Likelihood for Covariance Matrix
• Pick out the terms in the log-likelihood function depending on Σ
• The solution for the shared covariance matrix Σ is a weighted average of the two separate class covariance matrices

Discrete Features
• Assuming binary features xi ∈ {0, 1}: with D inputs, a general distribution is a table of 2^D values
• Naive Bayes assumption: independent features
• Class-conditional distributions then have the form
  p(x|Ck) = Π(i=1..D) μki^xi (1 − μki)^(1−xi)
• Substituting into the form needed for the normalized exponential:
  ak(x) = ln[ p(x|Ck) p(Ck) ] = Σ(i=1..D) { xi ln μki + (1 − xi) ln(1 − μki) } + ln p(Ck)
  which is linear in x
• Similar results hold for discrete variables which take M > 2 values

Exponential Family
• Class-conditionals that belong to the exponential family have the general form
  p(x|η) = h(x) g(η) exp{ ηᵀ u(x) }
  – where η are the natural parameters of the distribution, u(x) is a function of x, and g(η) is a coefficient that ensures the distribution is normalized
  – Bernoulli, Multinomial and Gaussian distributions belong to this family
• For K > 2, with class-conditionals
  p(x|λk) = h(x) g(λk) exp{ λkᵀ u(x) }
  we get
  ak(x) = λkᵀ x + ln g(λk) + ln p(Ck)
  which is linear in x

Summary of Probabilistic Linear Classifiers
• Defined using
  – the logistic sigmoid: P(C1|x) = σ(a), where a is the LLR with Bayes odds
  – the softmax function: P(Ck|x) = exp(ak) / Σj exp(aj)
• In the continuous case with shared covariance, we get linear functions of the input x
• The discrete case with independent features also results in linear functions

Generative vs Discriminative Training
• Independent variables x = {x1,..,xn} and binary target y
1. Generative: estimate the CPD parameters of the joint P(y, x)
2. Discriminative: estimate the CRF parameters wi
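The maximum-likelihood estimates described above (priors as class fractions, class means as per-class averages, and the shared covariance as a weighted average of the per-class covariances) can be sketched as follows; the tiny labeled data set is made up purely for illustration:

```python
import numpy as np

# Toy labeled data: rows of X are inputs x_n; t_n = 1 means class C1
X = np.array([[1.2, 0.9], [0.8, 1.1], [1.0, 1.3],
              [-1.1, -0.8], [-0.9, -1.2]])
t = np.array([1, 1, 1, 0, 0])

N = len(t)
X1, X2 = X[t == 1], X[t == 0]
N1, N2 = len(X1), len(X2)

# MLE for the prior: fraction of points assigned to class C1
pi = N1 / N

# MLE for the class means: mean of the input vectors in each class
mu1 = X1.mean(axis=0)
mu2 = X2.mean(axis=0)

# Shared covariance: weighted average of the per-class covariances
S1 = (X1 - mu1).T @ (X1 - mu1) / N1
S2 = (X2 - mu2).T @ (X2 - mu2) / N2
Sigma = (N1 / N) * S1 + (N2 / N) * S2
```

These estimates can then be plugged into w = Σ⁻¹(μ1 − μ2) and w0 to classify new points.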
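As a worked instance of the discrete-features case, here is a minimal naive-Bayes sketch with binary features; the Bernoulli parameters μki and priors are illustrative values, and the softmax over the ak gives the posteriors:

```python
import numpy as np

# Illustrative mu_ki = p(x_i = 1 | C_k) for K = 2 classes, D = 3 binary features
mu = np.array([[0.9, 0.7, 0.2],   # class C1
               [0.1, 0.4, 0.8]])  # class C2
priors = np.array([0.5, 0.5])

def a_k(x, k):
    """a_k(x) = sum_i [x_i ln mu_ki + (1-x_i) ln(1-mu_ki)] + ln p(C_k): linear in x."""
    return np.sum(x * np.log(mu[k]) + (1 - x) * np.log(1 - mu[k])) + np.log(priors[k])

def posterior(x):
    """Softmax over the a_k values gives p(C_k | x)."""
    a = np.array([a_k(x, k) for k in range(len(priors))])
    e = np.exp(a - a.max())          # subtract max for numerical stability
    return e / e.sum()

x = np.array([1, 1, 0])
p = posterior(x)                     # p[k] = p(C_k | x)
```

For x = (1, 1, 0) the class-C1 likelihood dominates, so p(C1|x) is the larger posterior.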