Machine learning: lecture 5
Tommi S. Jaakkola
MIT AI Lab
[email protected]

• Classification
  – regression approach to classification
  – elementary decision theory
  – Fisher linear discriminant
  – generative probabilistic classifiers
  – discriminative classifiers: logistic regression

Classification

Example: digit recognition (8x8 binary digits)

  binary digit (8x8 image)   actual label   target label in learning
  [image]                    "2"            1
  [image]                    "2"            1
  [image]                    "1"            0
  [image]                    "1"            0
  ...                        ...            ...

Classification via regression

• Suppose we ignore the fact that the output is binary (e.g., 0/1) rather than a continuous variable. In this case we can estimate a linear regression function

    f(x; w) = w_0 + w_1 x_1 + \dots + w_d x_d

  simply by minimizing the squared difference between the predicted output (continuous) and the observed label (binary):

    J_n(w) = \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i; w))^2

Classification via regression cont'd

• We can use the resulting regression function

    f(x; \hat{w}) = \hat{w}_0 + \hat{w}_1 x_1 + \dots + \hat{w}_d x_d

  to classify any new (test) example x. For example, it seems sensible to say that

    label = 1 if f(x; \hat{w}) > 0.5, and label = 0 otherwise,

  where f(x; \hat{w}) = 0.5 defines the decision boundary.

Classification via regression cont'd

• This is not optimal...

  [Figure: two 2D examples of the resulting decision boundary: sometimes good, sometimes bad]

A bit of decision theory

• Suppose we know the distribution of examples in each class, i.e., we know the class-conditional densities p(x|y=0) and p(x|y=1). How do we decide (optimally) which class a new example x_0 should belong to?

  [Figure: overlapping class 0 density p(x|y=0) and class 1 density p(x|y=1)]

  The optimal decisions, in the sense of the lowest possible misclassification error, are based on the log-likelihood ratio:

    y = 1 if \log \frac{p(x_0|y=1)}{p(x_0|y=0)} > 0, and y = 0 otherwise.

Decision theory cont'd

• When the examples fall more often in one class than the other, we have to modify the decision rule a bit:

    y = 1 if \log \frac{p(x_0|y=1) P(y=1)}{p(x_0|y=0) P(y=0)} > 0, and y = 0 otherwise.

• More generally, the Bayes optimal decisions are given by

    y_0 = \arg\max_{y=0,1} \{ p(x_0|y) P(y) \} = \arg\max_{y=0,1} \{ P(y|x_0) \}

  (this is optimal only if we have the correct densities and prior frequencies).

Linear regression and projections

• Evaluating any linear regression function (here in 2D)

    f(x; w) = w_0 + w_1 x_1 + w_2 x_2 = w_0 + x^T \tilde{w}

  amounts to projecting the points x = [x_1\ x_2]^T onto a line parallel to \tilde{w}.

  [Figure: 2D points, the direction \tilde{w}, and the level sets f(x;w) = 0, f(x;w) = 0.5, f(x;w) = 1]

  Since we are primarily interested in how the points in the two classes are separated by this projection, we can temporarily forget the bias/offset term w_0.

Beyond regression

• By varying the lines (or \tilde{w}) we get different levels of separation between the classes.

  [Figure: four 2D scatter plots of the same two classes projected onto different directions]

Fisher linear discriminant

• We find a direction \tilde{w} in the input space such that the projected points become "well-separated".

  [Figure: 2D scatter plot of the two classes and a candidate projection direction]

• Some notation:
  class 0: n_0 samples, mean \mu_0, covariance \Sigma_0
  class 1: n_1 samples, mean \mu_1, covariance \Sigma_1

Fisher linear discriminant cont'd

• Estimation criterion: we find \tilde{w} that maximizes

    J_{Fisher}(\tilde{w}) = \frac{(\text{separation of projected means})^2}{\text{sum of within-class variances}}
                          = \frac{(\tilde{w}^T \mu_1 - \tilde{w}^T \mu_0)^2}{\tilde{w}^T (n_1 \Sigma_1 + n_0 \Sigma_0) \tilde{w}}

• The solution

    \tilde{w} \propto (n_1 \Sigma_1 + n_0 \Sigma_0)^{-1} (\mu_1 - \mu_0)

  is Bayes optimal for two normal populations with equal covariances (\Sigma_1 = \Sigma_0).

  [Figure: the resulting projection direction for the two-class example]
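Two of the estimators above are easy to make concrete. First, the regression-based classifier from the "Classification via regression" slides: a minimal NumPy sketch (not from the lecture; function names and usage are illustrative only) that fits the least-squares weights and thresholds f(x; \hat{w}) at 0.5.

```python
import numpy as np

def fit_least_squares(X, y):
    """Fit f(x; w) = w_0 + w_1 x_1 + ... + w_d x_d by minimizing
    J_n(w) = (1/n) sum_i (y_i - f(x_i; w))^2, treating the 0/1
    labels as continuous targets."""
    Xb = np.hstack([np.ones((len(X), 1)), X])   # prepend 1s so w[0] is the bias w_0
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)  # least-squares solution
    return w

def classify_regression(X, w):
    """label = 1 where f(x; w) > 0.5; f(x; w) = 0.5 is the decision boundary."""
    Xb = np.hstack([np.ones((len(X), 1)), X])
    return (Xb @ w > 0.5).astype(int)
```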
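Second, the Fisher direction just derived. The sketch below (again with illustrative names) solves (n_1 \Sigma_1 + n_0 \Sigma_0) \tilde{w} = \mu_1 - \mu_0 directly rather than forming the matrix inverse.

```python
import numpy as np

def fisher_direction(X0, X1):
    """Fisher linear discriminant direction for two classes,
    w ~ (n1*Sigma1 + n0*Sigma0)^{-1} (mu1 - mu0).

    X0: (n0, d) samples of class 0; X1: (n1, d) samples of class 1.
    """
    n0, n1 = len(X0), len(X1)
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # bias=True gives the (1/n)-normalized sample covariance
    S0 = np.cov(X0, rowvar=False, bias=True)
    S1 = np.cov(X1, rowvar=False, bias=True)
    Sw = n1 * S1 + n0 * S0                # weighted within-class scatter
    w = np.linalg.solve(Sw, mu1 - mu0)    # solve the linear system instead of inverting
    return w / np.linalg.norm(w)          # only the direction matters; normalize
```

Projecting the data onto this \tilde{w} and thresholding the projections gives the classifier; per the slide, with equal class covariances this direction agrees with the Bayes-optimal linear rule.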
Fisher linear discriminant analysis: example

• Binary digits "1" versus "7"

  [Figure: the estimated \tilde{w} displayed as an 8x8 image; roughly speaking, \tilde{w} is the elementwise difference of the two class mean images]

Generative and discriminative classification

• To further refine our classification approach we can adopt one of two general frameworks:

  1. Generative (model p(x|y))
     – directly build class-conditional densities over the multi-dimensional input examples
     – classify new examples based on the densities

  2. Discriminative (only model P(y|x))
     – only model decisions given the input examples; no model is constructed over the input examples

Generative approach to classification

• We can directly model each class-conditional population with a multivariate normal (Gaussian) distribution:

    x ∼ N(\mu_1, \Sigma_1) for y = 1
    x ∼ N(\mu_0, \Sigma_0) for y = 0

  where

    p(x|\mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right\}

  and x = [x_1, \dots, x_d]^T.

Mixture classifier: decisions

• Examples x are classified on the basis of which Gaussian explains the data better:

    y = 1 if \log \frac{p(x|\mu_1, \Sigma_1)}{p(x|\mu_0, \Sigma_0)} > 0, and y = 0 otherwise;

  or, more generally, when the classes have different a priori probabilities, we use the posterior probability

    P(y=1|x) ∝ p(x|\mu_1, \Sigma_1) P(y=1)

• The corresponding decision boundaries are

    \log \frac{p(x|\mu_1, \Sigma_1)}{p(x|\mu_0, \Sigma_0)} = 0, or equivalently P(y=1|x) = 0.5

Mixture classifier: decision rule

• Equal covariances:

    x ∼ N(\mu_1, \Sigma) for y = 1
    x ∼ N(\mu_0, \Sigma) for y = 0

  [Figure: two Gaussian classes with equal covariances]

• The decision boundary is linear.

Mixture classifier: decision rule

• Unequal covariances:

    x ∼ N(\mu_1, \Sigma_1) for y = 1
    x ∼ N(\mu_0, \Sigma_0) for y = 0

  [Figure: two Gaussian classes with unequal covariances]

• The decision boundary is quadratic.

Maximum likelihood estimation

• We can estimate the class-conditional densities p(x|\mu, \Sigma) separately. For the multivariate Gaussian model

    p(x|\mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right\}

  the maximum likelihood estimates of the parameters based on a random sample \{x_1, \dots, x_n\} are given by the sample mean and sample covariance:

    \hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad \hat{\Sigma} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu})(x_i - \hat{\mu})^T

Discriminative classification

• If we are only interested in the classification decisions, why should we bother with a model over the input examples?

  [Figure: two-class Gaussian data; only the decision boundary matters for classification]

• We could try to directly estimate the conditional distribution of labels given the examples, P(y|x, \theta), where \theta = \{\mu_0, \mu_1, \Sigma_0, \Sigma_1\}.

Back to the Gaussians... (1-dim)

• When the classes are equally likely a priori, the ...
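The generative pipeline above fits each class separately by maximum likelihood and then compares prior-weighted likelihoods. A minimal sketch, assuming NumPy and SciPy (function names and the synthetic demo data are mine, not the lecture's):

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gaussian_ml(X):
    """ML estimates for one class: sample mean and (1/n)-normalized covariance."""
    mu = X.mean(axis=0)
    Sigma = (X - mu).T @ (X - mu) / len(X)
    return mu, Sigma

def classify_generative(x, params0, params1, prior1=0.5):
    """y = 1 iff log p(x|mu1,S1)P(y=1) - log p(x|mu0,S0)P(y=0) > 0."""
    mu0, S0 = params0
    mu1, S1 = params1
    log_odds = (multivariate_normal.logpdf(x, mu1, S1) + np.log(prior1)
                - multivariate_normal.logpdf(x, mu0, S0) - np.log(1.0 - prior1))
    return int(log_odds > 0)

# Usage sketch on synthetic 2D data:
rng = np.random.default_rng(0)
X0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(200, 2))  # class 0 sample
X1 = rng.normal(loc=[2.0, 2.0], scale=1.0, size=(200, 2))  # class 1 sample
print(classify_generative(np.array([1.5, 1.8]),
                          fit_gaussian_ml(X0), fit_gaussian_ml(X1)))
```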
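The last slide breaks off mid-sentence, but the direction it points to is standard: for two 1-dim Gaussians with equal variance, the posterior P(y=1|x) is the logistic (sigmoid) function of a log-odds that is linear in x, which is exactly the functional form logistic regression (next on the lecture outline) models directly. A sketch of that identity, in my own notation rather than the slide's missing continuation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def posterior_equal_var_1d(x, mu0, mu1, sigma, prior1=0.5):
    """P(y=1|x) for x ~ N(mu1, sigma^2) under y=1 and N(mu0, sigma^2) under y=0.

    The log-odds log [p(x|y=1)P(y=1)] / [p(x|y=0)P(y=0)] simplify to a
    linear function of x, so the posterior is sigmoid(linear in x).
    """
    log_odds = ((mu1 - mu0) / sigma**2) * x \
               - (mu1**2 - mu0**2) / (2 * sigma**2) \
               + np.log(prior1 / (1 - prior1))
    return sigmoid(log_odds)
```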