Machine learning lecture 5 Tommi S Jaakkola MIT AI Lab tommi ai mit edu Topics Classification regression approach to classification elementary decision theory Fisher linear discriminant Generative probabilistic classifiers discriminative classifiers logistic regression Tommi Jaakkola MIT AI Lab 2 Classification Example digit recognition 8x8 binary digits binary digit Tommi Jaakkola MIT AI Lab actual label target label in learning 2 1 2 1 1 0 1 0 3 Classification via regression We ignore the fact that the output is binary e g 0 1 rather than a continuous variable Given a linear regression function f x w w0 w1x1 wdxd we minimize the squared difference between the predicted output continuous and the observed label binary n 1X Jn w yi f xi w 2 2 i 1 How do we classify any new example x Tommi Jaakkola MIT AI Lab 4 Classification via regression cont d f x w w0 w1x1 wdxd Any new test example x can be classified according to label 1 if f x w 0 5 and label 0 otherwise where f x w 0 5 defines the decision boundary Tommi Jaakkola MIT AI Lab 5 Classification via regression cont d This is not optimal 6 14 12 4 10 8 2 6 4 0 2 0 2 2 4 4 2 0 2 4 sometimes good Tommi Jaakkola MIT AI Lab 6 4 5 0 5 10 15 20 sometimes bad 6 A bit of decision theory Suppose we know the distribution of examples in each class i e we know the class conditional densities p x y 0 and p x y 1 How do we decide optimally which class a new example x0 should belong to The optimal decisions in the sense of the lowest possible miss classification error are based on the log likelihood ratio 0 4 class 1 density P x y 1 0 3 class 0 density P x y 0 0 2 p x0 y 1 y 1 if log 0 0 p x y 0 0 1 0 4 3 2 1 0 1 2 3 4 and y 0 otherwise Tommi Jaakkola MIT AI Lab 7 Decision theory cont d When the examples fall more often in one class than another we have to modify the decision rule a bit p x0 y 1 P y 1 y 1 if log 0 0 p x y 0 P y 0 and y 0 otherwise More generally the Bayes optimal decisions are given by y 0 arg max p x0 y P y arg max P y x0 y 0 1 y 0 1 this is optimal only if we have the correct densities and prior frequencies Tommi Jaakkola MIT AI Lab 8 Linear regression and projections Evaluating any linear regression function here in 2D f x w w0 w1x1 w2x2 w0 xT w amounts to projecting the points x x1 x2 T to a line parallel to w 5 f x w 1 f x w 0 5 4 f x w 0 3 2 1 0 1 w 2 3 4 5 5 0 5 Since we are primarily interested in how the points in the two classes are separated by this projection we can temporarily forget the bias offset term w0 Tommi Jaakkola MIT AI Lab 9 Beyond regression we get different levels of By varying the lines or w separation between the classes 5 5 4 4 3 3 2 2 1 1 0 0 1 1 2 2 3 3 4 4 5 5 0 5 5 5 5 5 4 4 3 3 2 2 1 1 0 0 1 1 2 2 3 3 4 4 5 5 Tommi Jaakkola MIT AI Lab 0 5 5 5 0 5 0 5 10 Fisher linear discriminant in the input space such that the We find a direction w projected points become well separated 5 4 3 2 1 0 1 2 3 4 5 5 0 5 Some notation class 0 n0 samples mean 0 covariance 0 class 1 n1 samples mean 1 covariance 1 Tommi Jaakkola MIT AI Lab 11 Fisher linear discriminant cont d that maximizes Estimation criterion we find w Separation of projected means 2 JF isher w Sum of within class variances T 0 2 wT 1 w T n1 1 n0 0 w w 5 The solution 4 3 n1 1 n0 0 1 1 0 w 2 1 0 is Bayes optimal for two normal populations with equal covariances 1 0 Tommi Jaakkola MIT AI Lab 1 2 3 4 5 5 0 5 12 Fisher linear discriminant analysis example Binary digits 1 versus 7 c w This is roughly speaking the elementwise matrix difference w Tommi Jaakkola MIT AI Lab 13 Generative and discriminative classification To further refine our classification approach we can adopt one of two general frameworks 1 Generative model p x y directly build class conditional densities over the multidimensional input examples classify new examples based on the densities 2 Discriminative only model P y x only model decisions given the input examples no model is constructed over the input examples Tommi Jaakkola MIT AI Lab 14 Generative approach to classification We can directly model each class conditional population with a multi variate normal Gaussian distribution x N 1 1 y 1 x N 0 0 y 0 where p x 1 2 d 2 1 2 1 exp x T 1 x 2 and x x1 xd T Tommi Jaakkola MIT AI Lab 15 Mixture classifier decisions Examples x are classified on the basis of which Gaussian explains the data better p x 1 1 log 0 y 1 p x 0 0 0 y 0 or more generally when the classes have different a priori probabilities we use the posterior probability P y 1 x p x 1 1 P y 1 The corresponding decision boundaries are p x 1 1 0 or P y 1 x 0 5 log p x 0 0 Tommi Jaakkola MIT AI Lab 16 Mixture classifier decision rule Equal covariances x N 1 y 1 x N 0 y 0 4 5 4 3 5 3 2 5 2 1 5 1 0 5 0 0 5 2 1 0 1 2 3 4 5 6 7 The decision rule is linear Tommi Jaakkola MIT AI Lab 17 Mixture classifier decision rule Unequal covariances x N 1 1 y 1 x N 0 0 y 0 4 5 4 3 5 3 2 5 2 1 5 1 0 5 0 0 5 4 2 0 2 4 6 8 The decision rule is quadratic Tommi Jaakkola MIT AI Lab 18 Maximum likelihood estimation We can estimate the class conditional densities p x separately For a multivariate Gaussian model 1 1 T 1 p x x x exp p 2 1 2 2 2 the maximum likelihood estimates of the parameters based on a random sample x1 xn are given by the sample mean and sample covariance n X 1 n i 1 Tommi Jaakkola MIT AI Lab xi n X 1 n i 1 xi xi T 19 Discriminative classification If we are …
View Full Document
Unlocking...