MIT 6.867 Machine learning: lecture 5
Tommi S. Jaakkola, MIT, [email protected]

• Classification
  – regression approach to classification
  – elementary decision theory
  – Fisher linear discriminant
  – generative probabilistic classifiers
  – discriminative classifiers: logistic regression

Classification
• Example: digit recognition (8x8 binary digit images)

      binary digit (image)    actual label    target label in learning
      (image of a "2")        "2"             1
      (image of a "2")        "2"             1
      (image of a "1")        "1"             0
      (image of a "1")        "1"             0
      ...                     ...             ...

Classification via regression
• Suppose we ignore the fact that the output is binary (e.g., 0/1) rather than a continuous variable. In this case we can estimate a linear regression function
      f(x; w) = w_0 + w_1 x_1 + ... + w_d x_d
  simply by minimizing the squared difference between the predicted output (continuous) and the observed label (binary):
      J_n(w) = (1/n) sum_{i=1}^n (y_i − f(x_i; w))^2

Classification via regression cont'd
• We can use the resulting regression function
      f(x; ŵ) = ŵ_0 + ŵ_1 x_1 + ... + ŵ_d x_d
  to classify any new (test) example x. For example, it seems sensible to say that
      label = 1 if f(x; ŵ) > 0.5, and label = 0 otherwise,
  where f(x; ŵ) = 0.5 defines the decision boundary.

Classification via regression cont'd
• This is not optimal...
  (Figure: two 2D examples of the resulting decision boundary; sometimes good, sometimes bad.)

A bit of decision theory
• Suppose we know the distribution of examples in each class, i.e., we know the class-conditional densities p(x|y = 0) and p(x|y = 1). How do we decide (optimally) which class a new example x_0 should belong to?
  (Figure: class 0 density p(x|y = 0) and class 1 density p(x|y = 1) plotted over one input dimension.)
  The optimal decisions, in the sense of the lowest possible misclassification error, are based on the log-likelihood ratio:
      y = 1 if log [ p(x_0|y = 1) / p(x_0|y = 0) ] > 0, and y = 0 otherwise.

Decision theory cont'd
• When the examples fall more often in one class than another, we have to modify the decision rule a bit:
      y = 1 if log [ p(x_0|y = 1) P(y = 1) / ( p(x_0|y = 0) P(y = 0) ) ] > 0, and y = 0 otherwise.
• More generally, the Bayes optimal decisions are given by
      y_0 = argmax_{y=0,1} { p(x_0|y) P(y) } = argmax_{y=0,1} { P(y|x_0) }
  (this is optimal only if we have the correct densities and prior frequencies).

Linear regression and projections
• Evaluating any linear regression function (here in 2D)
      f(x; w) = w_0 + w_1 x_1 + w_2 x_2 = w_0 + x^T w̃,   where w̃ = [w_1 w_2]^T,
  amounts to projecting the points x = [x_1 x_2]^T onto a line parallel to w̃.
  (Figure: 2D points projected onto w̃, with the level lines f(x; w) = 0, f(x; w) = 0.5, and f(x; w) = 1.)
  Since we are primarily interested in how the points in the two classes are separated by this projection, we can temporarily forget the bias/offset term w_0.

Beyond regression
• By varying the line (or w̃) we get different levels of separation between the classes.
  (Figure: four scatter plots of the same two classes projected onto different directions.)

Fisher linear discriminant
• We find a direction w̃ in the input space such that the projected points become "well-separated".
  (Figure: two-class 2D scatter plot with a candidate projection direction.)
• Some notation:
      class 0: n_0 samples, mean µ_0, covariance Σ_0
      class 1: n_1 samples, mean µ_1, covariance Σ_1

Fisher linear discriminant cont'd
• Estimation criterion: we find the w̃ that maximizes
      J_Fisher(w̃) = (separation of projected means)^2 / (sum of within-class variances)
                  = (w̃^T µ_1 − w̃^T µ_0)^2 / ( w̃^T (n_1 Σ_1 + n_0 Σ_0) w̃ )
• The solution
      w̃ ∝ (n_1 Σ_1 + n_0 Σ_0)^{-1} (µ_1 − µ_0)
  is Bayes optimal for two normal populations with equal covariances (Σ_1 = Σ_0).
  (Figure: the resulting Fisher direction for the 2D example.)
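As a concrete illustration of the Fisher computation above, here is a minimal NumPy sketch. The helper names (fisher_direction, classify), the toy data, and the choice to threshold at the midpoint of the projected class means are illustrative assumptions rather than anything prescribed by the lecture; the direction itself follows the slide's formula w̃ ∝ (n_1 Σ_1 + n_0 Σ_0)^{-1} (µ_1 − µ_0).

```python
import numpy as np

def fisher_direction(X0, X1):
    """Fisher direction w proportional to (n1*Sigma1 + n0*Sigma0)^{-1} (mu1 - mu0).

    X0, X1 are arrays of shape (n0, d) and (n1, d) holding the class-0 and
    class-1 training examples.
    """
    n0, n1 = len(X0), len(X1)
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    Sigma0 = np.cov(X0, rowvar=False, bias=True)  # maximum likelihood (1/n) covariance
    Sigma1 = np.cov(X1, rowvar=False, bias=True)
    Sw = n1 * Sigma1 + n0 * Sigma0                # sum of within-class scatter
    w = np.linalg.solve(Sw, mu1 - mu0)            # direction, defined only up to scale
    return w, mu0, mu1

def classify(X, w, mu0, mu1):
    """Project onto w and threshold at the midpoint of the projected class means.

    A simple illustrative rule; it matches the Bayes rule only for equal
    covariances and equal class priors.
    """
    threshold = 0.5 * (w @ mu0 + w @ mu1)
    return (X @ w > threshold).astype(int)

# Toy 2D example (hypothetical data, for illustration only).
rng = np.random.default_rng(0)
X0 = rng.normal([-1.0, 0.0], 1.0, size=(100, 2))
X1 = rng.normal([2.0, 1.0], 1.0, size=(100, 2))
w, mu0, mu1 = fisher_direction(X0, X1)
errors = np.concatenate([classify(X0, w, mu0, mu1),       # 1 = class-0 point misclassified
                         1 - classify(X1, w, mu0, mu1)])  # 1 = class-1 point misclassified
print("w:", w, " training error:", errors.mean())
```

Rescaling w does not change which side of the threshold a point falls on, which is why the slide determines w̃ only up to a proportionality constant.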
Fisher linear discriminant analysis: example
• Binary digits "1" versus "7".
  (Figure: the learned direction w̃ displayed as an 8x8 image; roughly speaking, it is the elementwise matrix difference w̃ ≈ ("1" image) − ("7" image).)

Generative and discriminative classification
• To further refine our classification approach we can adopt one of two general frameworks:
  1. Generative (model p(x|y))
     – directly build class-conditional densities over the multi-dimensional input examples
     – classify new examples based on the densities
  2. Discriminative (only model P(y|x))
     – only model decisions given the input examples; no model is constructed over the input examples

Generative approach to classification
• We can directly model each class-conditional population with a multivariate normal (Gaussian) distribution:
      x ~ N(µ_1, Σ_1) for y = 1
      x ~ N(µ_0, Σ_0) for y = 0
  where
      p(x|µ, Σ) = 1 / ( (2π)^{d/2} |Σ|^{1/2} ) exp{ −(1/2) (x − µ)^T Σ^{-1} (x − µ) }
  and x = [x_1, ..., x_d]^T.

Mixture classifier: decisions
• Examples x are classified on the basis of which Gaussian explains the data better:
      y = 1 if log [ p(x|µ_1, Σ_1) / p(x|µ_0, Σ_0) ] > 0, and y = 0 otherwise,
  or, more generally, when the classes have different a priori probabilities, we use the posterior probability
      P(y = 1|x) ∝ p(x|µ_1, Σ_1) P(y = 1).
• The corresponding decision boundaries are given by
      log [ p(x|µ_1, Σ_1) / p(x|µ_0, Σ_0) ] = 0   or   P(y = 1|x) = 0.5.

Mixture classifier: decision rule
• Equal covariances:
      x ~ N(µ_1, Σ) for y = 1,   x ~ N(µ_0, Σ) for y = 0
  (Figure: two Gaussian classes with equal covariances and the resulting boundary.)
• The decision boundary is linear.

Mixture classifier: decision rule
• Unequal covariances:
      x ~ N(µ_1, Σ_1) for y = 1,   x ~ N(µ_0, Σ_0) for y = 0
  (Figure: two Gaussian classes with unequal covariances and the resulting boundary.)
• The decision boundary is quadratic.

Maximum likelihood estimation
• We can estimate the class-conditional densities p(x|µ, Σ) separately. For the multivariate Gaussian model
      p(x|µ, Σ) = 1 / ( (2π)^{d/2} |Σ|^{1/2} ) exp{ −(1/2) (x − µ)^T Σ^{-1} (x − µ) },
  the maximum likelihood estimates of the parameters based on a random sample {x_1, ..., x_n} are given by the sample mean and sample covariance:
      µ̂ = (1/n) sum_{i=1}^n x_i,    Σ̂ = (1/n) sum_{i=1}^n (x_i − µ̂)(x_i − µ̂)^T

Discriminative classification
• If we are only interested in the classification decisions, why should we bother with a model over the input examples?
  (Figure: the two-class Gaussian example from before.)
• We could try to directly estimate the conditional distribution of labels given the examples, P(y|x, θ), where θ = {µ_0, µ_1, Σ_0, Σ_1}.

Back to the Gaussians... (1-dim)
• When the classes are equally likely a priori, the ...
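To make the generative recipe above concrete, here is a minimal NumPy sketch; the names fit_gaussian, log_gaussian, and predict, and the toy data, are illustrative assumptions. It fits each class-conditional Gaussian by maximum likelihood exactly as on the slide (sample mean and 1/n sample covariance) and labels an example y = 1 when log p(x|µ̂_1, Σ̂_1) + log P(y = 1) exceeds log p(x|µ̂_0, Σ̂_0) + log P(y = 0).

```python
import numpy as np

def fit_gaussian(X):
    """Maximum likelihood estimates: sample mean and (1/n) sample covariance."""
    mu = X.mean(axis=0)
    diff = X - mu
    Sigma = diff.T @ diff / len(X)
    return mu, Sigma

def log_gaussian(X, mu, Sigma):
    """log N(x | mu, Sigma) for each row of X, using the density from the slides."""
    d = X.shape[1]
    diff = X - mu
    _, logdet = np.linalg.slogdet(Sigma)
    maha = np.einsum("ij,ij->i", diff @ np.linalg.inv(Sigma), diff)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)

def predict(X, params0, params1, prior1=0.5):
    """y = 1 iff log[p(x|mu1,Sigma1) P(y=1)] - log[p(x|mu0,Sigma0) P(y=0)] > 0."""
    score = (log_gaussian(X, *params1) + np.log(prior1)
             - log_gaussian(X, *params0) - np.log(1.0 - prior1))
    return (score > 0).astype(int)

# Toy 2D data with unequal covariances and unequal class frequencies
# (hypothetical, for illustration only).
rng = np.random.default_rng(0)
X0 = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.3], [0.3, 1.0]], size=200)
X1 = rng.multivariate_normal([2.0, 1.0], [[0.5, 0.0], [0.0, 2.0]], size=100)
params0, params1 = fit_gaussian(X0), fit_gaussian(X1)
prior1 = len(X1) / (len(X0) + len(X1))            # estimated a priori class probability
y_hat = predict(np.vstack([X0, X1]), params0, params1, prior1)
```

Because the two fitted covariances differ here, the score is quadratic in x, matching the quadratic boundary pictured for the unequal-covariance case; with a single shared covariance estimate the score, and hence the boundary, would be linear.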

