MIT 6.867 - Machine learning - lecture notes

Machine learning: lecture 15
Tommi S. Jaakkola, MIT AI Lab ([email protected])

Topics
• Mixture models
  – selecting the number of components
  – non-parametric mixtures
  – conditional mixtures: mixtures of experts

The number of mixture components
• As a simple strategy for selecting the appropriate number of mixture components, we can find the m that minimizes the overall description length

    DL \approx -\log p(\text{data} \mid \hat{\theta}_m) + \frac{d_m}{2} \log(n)

  where n is the number of training points, \hat{\theta}_m are the maximum likelihood parameters for the m-component mixture, and d_m is the (effective) number of parameters in the m-component mixture.
• This asymptotic approximation is also known as BIC (Bayesian Information Criterion).

The number of mixture components: example
• Typical cases
  [Figure: m-component mixture fits to a 2-D data set, m = 1, ..., 4]
  m = 1: -log P(data) = 2017.38, penalty = 14.98, DL = 2032.36
  m = 2: -log P(data) = 1712.69, penalty = 32.95, DL = 1745.65
  m = 3: -log P(data) = 1711.40, penalty = 50.93, DL = 1762.32
  m = 4: -log P(data) = 1682.06, penalty = 68.90, DL = 1750.97

The number of mixture components: example (continued)
• The best cases (out of many runs):
  [Figure: best mixture fits to the same 2-D data set, m = 1, ..., 4]
  m = 1: -log P(data) = 2017.38, penalty = 14.98, DL = 2032.36
  m = 2: -log P(data) = 1712.69, penalty = 32.95, DL = 1745.65
  m = 3: -log P(data) = 1678.56, penalty = 50.93, DL = 1729.49
  m = 4: -log P(data) = 1649.08, penalty = 68.90, DL = 1717.98

Topics
• Mixture models
  – selecting the number of components
  – non-parametric mixtures
  – conditional mixtures: mixtures of experts

Beyond parametric density models
• More mixture densities
• We can approximate almost any distribution by including more and more components in the mixture model

    p(x \mid \theta) = \sum_{j=1}^{m} p_j \, p(x \mid \mu_j, \Sigma_j)

Non-parametric densities
• We can even introduce one mixture component (Gaussian) per training example:

    \hat{p}(x; \sigma^2) = \frac{1}{n} \sum_{i=1}^{n} p(x \mid x_i, \sigma^2 I)

  where n is the number of examples. Here the covariances are all equal and spherical; the single parameter \sigma^2 controls the smoothness of the resulting density estimate.
  [Figure: two plots]

1-dim case: Parzen windows
• We place a smooth Gaussian (or other) bump on each training example:

    \hat{p}_n(x; \sigma) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{\sigma} K\!\left( \frac{x - x_i}{\sigma} \right)

  where the "kernel function" is

    K(z) = \frac{1}{\sqrt{2\pi}} \exp(-z^2 / 2)

  (very different from SVM kernels).
  [Figure: 1-D Parzen estimate with n = 50, \sigma = 0.02]

Parzen windows: variable kernel width
• We can also set the kernel width locally. k-nearest neighbor choice: let d_{ik} be the distance from x_i to its kth nearest neighbor and use

    \hat{p}_n(x; k) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{d_{ik}} K\!\left( \frac{x - x_i}{d_{ik}} \right)

• The estimate is smoother where there are only a few data points.
  [Figure: 1-D variable-width Parzen estimate]

Parzen windows: optimal kernel width
• We still have to set the kernel width \sigma or the number of nearest neighbors k.
• A practical solution: cross-validation. Let \hat{p}_{-i}(x; \sigma) be a Parzen windows density estimate constructed on the basis of the n - 1 training examples obtained by leaving out x_i. We select the \sigma (or, similarly, k) that maximizes the leave-one-out log-likelihood

    CV(\sigma) = \sum_{i=1}^{n} \log \hat{p}_{-i}(x_i; \sigma)
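As a concrete companion to the two Parzen-window slides above, here is a minimal NumPy sketch; it is not part of the original notes, and the function names (parzen_density, loo_log_likelihood) and the toy data are illustrative assumptions. It evaluates the 1-D Gaussian Parzen estimate and the leave-one-out criterion CV(σ), then picks σ on a grid.

```python
import numpy as np

def parzen_density(x_query, x_train, sigma):
    """1-D Gaussian Parzen estimate: (1/n) sum_i (1/sigma) K((x - x_i)/sigma)."""
    z = (x_query[:, None] - x_train[None, :]) / sigma
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)   # K(z) for every (query, example) pair
    return kernels.mean(axis=1) / sigma

def loo_log_likelihood(x_train, sigma):
    """CV(sigma) = sum_i log p_hat_{-i}(x_i; sigma), each point scored without itself."""
    n = len(x_train)
    z = (x_train[:, None] - x_train[None, :]) / sigma
    kernels = np.exp(-0.5 * z**2) / (np.sqrt(2.0 * np.pi) * sigma)
    np.fill_diagonal(kernels, 0.0)            # remove each point's own bump
    p_loo = kernels.sum(axis=1) / (n - 1)     # density built from the other n - 1 examples
    return np.sum(np.log(p_loo))

# Toy data (two clusters) and a grid search over sigma
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-1.0, 0.3, 25), rng.normal(1.0, 0.5, 25)])
sigmas = np.logspace(-2, 0, 30)
best_sigma = max(sigmas, key=lambda s: loo_log_likelihood(x, s))
print(best_sigma)
print(parzen_density(np.array([0.0, 1.0]), x, best_sigma))
```

The only subtlety relative to the slide is zeroing the diagonal of the kernel matrix, which is exactly what "leaving out x_i" means when all n kernels share the same width.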
Parzen windows: multi-dimensional case
• Multi-dimensional Parzen windows estimate:

    \hat{p}_{\text{parzen}}(x) = \frac{1}{n} \sum_{i=1}^{n} p(x \mid x_i, \sigma^2 I)

  where n is the number of examples.
• The covariance matrices are all equal and spherical. The single parameter \sigma^2 controls the smoothness of the density estimate and can be set analogously (e.g., by cross-validation).
  [Figure: two plots]

Topics
• Mixture models
  – selecting the number of components
  – non-parametric mixtures
  – conditional mixtures: mixtures of experts

Conditional mixtures
• Many regression or classification problems can be decomposed into smaller (easier) sub-problems.
• Examples:
  – style in handwritten character recognition
  – dialect/accent in speech recognition
  – etc.
• Each sub-problem could be solved by a specific but relatively simple "expert".
• Unlike in ordinary mixtures, the selection of which expert to rely on must now depend on the context (the input x).

Experts
• Suppose we have several "experts" or component regression models generating conditional Gaussian outputs

    P(y \mid x, \theta_i) = N(y;\; w_i^T x + w_{i0},\; \sigma_i^2)

  where

    \text{mean of } y \text{ given } x = w_i^T x + w_{i0}, \qquad \text{variance of } y \text{ given } x = \sigma_i^2

  and \theta_i = \{ w_i, w_{i0}, \sigma_i^2 \} denotes the parameters of the ith expert.
• We need to find an appropriate way of allocating tasks to these experts (linear regression models).

Mixtures of experts
• Example: [Figure: two plots]
• Here we need a switch or a gating network that selects the appropriate expert (linear regression model) as a function of the input x.

Gating network
• A simple gating network is a probability distribution over the choice of the expert, conditional on the input x.
• Example: in the case of just two experts (say 0 and 1), the gating network can be a logistic regression model

    P(\text{expert} = 1 \mid x, v, v_0) = g(v^T x + v_0)

  where g(z) = (1 + e^{-z})^{-1} is the logistic function.
• When the number of experts m > 2, the gating network can be a softmax model

    P(\text{expert} = j \mid x, \eta) = \frac{\exp(v_j^T x + v_{j0})}{\sum_{j'=1}^{m} \exp(v_{j'}^T x + v_{j'0})}

  where \eta = \{ v_1, \ldots, v_m, v_{10}, \ldots, v_{m0} \} denotes the parameters of the gating network.

Gating network: example divisions
• The softmax gating model

    P(\text{expert} = j \mid x, \eta) = \frac{\exp(v_j^T x + v_{j0})}{\sum_{j'=1}^{m} \exp(v_{j'}^T x + v_{j'0})}

  divides the input space into (soft) regions, one per expert.
  [Figure: example divisions of the input space]

Mixtures of experts model
• The distribution over possible outputs y given an input x is a conditional mixture model

    P(y \mid x, \theta, \eta) = \sum_{j=1}^{m} P(\text{expert} = j \mid x, \eta) \, P(y \mid x, \theta_j)

  where \eta defines the parameters of the gating network (e.g., logistic) and \theta_j are the parameters of each expert (e.g., linear regression model).
  [Figure: gating probabilities P(expert = 1 | x), P(expert = 0 | x) and expert predictive densities P(y | expert = 1, x), P(y | expert = 0, x)]
• The allocation of experts is made conditionally on the input.

Estimation of mixtures of experts
• The estimation would again be easy if we knew which ...
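Since the preview breaks off just as estimation begins, the following minimal NumPy sketch covers only the forward model defined on the last few slides; it is not from the lecture, and the function names and the toy parameter values are assumptions made for illustration. It combines a softmax gating network with linear-Gaussian experts to evaluate P(y | x).

```python
import numpy as np

def gate_probs(x, V, v0):
    """Softmax gating network: P(expert = j | x) with scores v_j^T x + v_j0."""
    scores = V @ x + v0
    scores -= scores.max()          # subtract the max for numerical stability
    e = np.exp(scores)
    return e / e.sum()

def expert_density(y, x, w, w0, sigma2):
    """Conditional Gaussian expert: N(y; w^T x + w0, sigma2)."""
    mean = w @ x + w0
    return np.exp(-0.5 * (y - mean) ** 2 / sigma2) / np.sqrt(2.0 * np.pi * sigma2)

def moe_density(y, x, V, v0, W, w0, sigma2):
    """Conditional mixture: P(y | x) = sum_j P(expert = j | x) P(y | x, theta_j)."""
    g = gate_probs(x, V, v0)
    p = np.array([expert_density(y, x, W[j], w0[j], sigma2[j]) for j in range(len(g))])
    return float(g @ p)

# Toy setup: m = 2 experts on a 1-D input, parameters chosen by hand
V  = np.array([[ 5.0], [-5.0]])    # gating weights v_j
v0 = np.array([ 0.0,    0.0 ])     # gating offsets v_j0
W  = np.array([[ 1.0], [-1.0]])    # expert slopes w_j
w0 = np.array([ 0.0,    2.0 ])     # expert intercepts w_j0
s2 = np.array([ 0.05,   0.05])     # expert noise variances sigma_j^2

x = np.array([0.8])
print(gate_probs(x, V, v0))               # the first expert dominates for positive x
print(moe_density(0.8, x, V, v0, W, w0, s2))
```

Because the gate is itself a function of x, the mixture weights change across the input space: here the first expert explains positive inputs and the second explains negative inputs, with a soft hand-off near x = 0, which is the "switch" role the gating network plays on the slides.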

