Unformatted text preview:

Machine learning lecture 15 Tommi S Jaakkola MIT AI Lab tommi ai mit edu Topics Mixture models selecting the number of components non parametric mixtures conditional mixtures mixtures of experts Tommi Jaakkola MIT CSAIL 2 The number of mixture components As a simple strategy for selecting the appropriate number of mixture components we can find m that minimizes the overall description length dm DL log p data m log n 2 where n is the number of training points m are the maximum likelihood parameters for the m component mixture and dm is the effective number of parameters in the m mixture This asymptotic approximation is also known as BIC Bayesian Information Criterion Tommi Jaakkola MIT CSAIL 3 The number of mixture components example Typical cases 12 12 10 10 8 8 6 6 4 4 2 2 0 0 2 2 4 4 m 1 m 2 m 3 m 4 2 0 2 4 6 8 logP data 2017 38 logP data 1712 69 logP data 1711 40 logP data 1682 06 Tommi Jaakkola MIT CSAIL 4 4 2 0 2 penalty 14 98 penalty 32 95 penalty 50 93 penalty 68 90 4 6 8 DL 2032 36 DL 1745 65 DL 1762 32 DL 1750 97 4 The number of mixture components example The best cases out of many runs 12 12 10 10 8 8 6 6 4 4 2 2 0 0 2 2 4 4 m 1 m 2 m 3 m 4 2 0 2 4 6 8 logP data 2017 38 logP data 1712 69 logP data 1678 56 logP data 1649 08 Tommi Jaakkola MIT CSAIL 4 4 2 0 2 penalty 14 98 penalty 32 95 penalty 50 93 penalty 68 90 4 6 8 DL 2032 36 DL 1745 65 DL 1729 49 DL 1717 98 5 Topics Mixture models selecting the number of components non parametric mixtures conditional mixtures mixtures of experts Tommi Jaakkola MIT CSAIL 6 Beyond parametric density models More mixture densities We can approximate almost any distribution by including more and more components in the mixture model p x m X pj p x j j j 1 Tommi Jaakkola MIT CSAIL 7 Non parametric densities We can even introduce one mixture component Gaussian per training example n X 1 2 p x xi 2 I p x n i 1 where n is the number of examples 1 5 1 0 5 0 0 5 1 5 1 0 5 0 0 5 1 1 5 1 0 5 0 0 5 1 1 5 1 5 1 Here the covariances are all equal and spherical the single parameter 2 controls the smoothness of the resulting density estimate Tommi Jaakkola MIT CSAIL 0 5 0 0 5 1 5 8 1 dim case Parzen windows We place a smooth Gaussian or other bump on each training example n X 1 x xi 1 p n x K where n i 1 0 45 0 4 where the kernel function is 1 K z exp z 2 2 2 very different kernels from SVM 0 35 0 3 0 25 0 2 0 15 0 1 0 05 0 4 3 2 1 0 1 2 3 4 n 50 0 02 Tommi Jaakkola MIT CSAIL 9 Parzen windows variable kernel width We can also set the kernel width locally k nearest neighbor choice let dik be the distance from xi to its k th nearest neighbor n X 1 1 x xi p n x k K n i 1 dik dik 0 7 0 6 0 5 0 4 0 3 0 2 0 1 The estimate is smoother where there are only few data points Tommi Jaakkola MIT CSAIL 0 4 3 2 1 0 1 2 3 4 10 Parzen windows optimal kernel width We still have to set the kernel width or the number of nearest neighbors k A practical solution cross validation Let p i x be a parzen windows density estimate constructed on the basis of n 1 training examples leaving out xi We select or similarly k that maximizes the leave one out log likelihood CV n X log p i xi i 1 Tommi Jaakkola MIT CSAIL 11 Parzen windows multi dimensional case Multi dimensional Parzen windows estimate n X 1 p x xi 2 I p parzen x n i 1 where n is the number of examples The covariance matrices are all equal and spherical The single parameter controls the smoothness of the density estimate and can be set analogously 1 5 1 5 1 1 0 5 0 5 0 0 0 5 1 5 1 0 5 Tommi Jaakkola MIT CSAIL 0 0 5 1 1 5 0 5 1 5 1 0 5 0 0 5 1 1 5 12 Topics Mixture models selecting the number of components non parametric mixtures conditional mixtures mixtures of experts Tommi Jaakkola MIT CSAIL 13 Conditional mixtures Many regression or classification problems decomposed into smaller easier sub problems can be Examples style in handwritten character recognition dialect accent in speech recognition etc Each sub problem could be solved by a specific but relatively simple expert Unlike in ordinary mixtures the selection of which expert to rely on must now depend on the context the input x Tommi Jaakkola MIT CSAIL 14 Experts Suppose we have several experts or component regression models generating conditional Gaussian outputs P y x i N y wiT x wi0 i2 where mean of y given x wiT x wi0 variance of y given x i2 i wi wi0 i2 denotes the parameters of the ith expert We need to find an appropriate way of allocating tasks to these experts linear regression models Tommi Jaakkola MIT CSAIL 15 Mixtures of experts Example 3 5 1 3 0 8 2 5 2 0 6 1 5 0 4 1 0 5 0 2 0 0 5 3 2 1 0 1 2 3 0 3 2 1 0 1 2 3 Here we need a switch or a gating network that selects the appropriate expert linear regression model as a function of the input x Tommi Jaakkola MIT CSAIL 16 Gating network A simple gating network is a probability distribution over the choice of the expert conditional on the input x Example in case of just two experts say 0 and 1 the gating network can be a logistic regression model P expert 1 x v v0 g vT x v0 where g z 1 e z 1 is the logistic function When the number of experts m 2 the gating network can be a softmax model exp vjT x vj0 P expert j x Pm Tx v 0 exp v 0 j0 j 1 j0 where v1 vm v10 vm0 parameters of the gating network Tommi Jaakkola MIT CSAIL denotes the 17 Gating network example divisions exp vjT x vj0 P expert j x Pm Tx v 0 exp v 0 j0 j 1 j0 x x xx x x x x xx x x x x x x x x x x x x x x x x x …


View Full Document

MIT 6 867 - Machine learning - lecture notes

Loading Unlocking...
Login

Join to view Machine learning - lecture notes and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Machine learning - lecture notes and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?