MIT 6.867 - Machine Learning Lecture


Machine learning: lecture 13
Tommi S. Jaakkola, MIT AI Lab, tommi@ai.mit.edu

Topics

- A bit more general view of the EM algorithm: regularized mixtures
- Extensions of mixture models: hierarchical mixture models; conditional mixture models (mixtures of experts)

Mixture models: review

A two-component Gaussian mixture model:

    p(x) = \sum_{j=1,2} p_j \, p(x | \mu_j, \Sigma_j)

where the parameters are \theta = \{p_1, p_2, \mu_1, \mu_2, \Sigma_1, \Sigma_2\}.

Only iterative solutions are available for finding the parameters that maximize the log-likelihood

    l(D; \theta) = \sum_{i=1}^n \log p(x_i | \theta),

where D = \{x_1, \ldots, x_n\}. The estimation involves resolving which mixture component should be responsible for which data point.

The EM algorithm

The EM algorithm finds a local maximum of l(D; \theta).

E-step: evaluate the expected complete log-likelihood

    J(\theta; \theta^t) = \sum_{i=1}^n E_{j \sim P(j | x_i, \theta^t)} \big[ \log\big( p_j \, p(x_i | \mu_j, \Sigma_j) \big) \big]
                        = \sum_{i=1}^n \sum_{j=1,2} P(j | x_i, \theta^t) \, \log\big( p_j \, p(x_i | \mu_j, \Sigma_j) \big)

M-step: find the new parameters by maximizing the expected complete log-likelihood

    \theta^{t+1} = \arg\max_{\theta} J(\theta; \theta^t)

Regularized EM algorithm

To maximize a penalized (regularized) log-likelihood

    l'(D; \theta) = \sum_{i=1}^n \log p(x_i | \theta) + \log p(\theta),

we only need to modify the M-step of the EM algorithm. Specifically, in the M-step we find the \theta that maximizes a penalized expected complete log-likelihood

    J'(\theta; \theta^t) = \sum_{i=1}^n E_{j \sim P(j | x_i, \theta^t)} \big[ \log\big( p_j \, p(x_i | \mu_j, \Sigma_j) \big) \big] + \log p(\theta),

where \log p(\theta) = \log p(p_1, p_2) + \log p(\Sigma_1) + \log p(\Sigma_2); for example, p(p_1, p_2) could be a Dirichlet prior and each p(\Sigma_j) a Wishart prior.

Regularized EM demo

Selecting the number of components

As a simple strategy for selecting the appropriate number of mixture components, we can find the k that minimizes the following asymptotic approximation to the description length:

    DL \approx -\log p(\text{data} | \hat{\theta}_k) + \frac{d_k}{2} \log n,

where n is the number of training points, \hat{\theta}_k is the maximum likelihood parameter estimate for the k-component mixture, and d_k is the (effective) number of parameters in the k-component mixture.

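As a concrete illustration of the E-step/M-step updates and of the description-length criterion, here is a minimal Python/NumPy sketch, assuming a k-component Gaussian mixture with full covariances and a symmetric Dirichlet prior on the mixing proportions as the regularizer (the Wishart prior on the covariances mentioned above is omitted; a small ridge merely keeps the covariances well-conditioned). All function and argument names (em_gmm, description_length, alpha, n_iter) are illustrative, not taken from the lecture.

```python
# A minimal sketch (not the lecture's code): regularized EM for a k-component
# Gaussian mixture plus the description-length criterion for choosing k.
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal


def em_gmm(X, k, n_iter=100, alpha=1.0, seed=0):
    """Fit a k-component Gaussian mixture by EM.

    alpha is a symmetric Dirichlet prior on the mixing proportions p_j
    (alpha = 1 recovers the plain maximum-likelihood M-step).
    """
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(n, size=k, replace=False)]                 # crude initialization
    Sigma = np.stack([np.cov(X.T).reshape(d, d) + 1e-6 * np.eye(d)] * k)
    p = np.full(k, 1.0 / k)

    for _ in range(n_iter):
        # E-step: responsibilities P(j | x_i, theta^t)
        logw = np.column_stack([
            np.log(p[j]) + multivariate_normal.logpdf(X, mu[j], Sigma[j])
            for j in range(k)
        ])
        w = np.exp(logw - logsumexp(logw, axis=1, keepdims=True))

        # M-step: maximize the (penalized) expected complete log-likelihood
        Nj = w.sum(axis=0)
        p = (Nj + alpha - 1.0) / (n + k * (alpha - 1.0))         # MAP under Dirichlet(alpha)
        for j in range(k):
            mu[j] = w[:, j] @ X / Nj[j]
            diff = X - mu[j]
            Sigma[j] = (w[:, j, None] * diff).T @ diff / Nj[j] + 1e-6 * np.eye(d)
    return p, mu, Sigma


def description_length(X, p, mu, Sigma):
    """DL ~ -log p(data | theta_k) + (d_k / 2) log n, counting free mixing
    weights, means, and full covariances as the d_k parameters."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    k = len(p)
    logp = np.column_stack([
        np.log(p[j]) + multivariate_normal.logpdf(X, mu[j], Sigma[j])
        for j in range(k)
    ])
    loglik = logsumexp(logp, axis=1).sum()
    d_k = (k - 1) + k * d + k * d * (d + 1) // 2
    return -loglik + 0.5 * d_k * np.log(n)


# usage: keep the k with the smallest description length
# X = ...  # (n, d) array of training points
# best_k = min(range(1, 6), key=lambda k: description_length(X, *em_gmm(X, k)))
```

Selecting the number of components then amounts to running the fit for each candidate k and keeping the k with the smallest description length, as in the commented usage lines.
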
Extensions: hierarchical mixture models

We have already used hierarchical mixture models in the digit classification problem.

Data generation model (tree): a top-level class choice y with probabilities P(y=1) and P(y=0), a component choice within each class such as P(c=1 | y=1) or P(c=3 | y=0), and a component-specific observation model such as P(x | y=1, c=1) or P(x | y=0, c=3).

It is not necessary for the top-level division to be observable, as it is in this classification context.

Hierarchical mixture models cont'd

To estimate such hierarchical models from data, we have to resolve which leaf (path in the tree) is responsible for generating which data point. Only the E-step needs to be revised: the expectation over assignments is now taken with respect to

    P(y=j, c=k | x) = \underbrace{P(y=j | x)}_{\text{first level}} \; \underbrace{P(c=k | y=j, x)}_{\text{second level}}

For example, for a hierarchical mixture of Gaussians we evaluate

    J(\theta; \theta^t) = \sum_{i=1}^n E_{(j,k) \sim P(j,k | x_i, \theta^t)} \big[ \log\big( p_j \, p_{k|j} \, p(x_i | \mu_{jk}, \Sigma_{jk}) \big) \big],

where p_j and p_{k|j} are the prior selection probabilities.

Hierarchical mixture models cont'd

Arranging the mixture components into a hierarchy is useful only with additional (topological) constraints; the hierarchical mixture as stated previously is otherwise equivalent to a flat mixture. To adequately reveal any hierarchical organization in the data, we have to prime the model to find such structure (initialize parameters similarly within branches, tie parameters, etc.).

(figure: scatter plots of data grouped by first-level branch j ∈ {0, 1} and second-level component k ∈ {1, 2})

Conditional mixtures: mixtures of experts

Many regression or classification problems can be decomposed into smaller (easier) sub-problems. Examples:

1. Dealing with style in handwritten character recognition.
2. Dealing with dialect/accent in speech recognition, etc.

Each sub-problem could be solved by a specific "expert". Unlike in ordinary mixtures, the selection of which expert to rely on must depend on the context, i.e., the input x.

Experts

Suppose we have several experts, or component regression models, generating conditional Gaussian outputs:

    P(y | x, \theta_i) = N(y; \, w_i^T x + w_{i0}, \, \sigma_i^2)

where

    mean of y given x = w_i^T x + w_{i0}
    variance of y given x = \sigma_i^2

and \theta_i = \{w_i, w_{i0}, \sigma_i^2\} denotes the parameters of the i-th expert. We need to find an appropriate way of allocating tasks to these experts (linear regression models).

Mixtures of experts: example

(figure: data following two different linear trends, and the corresponding gating probabilities as a function of x)

Here we need a "switch", or a gating network, that selects the appropriate expert (linear regression model) as a function of the input x.

Gating network

A simple gating network is a probability distribution over the choice of the experts, conditional on the input x. For example, in the case of two experts (0 and 1), the gating network can be a logistic regression model

    P(\text{expert} = 1 | x, v, v_0) = g(v^T x + v_0),

where g(z) = (1 + e^{-z})^{-1} is the logistic function. In the case of m > 2 experts, the gating network can be a softmax model

    P(\text{expert} = j | x, \eta) = \frac{\exp(v_j^T x + v_{j0})}{\sum_{j'=1}^m \exp(v_{j'}^T x + v_{j'0})},

where \eta = \{v_1, \ldots, v_m, v_{10}, \ldots, v_{m0}\} are the parameters of the gating network.

Gating network cont'd

(figure: the softmax gating probabilities P(expert = j | x, \eta) partition the input space, assigning different regions of x to different experts)

Mixtures of experts model

The probability distribution over the regression output y given the input x is a conditional mixture model

    P(y | x) = \sum_{j=1}^m P(\text{expert} = j | x, \eta) \, P(y | x, \theta_j),

where \eta defines the parameters of the gating network (e.g., logistic) and \theta_j are the parameters of each expert (e.g., a linear regression model).

(figure: gating probabilities P(expert = 1 | x), P(expert = 0 | x) and expert predictive distributions P(y | expert = 1, x), P(y | expert = 0, x) for a two-expert model) …
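To make the pieces concrete, here is a minimal Python/NumPy sketch of the model just defined: a softmax gating network over m linear-Gaussian experts, together with the resulting conditional mixture P(y | x) and its predictive mean. All names (gating_probs, V, v0, W, w0, sigma2) and the numbers in the usage comment are illustrative assumptions, not anything taken from the lecture.

```python
# A minimal sketch (not the lecture's code) of a mixture of experts:
# softmax gating over m linear-Gaussian experts.
import numpy as np


def gating_probs(x, V, v0):
    """Softmax gating network: P(expert = j | x) for j = 1..m.
    V has shape (m, d) with rows v_j; v0 has shape (m,)."""
    z = V @ x + v0
    z = z - z.max()                    # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()


def expert_densities(y, x, W, w0, sigma2):
    """Expert outputs P(y | x, theta_j) = N(y; w_j^T x + w_j0, sigma_j^2)."""
    mean = W @ x + w0                  # shape (m,): one predicted mean per expert
    return np.exp(-0.5 * (y - mean) ** 2 / sigma2) / np.sqrt(2.0 * np.pi * sigma2)


def mixture_of_experts_density(y, x, V, v0, W, w0, sigma2):
    """P(y | x) = sum_j P(expert = j | x) P(y | x, theta_j)."""
    return float(gating_probs(x, V, v0) @ expert_densities(y, x, W, w0, sigma2))


def predictive_mean(x, V, v0, W, w0):
    """E[y | x] = sum_j P(expert = j | x) (w_j^T x + w_j0)."""
    return float(gating_probs(x, V, v0) @ (W @ x + w0))


# usage with two made-up 1-d experts (illustrative numbers only):
# V, v0 = np.array([[4.0], [-4.0]]), np.zeros(2)      # gating network
# W, w0 = np.array([[1.0], [-1.0]]), np.zeros(2)      # expert regression weights
# sigma2 = np.array([0.1, 0.1])
# mixture_of_experts_density(0.5, np.array([0.7]), V, v0, W, w0, sigma2)
# predictive_mean(np.array([0.7]), V, v0, W, w0)
```

Fitting \eta and the \theta_j would again proceed by EM, with E-step responsibilities proportional to gating_probs(x, V, v0) * expert_densities(y, x, W, w0, sigma2) for each training pair (x, y).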

