6 867 Machine learning lecture 15 Jaakkola 1 Lecture topics Di erent types of mixture models cont d Estimating mixtures the EM algorithm Mixture models cont d Basic mixture model Mixture models try to capture and resolve observable ambiguities in the data E g an m component Gaussian mixture model P x m P j N x j j 1 j 1 The parameters include the mixing proportions prior distribution P j means of component Gaussians j and covariances j The notation P j is a shorthand for P j j 1 m To generate a sample x from such a model we would rst sample j from the prior distribu tion P j then sample x from the selected component N x j j If we generated n samples then we would get m potentially overlapping clusters of points where each cluster center would correspond to one of the means j and the number of points in the clusters would be approximately n P j This is the type of structure in the data that the mixture model is trying to capture if estimated on the basis of observed x samples Student exam model 1 year We can model vectors of exam scores with mixture models Each x is a vector of scores from a particular student and samples correspond to students We expect that the population of students in a particular year consists of m di erent types e g due to di erences in background If we expect each type to be present with an overall probability P j then each student score is modeled as a mixture P x m P x j P j 2 j 1 where we sum over the types weighted by P j since we don t know what type of student t is prior to seeing their exam score xt Cite as Tommi Jaakkola course materials for 6 867 Machine Learning Fall 2006 MIT OpenCourseWare http ocw mit edu Massachusetts Institute of Technology Downloaded on DD Month YYYY 6 867 Machine learning lecture 15 Jaakkola 2 If there are n students taking the exam a particular year then the likelihood of all the student score vectors D1 x1 xn would be m n n L D1 P xt P x j P j 3 t 1 t 1 j 1 Student exam model K years Suppose now that we have student data from K years of o ering the course In year k we have nk students who took the course Let xk t denote the score vector for a student t in year k Note that t is just an index to identify samples each year and the same index does not imply that the same student took the course multiple years We can now assume that the number of student types as well as P x j remain the same from year to year the parameters j are the same for all years However the population of students may easily change from year to year and thus the prior probabilities over the types have to be set di erently Let P j k denote the prior probabilities over the types in year k all of these would have to be estimated of course Now according to our mixture distribution we expect example scores for students in year k be sampled from P x k m P x j P j k 4 j 1 The likelihood of all the data across K years D D1 DK is given by m nk nk K K L D P xk t k P xk t j P j k k 1 t 1 k 1 t 1 5 j 1 The parameters here include the mixing portions P j k that change from year to year in addition to j Collaborative ltering Mixture models are useful also in recommender systems Suppose we have n users and m movies and our task is to recommend movies for users The users have each rated a small fraction of movies and our task is to ll in the rating matrix i e provide a predicted rating for each user across all m movies Such a prediction task is known as a collaborative ltering problem see Figure 1 Say the possible ratings are rij 1 5 i e how many stars you assign to each movie We will use i to index users and j for movies a rating rij if provided speci es how user Cite as Tommi Jaakkola course materials for 6 867 Machine Learning Fall 2006 MIT OpenCourseWare http ocw mit edu Massachusetts Institute of Technology Downloaded on DD Month YYYY 6 867 Machine learning lecture 15 Jaakkola 3 movies j 3 rij users i 5 1 2 2 5 2 Figure 1 Partially observed rating matrix for a collaborative ltering task i rated movie j Since only a small fraction of movies are rated by each user we need a way to index these elements of the user movie matrix we say i j ID if rating rij is available observed D denotes all the observed ratings We can build on the previous discussion on mixture models We can represent each movie as a distribution over movie types zm 1 Km Similarly a user is represented by a distribution over user types zu 1 Ku We do not assume that each movie corresponds to a single movie type across all users Instead we interpret the distribution over movie types as a bag of features corresponding to the movie and we resample from this bag in the context of each user This is analogous to predicting exam scores for students in a particular year we didn t assume that all the students had the same type We also make the same assumption about user types i e that the type is sampled from the bag for each rating resulting potentially in di erent types for di erent movies Since all the unobserved quantities are summed over we do not explicitly assign any xed movie user type to a rating We imagine generating the rating rij associated with i j element of the rating matrix as follows Sample a movie type from P zm j sample a user type from P zu i then sample a rating rij with probability P rij zu zm All the probabilities involved have to estimated from the available data Note that we will resample movie and user types for each rating The resulting mixture model for rating rij is given by P rij i j Ku Km P rij zu zm P zu i P zm j 6 zu 1 zm 1 where the parameters refer to the mapping from types to ratings P r zu zm and the user and movie speci c distributions P zu i and P zm j respectively The likelihood Cite as Tommi Jaakkola course materials for 6 867 Machine Learning Fall 2006 MIT OpenCourseWare http ocw mit edu Massachusetts Institute of Technology Downloaded on DD Month YYYY 6 867 Machine learning lecture 15 Jaakkola 4 of the observed data D is L D P rij i j 7 i j …
View Full Document
Unlocking...