MIT 6.867 - Lecture Notes

6.867 Machine Learning, lecture 15 (Jaakkola)

Cite as: Tommi Jaakkola, course materials for 6.867 Machine Learning, Fall 2006. MIT OpenCourseWare (http://ocw.mit.edu/), Massachusetts Institute of Technology.

Lecture topics:
• Different types of mixture models (cont'd)
• Estimating mixtures: the EM algorithm

Mixture models (cont'd)

Basic mixture model

Mixture models try to capture and resolve observable ambiguities in the data. For example, an m-component Gaussian mixture model is

P(x; \theta) = \sum_{j=1}^{m} P(j) \, N(x; \mu_j, \Sigma_j)    (1)

The parameters θ include the mixing proportions (prior distribution) {P(j)}, the means of the component Gaussians {μ_j}, and the covariances {Σ_j}. The notation {P(j)} is shorthand for {P(j), j = 1, ..., m}.

To generate a sample x from such a model, we would first sample j from the prior distribution {P(j)}, then sample x from the selected component N(x; μ_j, Σ_j). If we generated n samples, we would get m potentially overlapping clusters of points, where each cluster center would correspond to one of the means μ_j and the number of points in each cluster would be approximately n P(j). This is the type of structure in the data that the mixture model tries to capture when estimated on the basis of observed x samples.

Student exam model: 1 year

We can model vectors of exam scores with mixture models. Each x is a vector of scores from a particular student, and samples correspond to students. We expect that the population of students in a particular year consists of m different types (e.g., due to differences in background). If we expect each type to be present with overall probability P(j), then each student's score vector is modeled as a mixture

P(x | \theta) = \sum_{j=1}^{m} P(x | \theta_j) P(j)    (2)

where we sum over the types (weighted by P(j)) since we do not know the type of student t prior to seeing their exam score x_t.

If there are n students taking the exam in a particular year, then the likelihood of all the student score vectors, D_1 = {x_1, ..., x_n}, is

L(D_1; \theta) = \prod_{t=1}^{n} P(x_t | \theta) = \prod_{t=1}^{n} \sum_{j=1}^{m} P(x_t | \theta_j) P(j)    (3)

Student exam model: K years

Suppose now that we have student data from K years of offering the course. In year k we have n_k students who took the course. Let x_{k,t} denote the score vector for student t in year k. Note that t is just an index to identify samples within each year; the same index does not imply that the same student took the course in multiple years.

We can now assume that the number of student types as well as P(x | θ_j) remain the same from year to year (the parameters θ_j are shared across all years). However, the population of students may easily change from year to year, so the prior probabilities over the types have to be set separately for each year. Let P(j | k) denote the prior probabilities over the types in year k (all of these would have to be estimated, of course). According to our mixture distribution, we then expect the score vectors of students in year k to be sampled from

P(x | k, \theta) = \sum_{j=1}^{m} P(x | \theta_j) P(j | k)    (4)

The likelihood of all the data across the K years, D = {D_1, ..., D_K}, is given by

L(D; \theta) = \prod_{k=1}^{K} \prod_{t=1}^{n_k} P(x_{k,t} | k, \theta) = \prod_{k=1}^{K} \prod_{t=1}^{n_k} \sum_{j=1}^{m} P(x_{k,t} | \theta_j) P(j | k)    (5)

The parameters θ here include the mixing proportions {P(j | k)}, which change from year to year, in addition to {θ_j}.
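Before moving on to collaborative filtering, here is a minimal numerical sketch (not part of the original notes) of the basic mixture model: it samples data according to the generative process behind eq. (1) and evaluates the mixture log-likelihood of eq. (3). All parameter values below are made up for a hypothetical two-dimensional, three-component example.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical 2-D mixture with m = 3 components (made-up parameter values).
mix   = np.array([0.5, 0.3, 0.2])                          # mixing proportions P(j)
means = np.array([[0.0, 0.0], [3.0, 3.0], [-3.0, 2.0]])    # component means mu_j
covs  = np.array([np.eye(2), 0.5 * np.eye(2), np.eye(2)])  # covariances Sigma_j

def sample_mixture(n, rng):
    """Eq. (1) as a generator: draw j from P(j), then x from N(x; mu_j, Sigma_j)."""
    js = rng.choice(len(mix), size=n, p=mix)
    return np.array([rng.multivariate_normal(means[j], covs[j]) for j in js])

def log_likelihood(X):
    """Eq. (3) in log form: sum_t log sum_j P(j) N(x_t; mu_j, Sigma_j)."""
    # dens[t, j] = N(x_t; mu_j, Sigma_j)
    dens = np.column_stack([multivariate_normal(means[j], covs[j]).pdf(X)
                            for j in range(len(mix))])
    return np.sum(np.log(dens @ mix))

rng = np.random.default_rng(0)
X = sample_mixture(500, rng)   # three overlapping clusters of roughly 250/150/100 points
print(log_likelihood(X))       # mixture log-likelihood of the sampled data
```

Changing the mixing proportions changes the cluster sizes roughly in proportion to n P(j), as described above.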
Collaborative filtering

Mixture models are also useful in recommender systems. Suppose we have n users and m movies, and our task is to recommend movies to users. The users have each rated only a small fraction of the movies, and our task is to fill in the rating matrix, i.e., to provide a predicted rating for each user across all m movies. Such a prediction task is known as a collaborative filtering problem (see Figure 1). Say the possible ratings are r_{ij} ∈ {1, ..., 5} (i.e., how many stars you assign to each movie). We use i to index users and j to index movies; a rating r_{ij}, if provided, specifies how user i rated movie j.

Figure 1: Partially observed rating matrix for a collaborative filtering task.

Since only a small fraction of movies are rated by each user, we need a way to index the observed elements of the user/movie matrix: we say (i, j) ∈ ID if rating r_{ij} is available (observed). D denotes all the observed ratings.

We can build on the previous discussion of mixture models. We represent each movie as a distribution over "movie types" z_m ∈ {1, ..., K_m}. Similarly, a user is represented by a distribution over "user types" z_u ∈ {1, ..., K_u}. We do not assume that each movie corresponds to a single movie type across all users. Instead, we interpret the distribution over movie types as a bag of features corresponding to the movie, and we resample from this bag in the context of each user. This is analogous to predicting exam scores for students in a particular year (we did not assume that all the students had the same type). We make the same assumption about user types, i.e., the type is sampled from the "bag" for each rating, potentially resulting in different types for different movies. Since all the unobserved quantities are summed over, we do not explicitly assign any fixed movie/user type to a rating.

We imagine generating the rating r_{ij} associated with the (i, j) element of the rating matrix as follows: sample a movie type from P(z_m | j), sample a user type from P(z_u | i), then sample a rating r_{ij} with probability P(r_{ij} | z_u, z_m). All the probabilities involved have to be estimated from the available data. Note that we resample the movie and user types for each rating. The resulting mixture model for rating r_{ij} is given by

P(r_{ij} | i, j, \theta) = \sum_{z_u=1}^{K_u} \sum_{z_m=1}^{K_m} P(r_{ij} | z_u, z_m) P(z_u | i) P(z_m | j)    (6)

where the parameters θ refer to the mapping from types to ratings {P(r | z_u, z_m)} and the user- and movie-specific distributions {P(z_u | i)} and {P(z_m | j)}, respectively. The likelihood ...
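To make eq. (6) concrete, here is a small sketch (again not from the notes) of how the rating distribution for a (user, movie) pair is obtained by summing out the user and movie types. The probability tables below are random placeholders; in practice they would be estimated from the observed ratings D (e.g., with the EM algorithm that is the other topic of these notes).

```python
import numpy as np

Ku, Km, R = 2, 3, 5              # user types, movie types, number of rating values

rng = np.random.default_rng(1)

def random_dist(*shape):
    """A random conditional probability table, normalized over the last axis."""
    p = rng.random(shape)
    return p / p.sum(axis=-1, keepdims=True)

P_zu_given_i = random_dist(10, Ku)      # P(z_u | i) for 10 hypothetical users
P_zm_given_j = random_dist(20, Km)      # P(z_m | j) for 20 hypothetical movies
P_r_given_z  = random_dist(Ku, Km, R)   # P(r | z_u, z_m)

def rating_distribution(i, j):
    """Eq. (6): P(r | i, j, theta) = sum_{z_u, z_m} P(r | z_u, z_m) P(z_u | i) P(z_m | j)."""
    return np.einsum('u,m,umr->r', P_zu_given_i[i], P_zm_given_j[j], P_r_given_z)

probs = rating_distribution(i=3, j=7)
print(probs, probs.sum())        # a distribution over the 5 star values; sums to 1
```

A predicted rating for an unobserved (i, j) pair could then be taken as, say, the mean or mode of this distribution.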

