Unformatted text preview:

Machine learning lecture 17 Tommi S Jaakkola MIT CSAIL tommi csail mit edu Topics Clustering cont d semi supervised clustering clustering by dynamics Structured probability models hidden Markov models Tommi Jaakkola MIT CSAIL 2 Overview of clustering methods Flat clustering methods e g mixture models k means clustering Hierarchical clustering methods Top down splitting e g hierarchical mixture models Bottom up merging e g hierarchical agglomerative clustering Spectral clustering Semi supervised clustering Clustering by dynamics Etc Tommi Jaakkola MIT CSAIL 3 Semi supervised clustering Let s assume we have some additional relevance information about the examples and we d like clusters to preserve this information as much as possible This paper shows that the accuracy of learned text classifiers can be improved by augmenting a small number of labeled training documents with a large pool of unlabeled This paper shows that the accuracy of learned text classifiers can be improved by augmenting a small number of labeled training documents with a large pool of unlabeled For example by merging together documents we do not wish to loose information about the words they contain word distributions Tommi Jaakkola MIT CSAIL 4 Semi supervised clustering Let s assume we have some additional relevance information about the examples and we d like clusters to preserve this information as much as possible This paper shows that the accuracy of learned text classifiers can be improved by augmenting a small number of labeled training documents with a large pool of unlabeled This paper shows that the accuracy of learned text classifiers can be improved by augmenting a small number of labeled training documents with a large pool of unlabeled For example by merging together documents we do not wish to loose information about the words they contain word distributions xi Training example e g a text document y Relevance variable e g a word P y xi Relevance information e g word distribution Tommi Jaakkola MIT CSAIL 5 Semi supervised clustering cont d We cluster documents xi on the basis of their word distributions P y xi The word distribution for a cluster is the average word distribution of documents in the cluster 1 X P y C P y xi C P y C2 P y C1 i C P y x1 P y x2 P y x3 When merging two clusters we need to take into account their sizes for example if C2 C1 x3 then 1 P y C2 2 P y C1 1 P y x3 2 1 Tommi Jaakkola MIT CSAIL 6 Semi supervised clustering cont d We still need to specify a distance metric to determine which clusters to merge and in what order The distance should reflect how much relevance information we loose by merging the clusters d Ck Cl Ck Cl I y cluster identity X P y Ck Ck P y Ck log P y Ck Ck y P y Ck Cl Cl X y P y Ck P y Cl P y Cl log P y Ck Ck P y Cl Tommi Jaakkola MIT CSAIL 7 Semi supervised clustering example Suppose we have a set of labeled examples x1 y1 xn yn and we take the label as the relevance variable 1 if y yi P y xi 0 otherwise Tommi Jaakkola MIT CSAIL 8 Clustering by dynamics We may wish to cluster time course signals not by direct comparison but in terms of dynamics that governs the signals system behavior monitoring anomaly detection biosequencies processes etc 1 0010011001000101000001000011101101010100 2 0101111110100110101000001000000101011001 3 1101011000000110110010001101111101011101 4 1101010111101011110111101101101101000101 We will use Markov models to capture the dynamics The distance metric for clustering is based on similarity of models Tommi Jaakkola MIT CSAIL 9 Modeling time course signals Full probability model P s1 st P s1 P s2 s1 P s3 s2 s1 P s4 s3 s2 s1 s1 s2 s3 s4 First order Markov model P s1 st P s1 P s2 s1 P s3 s2 P s4 s3 s1 Tommi Jaakkola MIT CSAIL s2 s3 s4 10 Discrete Markov models Representation in terms of variables and dependencies a graphical model P s1 st P s1 P s2 s1 P s3 s2 P s4 s3 s1 s2 s3 s4 Representation in terms of state transitions transition diagram P s2 s1 P s3 s2 s1 s2 s3 P s1 P s1 Tommi Jaakkola MIT CSAIL 11 Discrete Markov models properties The values of each st are known as states 5 5 1 5 5 s1 P 5 5 1 s2 0 1 s3 P1 0 5 0 5 0 5 0 5 0 0 1 0 When successive state transitions are governed by the same one step transition probability matrix P1 st st 1 the Markov model is homogeneous Tommi Jaakkola MIT CSAIL 12 Discrete Markov models properties The values of each st are known as states 5 5 1 5 5 s1 P 5 5 1 s2 0 1 s3 P1 0 5 0 5 0 5 0 5 0 0 1 0 When successive state transitions are governed by the same one step transition probability matrix P1 st st 1 the Markov model is homogeneous Example a language model This is a boring Is a homogeneous Markov model appropriate in this case Tommi Jaakkola MIT CSAIL 13 Discrete Markov models properties The values of each st are known as states 5 5 1 5 5 s1 P 5 5 1 s2 0 1 s3 P1 0 5 0 5 0 5 0 5 0 0 1 0 When successive state transitions are governed by the same one step transition probability matrix P1 st st 1 the Markov model is homogeneous If after k transitions we can get from any state i to any other state j the markov chain is ergodic More precisely the multi step transition probabilities must satisfy Pk s r 0 for all r s and some fixed k Tommi Jaakkola MIT CSAIL 14 Discrete Markov models ML estimation S 1 0010011001000101000001000011101101010100 S 2 0101111110100110101000001000000101011001 40 X X i i i 1 2 log P s1 log P1 st st 1 l S S i 1 2 t 2 ML estimates of the parameters initial state and transition probabilities are based on simple counts n s of times s1 s n r s number of times r s n s P s P 0 n s 0 s n r s P 1 s r P 0 n r s 0 s Tommi Jaakkola MIT CSAIL 15 Simple clustering example cont d Four binary sequences of length 40 1 2 3 4 0010011001000101000001000011101101010100 0101111110100110101000001000000101011001 1101011000000110110010001101111101011101 1101010111101011110111101101101101000101 We can estimate a Markov model based on any subset of the sequences We still need to derive the clustering metric based on the resulting transitition probabilities dynamics Tommi Jaakkola MIT CSAIL 16 Clustering metric To determine a distance between two arbitrary sequences 1 2 2 2 s and S s S 1 s1 s 1 n2 n1 1 we measure how well a Markov model would capture the sequences separately or jointly l S 1 log P S 1 1 l S 2 log P S 2 2 l S 1 S 2 log P S 2 …


View Full Document

MIT 6 867 - Lecture Notes

Loading Unlocking...
Login

Join to view Lecture Notes and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Lecture Notes and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?