MIT 6.867 - Lecture Notes

Machine learning: lecture 17
Tommi S. Jaakkola, MIT CSAIL

• Clustering cont'd
  – semi-supervised clustering
  – clustering by dynamics
• Structured probability models
  – hidden Markov models

Overview of clustering methods

• Flat clustering methods
  – e.g., mixture models, k-means clustering
• Hierarchical clustering methods:
  – Top-down (splitting)
    ∗ e.g., hierarchical mixture models
  – Bottom-up (merging)
    ∗ e.g., hierarchical agglomerative clustering
• Spectral clustering
• Semi-supervised clustering
• Clustering by dynamics
Etc.

Semi-supervised clustering

• Let's assume we have some additional relevance information about the examples and we'd like the clusters to preserve this information as much as possible.

(Figure: two example documents sharing the excerpt "This paper shows that the accuracy of learned text classifiers can be improved by augmenting a small number of labeled training documents with a large pool of unlabeled ...")

For example, by merging together documents we do not wish to lose information about the words they contain (word distributions).

  xi        Training example (e.g., a text document)
  y         Relevance variable (e.g., a word)
  P(y|xi)   Relevance information (e.g., word distribution)

Semi-supervised clustering cont'd

• We cluster documents {xi} on the basis of their word distributions {P(y|xi)}
• The word distribution for a cluster is the average word distribution of the documents in the cluster:

    P(y|C) = (1/|C|) Σ_{i∈C} P(y|xi)

(Figure: word distributions P(y|x1), P(y|x2), P(y|x3) and cluster distributions P(y|C1), P(y|C2).)

• When merging two clusters we need to take into account their sizes: for example, if C2 = C1 ∪ {x3} with C1 = {x1, x2}, then

    P(y|C2) = (1 / (2 + 1)) [ 2 · P(y|C1) + 1 · P(y|x3) ]

Semi-supervised clustering cont'd

• We still need to specify a distance metric to determine which clusters to merge and in what order (a code sketch of the resulting merging procedure follows at the end of this section).
• The distance should reflect how much relevance information we lose by merging the clusters:

    d(Ck, Cl) = (|Ck| + |Cl|) · I(y; cluster identity)
              = |Ck| Σ_y P(y|Ck) log [ P(y|Ck) / P(y|Ck∪Cl) ]
                + |Cl| Σ_y P(y|Cl) log [ P(y|Cl) / P(y|Ck∪Cl) ]

(Figure: distributions P(y|Ck) and P(y|Cl) together with the merged distribution P(y|Ck∪Cl).)

Semi-supervised clustering: example

• Suppose we have a set of labeled examples (x1, y1), . . . , (xn, yn) and we take the label as the relevance variable:

    P(y|xi) = 1, if y = yi
              0, otherwise
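The following is a minimal Python sketch of the greedy bottom-up merging procedure built from the averaging rule and the distance d(Ck, Cl) above. It assumes each document is given as a word-distribution vector; the function names (merged_distribution, merge_cost, agglomerate), the smoothing constant eps, and the stopping criterion n_clusters are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def merged_distribution(P_k, P_l, n_k, n_l):
    # Size-weighted average: P(y|Ck∪Cl) = (n_k P(y|Ck) + n_l P(y|Cl)) / (n_k + n_l)
    return (n_k * P_k + n_l * P_l) / (n_k + n_l)

def merge_cost(P_k, P_l, n_k, n_l, eps=1e-12):
    # d(Ck, Cl) = n_k KL(P(y|Ck) || P(y|Ck∪Cl)) + n_l KL(P(y|Cl) || P(y|Ck∪Cl));
    # eps avoids log(0) for words absent from a cluster (smoothing is an assumption).
    P_m = merged_distribution(P_k, P_l, n_k, n_l)
    kl_k = np.sum(P_k * np.log((P_k + eps) / (P_m + eps)))
    kl_l = np.sum(P_l * np.log((P_l + eps) / (P_m + eps)))
    return n_k * kl_k + n_l * kl_l

def agglomerate(word_dists, n_clusters):
    """Greedily merge the pair of clusters with the smallest merge cost."""
    # Each cluster is (list of member indices, its word distribution P(y|C)).
    clusters = [([i], P.astype(float).copy()) for i, P in enumerate(word_dists)]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                (mem_a, P_a), (mem_b, P_b) = clusters[a], clusters[b]
                cost = merge_cost(P_a, P_b, len(mem_a), len(mem_b))
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        (mem_a, P_a), (mem_b, P_b) = clusters[a], clusters[b]
        merged = (mem_a + mem_b,
                  merged_distribution(P_a, P_b, len(mem_a), len(mem_b)))
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]
    return [members for members, _ in clusters]
```

In the labeled-example case above, where P(y|xi) puts all its mass on the observed label, merging clusters with different labels incurs a large cost, so the procedure tends to prefer label-preserving merges.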
Clustering by dynamics

• We may wish to cluster time course signals not by direct comparison but in terms of the dynamics that govern the signals
  – system behavior monitoring (anomaly detection)
  – biosequences, processes, etc.

  1. 0010011001000101000001000011101101010100
  2. 0101111110100110101000001000000101011001
  3. 1101011000000110110010001101111101011101
  4. 1101010111101011110111101101101101000101

• We will use Markov models to capture the dynamics. The distance metric for clustering is based on the similarity of the models.

Modeling time course signals

• Full probability model:

    P(s1, . . . , st) = P(s1) P(s2|s1) P(s3|s2, s1) P(s4|s3, s2, s1) · · ·

• First-order Markov model:

    P(s1, . . . , st) = P(s1) P(s2|s1) P(s3|s2) P(s4|s3) · · ·

(Figure: graphical models over s1, s2, s3, s4 — each state depends on all previous states in the full model, and only on the previous state in the first-order Markov model.)

Discrete Markov models

• Representation in terms of variables and dependencies (a graphical model):

    P(s1, . . . , st) = P(s1) P(s2|s1) P(s3|s2) P(s4|s3) · · ·

• Representation in terms of state transitions (transition diagram).

(Figure: a chain s1 → s2 → s3 → . . . with initial distribution P(s1) and transition probabilities P(s2|s1), P(s3|s2), . . .)

Discrete Markov models: properties

• The values of each st are known as states.

(Figure: a three-state transition diagram with initial distribution P(·) and one-step transition probabilities P1(·|·).)

• When successive state transitions are governed by the same (one-step) transition probability matrix P1(st|st−1), the Markov model is homogeneous.
• Example: a language model
    This → is → a → boring → . . .
  Is a homogeneous Markov model appropriate in this case?
• If after k transitions we can get from any state i to any other state j, the Markov chain is ergodic. More precisely, the multi-step transition probabilities must satisfy Pk(s|r) > 0 for all r, s, and some fixed k.

Discrete Markov models: ML estimation

  S(1): 0010011001000101000001000011101101010100
  S(2): 0101111110100110101000001000000101011001

    l(S(1), S(2)) = Σ_{i=1,2} [ log P(s^(i)_1) + Σ_{t=2}^{40} log P1(s^(i)_t | s^(i)_{t−1}) ]

• ML estimates of the parameters (initial state and transition probabilities) are based on simple counts:

    n̂(s) = # of times s1 = s
    n̂(r, s) = # of times r → s

    P̂(s) = n̂(s) / Σ_{s'} n̂(s')
    P̂1(s|r) = n̂(r, s) / Σ_{s'} n̂(r, s')

Simple clustering example cont'd

• Four binary sequences of length 40:
  1. 0010011001000101000001000011101101010100
  2. 0101111110100110101000001000000101011001
  3. 1101011000000110110010001101111101011101
  4. 1101010111101011110111101101101101000101
• We can estimate a Markov model based on any subset of the sequences
• We still need to derive the clustering metric based on the resulting transition probabilities (dynamics); a sketch of the count-based estimation appears below.
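Below is a minimal Python sketch of the count-based ML estimates above, applied to the four binary sequences. The lecture has not yet defined its clustering metric at this point, so the final comparison by per-symbol cross log-likelihood is only one illustrative way to compare the fitted models; the names fit_markov and log_likelihood and the smoothing constant alpha are assumptions.

```python
import numpy as np

SEQS = [
    "0010011001000101000001000011101101010100",
    "0101111110100110101000001000000101011001",
    "1101011000000110110010001101111101011101",
    "1101010111101011110111101101101101000101",
]

def fit_markov(seqs, n_states=2, alpha=1e-3):
    """ML estimates P̂(s) and P̂1(s|r) from simple counts.

    alpha is a tiny smoothing constant (an assumption) so that transitions
    never observed in the data do not produce log(0) later on.
    """
    init = np.full(n_states, alpha)               # n̂(s): counts of the initial state
    trans = np.full((n_states, n_states), alpha)  # n̂(r, s): counts of transitions r → s
    for seq in seqs:
        s = [int(c) for c in seq]
        init[s[0]] += 1
        for r, t in zip(s[:-1], s[1:]):
            trans[r, t] += 1
    P0 = init / init.sum()                         # P̂(s)
    P1 = trans / trans.sum(axis=1, keepdims=True)  # P̂1(s|r), rows sum to 1
    return P0, P1

def log_likelihood(seq, P0, P1):
    """log P(s1, ..., st) = log P(s1) + Σ_t log P1(st|st−1)."""
    s = [int(c) for c in seq]
    ll = np.log(P0[s[0]])
    for r, t in zip(s[:-1], s[1:]):
        ll += np.log(P1[r, t])
    return ll

# Fit one model per sequence, then score every sequence under every model;
# similar per-symbol scores suggest similar dynamics (illustrative only).
models = [fit_markov([seq]) for seq in SEQS]
for i, seq in enumerate(SEQS):
    scores = [log_likelihood(seq, P0, P1) / len(seq) for (P0, P1) in models]
    print("sequence", i + 1, np.round(scores, 3))
```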

