Machine learning: lecture 16

Tommi S. Jaakkola
MIT AI Lab
[email protected]

• Structured probability models
  – Markov models
  – Hidden Markov models

Markov chain: review

• A first order (homogeneous) Markov chain:

  (Diagram: a chain of states $s_0 \to s_1 \to s_2 \to \cdots$, annotated with the initial distribution $P_0(s_0)$ and the transition probabilities $P_1(s_t \mid s_{t-1})$.)

• The initial state $s_0$ is drawn from $P_0(s_0)$. Successive states are drawn from the one-step transition probabilities $P_1(s_{t+1} \mid s_t)$.

Markov chain: properties

$$s_0 \to s_1 \to s_2 \to \cdots$$

• If there exists a finite $k$ such that any state $i$ can lead to any other state $j$ after exactly $k$ steps, the Markov chain is ergodic:

$$P(s_{t+k} = j \mid s_t = i) > 0 \quad \text{for all } i, j \text{ and some finite } k$$

Markov chains

• Problems we have to solve
  1. Prediction
  2. Estimation

• Prediction: what is the probability distribution over the possible states $s_{t+k}$ at time $t + k$ if we start from $s_t = i$?

$$P_1(s_{t+1} \mid s_t = i)$$

$$P_2(s_{t+2} \mid s_t = i) = \sum_{s_{t+1}} P_1(s_{t+1} \mid s_t = i)\, P_1(s_{t+2} \mid s_{t+1})$$

$$\cdots$$

$$P_k(s_{t+k} \mid s_t = i) = \sum_{s_{t+k-1}} P_{k-1}(s_{t+k-1} \mid s_t = i)\, P_1(s_{t+k} \mid s_{t+k-1})$$

where $P_k(s' \mid s)$ is the $k$-step transition probability matrix.

Markov chain: estimation

• We need to estimate the initial state distribution $P_0(s_0)$ and the transition probabilities $P_1(s' \mid s)$

• Estimation from $L$ observed sequences of different lengths:

$$s^{(1)}_0 \to s^{(1)}_1 \to s^{(1)}_2 \to \cdots \to s^{(1)}_{n_1}$$
$$\cdots$$
$$s^{(L)}_0 \to s^{(L)}_1 \to s^{(L)}_2 \to \cdots \to s^{(L)}_{n_L}$$

Maximum likelihood estimates (observed fractions):

$$\hat{P}_0(s_0 = i) = \frac{1}{L} \sum_{l=1}^{L} \delta(s^{(l)}_0, i)$$

where $\delta(x, y) = 1$ if $x = y$ and zero otherwise.

Markov chain: estimation cont'd

• The transition probabilities are obtained as observed fractions of transitions out of a specific state.

Joint estimate over successive states:

$$\hat{P}_{s,s'}(s = i, s' = j) = \frac{1}{\sum_{l=1}^{L} n_l} \sum_{l=1}^{L} \sum_{t=0}^{n_l - 1} \delta(s^{(l)}_t, i)\, \delta(s^{(l)}_{t+1}, j)$$

and the transition probability estimates:

$$\hat{P}_1(s' = j \mid s = i) = \frac{\hat{P}_{s,s'}(s = i, s' = j)}{\sum_k \hat{P}_{s,s'}(s = i, s' = k)}$$

Markov chain: estimation cont'd

• Can we simply estimate Markov chains from a single long sequence?

$$s_0 \to s_1 \to s_2 \to \cdots \to s_n$$

  – Ergodicity?
  – What about the initial state distribution $\hat{P}_0(s_0)$?
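To make these estimates concrete, here is a minimal Python sketch (not from the lecture; it assumes states are coded as integers $0, \ldots, m-1$, and the sequences at the bottom are made up for illustration):

```python
import numpy as np

def estimate_markov_chain(sequences, m):
    """Maximum likelihood estimates for a first order Markov chain.

    sequences: list of state sequences, each a list of ints in {0, ..., m-1}.
    Returns (p0, P1) with p0[i] ~ P0(s0 = i) and P1[i, j] ~ P1(s' = j | s = i).
    """
    p0 = np.zeros(m)
    counts = np.zeros((m, m))
    for seq in sequences:
        p0[seq[0]] += 1.0                  # observed fraction of initial states
        for s, s_next in zip(seq[:-1], seq[1:]):
            counts[s, s_next] += 1.0       # transition counts out of each state
    p0 /= len(sequences)
    # row-normalizing the counts is exactly the ratio estimate on the slide;
    # a state never visited as a source would give 0/0, so real code would smooth
    P1 = counts / counts.sum(axis=1, keepdims=True)
    return p0, P1

def k_step_prediction(P1, i, k):
    """Distribution over s_{t+k} given s_t = i: the i-th row of P1^k."""
    return np.linalg.matrix_power(P1, k)[i]

# illustrative example with m = 2 states
seqs = [[0, 0, 1, 0, 1, 1, 0], [1, 1, 0, 0, 1, 0, 1]]
p0, P1 = estimate_markov_chain(seqs, m=2)
print(p0, P1, k_step_prediction(P1, i=0, k=3))
```

The $k$-step recursion on the prediction slide is just repeated matrix multiplication, which is why a single matrix power suffices here.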
Clustering by dynamics

• We can cluster time course signals by comparing their dynamics, where the dynamics is captured by a Markov chain model
  – system behavior monitoring (anomaly detection)
  – biosequences, etc.

• There are still many ways of using Markov chain models for clustering (e.g., what is the clustering metric?)

• The approach we follow here is to derive a criterion for determining whether two (or more) sequences should be in the same cluster

Cluster criterion

• How can we tell whether two arbitrary sequences $S^{(1)} = \{s^{(1)}_0, \ldots, s^{(1)}_{n_1}\}$ and $S^{(2)} = \{s^{(2)}_0, \ldots, s^{(2)}_{n_2}\}$ should be in the same cluster?

• We can compare (approximate) description lengths of either encoding the sequences separately or jointly:

$$DL^{(1)} + DL^{(2)} \;\gtrless\; DL^{(1+2)}$$

where $DL^{(1+2)}$ uses the same Markov chain for both sequences while $DL^{(1)}$ and $DL^{(2)}$ are based on different models. (A code sketch of this criterion appears below, after the HMM problems.)

Cluster criterion cont'd

• Approximate description lengths:

$$DL^{(1)} + DL^{(2)} = -\log P(S^{(1)} \mid \hat{\theta}_1) + \frac{d}{2} \log(n_1) - \log P(S^{(2)} \mid \hat{\theta}_2) + \frac{d}{2} \log(n_2)$$

$$DL^{(1+2)} = -\log P(S^{(1)} \mid \hat{\theta}) - \log P(S^{(2)} \mid \hat{\theta}) + \frac{d}{2} \log(n_1 + n_2)$$

where the maximum likelihood parameter estimates $\hat{\theta}_1$, $\hat{\theta}_2$, and $\hat{\theta}$ each include the initial state distribution and the transition probabilities; $d = 3$ for binary sequences.

• We are essentially testing here whether the two sequences have the same first order Markov dynamics

Simple example

• Four binary sequences of length 50:

  1. 0010011001000101000001000011101101010100...
  2. 0101111110100110101000001000000101011001...
  3. 1101011000000110110010001101111101011101...
  4. 1101010111101011110111101101101101000101...

• Evaluations:

$$DL^{(1)} + DL^{(2)} - DL^{(1+2)} = 6.6 \text{ bits}$$
$$DL^{(1+2)} + DL^{(3+4)} - DL^{(1+2+3+4)} = -0.9 \text{ bits}$$

• Agglomerative hierarchical clustering with Euclidean distance would give (((2, 3), 4), 1)

Beyond Markov chains

• Potential problems with using Markov chains
  – if the state is continuous
  – if we cannot fully determine what the current state is (e.g., due to noisy observations)
  – if the state is an abstraction and never directly observable

• We need to augment the Markov chain with a model that relates the states to observables

Hidden Markov models

• A hidden Markov model (HMM) is a model where we generate a sequence of outputs in addition to the Markov state sequence:

  s_0 → s_1 → s_2 → · · ·
   ↓     ↓     ↓
  x_0   x_1   x_2

  – number of states $m$
  – initial state distribution $P_0(s_0)$
  – state transition model $P_1(s_{t+1} \mid s_t)$
  – output model $P_o(x_t \mid s_t)$ (discrete or continuous)

• This is a latent variable model in the sense that we will only observe the outputs $\{x_0, x_1, \ldots, x_n\}$; the state sequence remains "hidden"

HMM example

• Two states 1 and 2; observations are tosses of unbiased coins

$$P_0(s = 1) = 0.5, \quad P_0(s = 2) = 0.5$$
$$P_1(s' = 1 \mid s = 1) = 0, \quad P_1(s' = 2 \mid s = 1) = 1$$
$$P_1(s' = 1 \mid s = 2) = 0, \quad P_1(s' = 2 \mid s = 2) = 1$$
$$P_o(x = \text{heads} \mid s = 1) = 0.5, \quad P_o(x = \text{tails} \mid s = 1) = 0.5$$
$$P_o(x = \text{heads} \mid s = 2) = 0.5, \quad P_o(x = \text{tails} \mid s = 2) = 0.5$$

  (Diagram: the two-state transition graph.)

• This model is unidentifiable in the sense that the particular hidden state Markov chain has no effect on the observations

HMM example: biased coins

• Two states 1 and 2; outputs are tosses of biased coins

$$P_0(s = 1) = 0.5, \quad P_0(s = 2) = 0.5$$
$$P_1(s' = 1 \mid s = 1) = 0, \quad P_1(s' = 2 \mid s = 1) = 1$$
$$P_1(s' = 1 \mid s = 2) = 0, \quad P_1(s' = 2 \mid s = 2) = 1$$
$$P_o(x = \text{heads} \mid s = 1) = 0.25, \quad P_o(x = \text{tails} \mid s = 1) = 0.75$$
$$P_o(x = \text{heads} \mid s = 2) = 0.75, \quad P_o(x = \text{tails} \mid s = 2) = 0.25$$

  (Diagram: the two-state transition graph.)

• What type of output sequences do we get from this HMM model? (See the sampling sketch below.)

HMM example

• Continuous output model: $x = [x_1, x_2]$, $P_o(x \mid s)$ is a Gaussian with mean and covariance depending on the underlying state $s$. Each state is initially equally likely.

  (Figure: four Gaussian output densities in the plane, one per state 1-4.)

• How does this compare to a mixture of four Gaussians model?

HMM problems

• There are several problems we have to solve
  1. How do we evaluate the probability that our model generated the observation sequence $\{x_0, x_1, \ldots, x_n\}$?
     – forward-backward algorithm
  2. How do we uncover the most likely hidden state sequence corresponding to these observations?
     – dynamic programming
  3. How do we adapt the model parameters to better account for the observations?
     – the EM algorithm
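As promised above, here is a minimal sketch of the description-length cluster criterion for binary sequences (not from the lecture; it fixes conventions the slides leave implicit: base-2 logs so scores come out in bits, $d = 3$, and total sequence length as the penalty argument; the sequences are made up):

```python
import numpy as np

def ml_estimates(seqs, m=2):
    """Maximum likelihood initial distribution and transition matrix."""
    p0 = np.zeros(m)
    counts = np.zeros((m, m))
    for seq in seqs:
        p0[seq[0]] += 1.0
        for s, s_next in zip(seq[:-1], seq[1:]):
            counts[s, s_next] += 1.0
    return p0 / len(seqs), counts / counts.sum(axis=1, keepdims=True)

def description_length(seqs, d=3):
    """-log2 P(data | ML params) + (d/2) log2(total length), in bits."""
    p0, P1 = ml_estimates(seqs)
    neg_ll = 0.0
    for seq in seqs:
        neg_ll -= np.log2(p0[seq[0]])          # cost of the initial state
        for s, s_next in zip(seq[:-1], seq[1:]):
            neg_ll -= np.log2(P1[s, s_next])   # cost of each transition
    n = sum(len(seq) for seq in seqs)
    return neg_ll + 0.5 * d * np.log2(n)       # complexity penalty

# positive score -> encoding separately is cheaper (different clusters);
# negative score -> one shared Markov chain suffices (same cluster)
s1 = [0, 0, 1, 0, 0, 1, 1, 0, 0, 1] * 5
s2 = [0, 1, 1, 1, 1, 0, 1, 1, 1, 0] * 5
score = description_length([s1]) + description_length([s2]) - description_length([s1, s2])
print(f"DL(1) + DL(2) - DL(1+2) = {score:.1f} bits")
```

This is exactly the comparison the slides apply to the four example sequences; merging clusters amounts to calling `description_length` on the pooled set.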

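To answer the biased-coin question by simulation, one can sample from the model; a minimal sketch (states 1 and 2 coded as indices 0 and 1, parameter values copied from the biased-coins slide, output symbols 'H'/'T' are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# biased-coin HMM from the slide (state 1 -> index 0, state 2 -> index 1)
p0 = np.array([0.5, 0.5])                 # initial state distribution P0
P1 = np.array([[0.0, 1.0],                # transition model: both states move to state 2
               [0.0, 1.0]])
Po = np.array([[0.25, 0.75],              # per-state output model: P(heads), P(tails)
               [0.75, 0.25]])

def sample_hmm(n):
    """Generate n outputs from the HMM; only the x's would be observed."""
    s = rng.choice(2, p=p0)
    xs = []
    for _ in range(n):
        xs.append("H" if rng.random() < Po[s, 0] else "T")
        s = rng.choice(2, p=P1[s])        # draw the next hidden state
    return "".join(xs)

# after the first toss the chain sits in state 2, so later tosses are 75% heads
print(sample_hmm(20))
```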

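Problem 1 above is handled by the forward pass of the forward-backward algorithm, which the lecture only names; here is a minimal sketch for a discrete-output HMM (made-up parameters), where `alpha[s]` holds the joint probability of the outputs so far and the current hidden state $s$:

```python
import numpy as np

def forward_likelihood(x, p0, P1, Po):
    """P(x_0, ..., x_n) for a discrete-output HMM via the forward recursion.

    x: observation indices; p0[i] = P0(s0 = i); P1[i, j] = P1(s' = j | s = i);
    Po[i, k] = Po(x = k | s = i).
    """
    alpha = p0 * Po[:, x[0]]              # alpha_0(s) = P0(s) Po(x0 | s)
    for xt in x[1:]:
        # alpha_t(s') = [sum_s alpha_{t-1}(s) P1(s' | s)] Po(xt | s')
        alpha = (alpha @ P1) * Po[:, xt]
    return alpha.sum()                    # marginalize the final hidden state

# illustrative 2-state, 2-symbol model; long sequences would need rescaling
# of alpha to avoid numerical underflow
p0 = np.array([0.5, 0.5])
P1 = np.array([[0.9, 0.1], [0.2, 0.8]])
Po = np.array([[0.8, 0.2], [0.3, 0.7]])
print(forward_likelihood([0, 1, 1, 0], p0, P1, Po))
```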