MIT 6.867 - Machine learning

Machine learning: lecture 15
Tommi S. Jaakkola, MIT AI Lab, [email protected]

• Clustering
  – Markov random walk and spectral clustering
  – semi-supervised clustering
• Structured probability models
  – Markov models
  – Hidden Markov models (next lecture)

Spectral clustering: review

• The spectral clustering method we define relies on a random walk representation over the points. We construct this in three steps:
  1. a nearest neighbor graph
  2. similarity weights on the edges,
       W_{ij} = \exp\{ -\beta \| x_i - x_j \| \}
     where W_{ii} = 1 and the weight is zero for non-edges
  3. a transition probability matrix,
       P_{ij} = W_{ij} / \sum_{j'} W_{ij'}

Properties of the random walk

• If we start from i_0, the distribution over the point i_t that we end up in after t steps is given by
    i_1 \sim P_{i_0 i_1}
    i_2 \sim \sum_{i_1} P_{i_0 i_1} P_{i_1 i_2} = [P^2]_{i_0 i_2}
    i_3 \sim \sum_{i_1} \sum_{i_2} P_{i_0 i_1} P_{i_1 i_2} P_{i_2 i_3} = [P^3]_{i_0 i_3}
    ...
    i_t \sim [P^t]_{i_0 i_t}
  where P^t = P P \cdots P (t matrix products) and [\,\cdot\,]_{ij} denotes the i, j component of the matrix.

Random walk and clustering

• The distributions over the points we end up in after t steps converge as t increases. If the graph is connected, the resulting distribution is independent of the starting point.
• Even for large t, the transition probabilities [P^t]_{ij} remain slightly higher within "clusters" than across them; we want to recover this effect from the eigenvalues/eigenvectors.

Eigenvalues/vectors and spectral clustering

• Let W be the matrix with components W_{ij} and D the diagonal matrix with D_{ii} = \sum_j W_{ij}. Then
    P = D^{-1} W
• To find out how P^t behaves for large t it is useful to examine the eigen-decomposition of the symmetric matrix
    D^{-1/2} W D^{-1/2} = \lambda_1 z_1 z_1^T + \lambda_2 z_2 z_2^T + \cdots + \lambda_n z_n z_n^T
  where the ordering is such that |\lambda_1| \geq |\lambda_2| \geq \cdots \geq |\lambda_n|.

Eigenvalues/vectors cont'd

• The symmetric matrix is related to P^t since
    (D^{-1/2} W D^{-1/2}) \cdots (D^{-1/2} W D^{-1/2}) = D^{1/2} (P \cdots P) D^{-1/2}
  This allows us to write the t-step transition probability matrix in terms of the eigenvalues/vectors of the symmetric matrix:
    P^t = D^{-1/2} \left( D^{-1/2} W D^{-1/2} \right)^t D^{1/2}
        = D^{-1/2} \left( \lambda_1^t z_1 z_1^T + \lambda_2^t z_2 z_2^T + \cdots + \lambda_n^t z_n z_n^T \right) D^{1/2}
  where \lambda_1 = 1 and
    P^{\infty} = D^{-1/2} z_1 z_1^T D^{1/2}

Eigenvalues/vectors and spectral clustering

• We are interested in the largest correction to the asymptotic limit:
    P^t \approx P^{\infty} + D^{-1/2} \left( \lambda_2^t z_2 z_2^T \right) D^{1/2}
  Note: [z_2 z_2^T]_{ij} = z_{2i} z_{2j}, so the largest correction term increases the probability of transitions between points whose components of z_2 have the same sign and decreases the probability of transitions between points with different signs.
• Binary spectral clustering: we divide the points into two clusters based on the sign of the elements of z_2,
    z_{2j} > 0 ⇒ cluster 1, otherwise cluster 0
  (a small numerical sketch of the whole pipeline follows the example below).

Spectral clustering: example

[Figure: two 2-D scatter plots of the example points]

Spectral clustering: example cont'd

[Figure: components of the eigenvector corresponding to the second largest eigenvalue]
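The construction above, from the weighted nearest-neighbor graph to the sign of the second eigenvector of D^{-1/2} W D^{-1/2}, can be sketched in a few lines of NumPy. This is a minimal sketch under assumptions of my own (synthetic 2-D data, k = 5 neighbors, β = 1, and the function name spectral_bipartition), not code from the lecture.

```python
import numpy as np

def spectral_bipartition(X, k=5, beta=1.0):
    """Binary spectral clustering via the random-walk construction.

    X    : (n, d) array of points
    k    : number of nearest neighbors in the graph
    beta : decay rate of the similarity weights
    Returns a length-n array of 0/1 cluster labels (the labelling is arbitrary).
    """
    n = X.shape[0]
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

    # 1. symmetric k-nearest-neighbor graph (no self-edges)
    A = np.zeros((n, n), dtype=bool)
    for i in range(n):
        A[i, np.argsort(dist[i])[1:k + 1]] = True
    A = A | A.T

    # 2. similarity weights W_ij = exp(-beta ||x_i - x_j||) on edges,
    #    W_ii = 1, zero for non-edges
    W = np.where(A, np.exp(-beta * dist), 0.0)
    np.fill_diagonal(W, 1.0)

    # 3. random-walk normalization P = D^{-1} W, analyzed through the
    #    symmetric matrix S = D^{-1/2} W D^{-1/2} (same eigenvalues as P)
    d = W.sum(axis=1)
    S = W / np.sqrt(np.outer(d, d))
    eigvals, eigvecs = np.linalg.eigh(S)

    # z_2: eigenvector with the second-largest |eigenvalue| (lambda_1 = 1)
    z2 = eigvecs[:, np.argsort(np.abs(eigvals))[-2]]

    # binary spectral clustering: split on the sign of z_2
    return (z2 > 0).astype(int)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # two synthetic blobs, loosely in the spirit of the 2-D example above
    X = np.vstack([rng.normal([0.0, 0.0], 0.5, size=(20, 2)),
                   rng.normal([4.0, 3.0], 0.5, size=(20, 2))])
    print(spectral_bipartition(X, k=5, beta=1.0))
```

Raising S (equivalently P) to a large power and inspecting [P^t]_{ij} on such data shows the within-blob probabilities staying slightly above the across-blob ones, which is exactly the within-cluster effect the sign split exploits.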
Semi-supervised clustering

• Let's assume we have some additional relevance information about the examples to be clustered:
    x_i        training example (e.g., a text document)
    y          relevance variable (e.g., a word)
    P(y|x_i)   relevance information (e.g., a word distribution)
  where i = 1, ..., n.
• We wish to cluster the documents into larger groups without losing information about the words contained in the documents: documents with similar word frequencies should be merged into a single cluster.

Semi-supervised clustering cont'd

• We cluster the examples {x_i} on the basis of {P(y|x_i)}, the predictive distributions.
• For any cluster C we define the predictive word distribution based on randomly picking a document in the cluster:
    \hat{P}(y = j | C) = \frac{1}{|C|} \sum_{i \in C} P(y = j | x_i), \qquad \hat{P}(C) = \frac{|C|}{n}
• The distance between clusters measures how much information we lose about the words if the clusters are merged:
    d(C_l, C_k) = \frac{|C_l| + |C_k|}{n} \cdot I(y;\, \text{cluster identity})
  where
    I(y;\, \text{cluster identity}) = \frac{1}{\hat{P}(C_l) + \hat{P}(C_k)}
      \Bigg[ \hat{P}(C_l) \sum_{j=1}^{m} \hat{P}(y = j | C_l) \log \frac{\hat{P}(y = j | C_l)}{\hat{P}(y = j | C_l \cup C_k)}
           + \hat{P}(C_k) \sum_{j=1}^{m} \hat{P}(y = j | C_k) \log \frac{\hat{P}(y = j | C_k)}{\hat{P}(y = j | C_l \cup C_k)} \Bigg]
  and \hat{P}(y | C_l \cup C_k) is the predictive distribution of the merged cluster, obtained from the definition above with C = C_l \cup C_k.

  [Figure: predictive distributions P(y|x_1), P(y|C_1), P(y|C_2), and the merged P(y|C_1 + C_2)]

Semi-supervised clustering: example

• Suppose we have a set of labeled examples (x_1, y_1), ..., (x_n, y_n).
• We can take the label as the relevance variable: P(y|x_i) = 1 if y = y_i and zero otherwise.
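To make the merging cost concrete, the sketch below computes \hat{P}(y|C), the merged distribution, and d(C_l, C_k) from per-document word distributions P(y|x_i). It is an illustration under my own assumptions; the helper names (merge_distance, cluster_predictive) and the toy numbers are not from the lecture.

```python
import numpy as np

def cluster_predictive(P_y_given_x, members):
    """hat{P}(y | C): average of the per-document distributions over the cluster."""
    return P_y_given_x[members].mean(axis=0)

def merge_distance(P_y_given_x, C_l, C_k):
    """d(C_l, C_k): information about y lost if clusters C_l and C_k are merged.

    P_y_given_x : (n, m) array whose row i is P(y | x_i)
    C_l, C_k    : lists of document indices belonging to each cluster
    """
    n = P_y_given_x.shape[0]
    p_l = cluster_predictive(P_y_given_x, C_l)            # hat{P}(y | C_l)
    p_k = cluster_predictive(P_y_given_x, C_k)            # hat{P}(y | C_k)
    w_l, w_k = len(C_l) / n, len(C_k) / n                  # hat{P}(C_l), hat{P}(C_k)

    # merged cluster: document-average over C_l ∪ C_k
    p_merged = (len(C_l) * p_l + len(C_k) * p_k) / (len(C_l) + len(C_k))

    def kl(p, q):
        mask = p > 0                                       # 0 log 0 = 0 convention
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

    # I(y ; cluster identity) for the two-cluster mixture
    mi = (w_l * kl(p_l, p_merged) + w_k * kl(p_k, p_merged)) / (w_l + w_k)
    return (len(C_l) + len(C_k)) / n * mi

if __name__ == "__main__":
    # toy example: 4 documents over a 3-word vocabulary, rows are P(y | x_i)
    P = np.array([[0.7, 0.2, 0.1],
                  [0.6, 0.3, 0.1],
                  [0.1, 0.2, 0.7],
                  [0.2, 0.2, 0.6]])
    print(merge_distance(P, [0, 1], [2, 3]))   # dissimilar clusters: larger cost
    print(merge_distance(P, [0], [1]))         # similar documents: near zero
```

One natural use of this distance (a reading of the slides, not stated explicitly in this preview) is agglomerative clustering: repeatedly merge the pair of clusters with the smallest d(C_l, C_k).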


Topics

• Structured probability models
  – Markov models
  – Hidden Markov models (next lecture)

Markov models

• Often we want to model dynamical systems, not just individual examples:
  1. speech/language processing
  2. human behavior (e.g., user modeling)
  3. modeling physical/biological processes
  4. the stock market, etc.
• We need to define a class of probability models that we can estimate from the observed behavior of the dynamical system.

Markov chain: definition

• We define here a finite state Markov chain (a stochastic finite state machine):
  1. States: s ∈ {1, ..., m}, where m is finite.
  2. Starting state: the starting state s_0 may be fixed or drawn from some a priori distribution P_0(s_0).
  3. Transitions (dynamics): we define how the system transitions from the current state s_t to the next state s_{t+1}. The transitions satisfy the first-order Markov property:
       P(s_{t+1} | \underbrace{s_t, s_{t-1}, \ldots, s_0}_{\text{past}}) = P_1(s_{t+1} | s_t)

Markov chain cont'd

• The resulting stochastic system generates a sequence of states
    s_0 → s_1 → s_2 → ...
  where s_0 is drawn from P_0(s_0) and each s_{t+1} is drawn from the one-step transition probabilities P_1(s_{t+1} | s_t) (a small sampling sketch follows at the end of these notes).
• We can represent the Markov chain as a state transition diagram.
  [Figure: state transition diagram with initial distribution P_0(s) and one-step transition probabilities P_1(s_t | s_{t-1})]

Markov chain: example

• The states s correspond to words in a sentence.
• The transitions are defined in terms of successive words in a sentence.
  Example: a particular realization of the state sequence s_0 → s_1 → ...
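Simulating a finite-state Markov chain follows the definition directly: draw s_0 from P_0 and then repeatedly draw s_{t+1} from the row of the transition matrix selected by the current state s_t. The sketch below is illustrative only; the particular P_0 and P_1 values are invented, not from the lecture.

```python
import numpy as np

def sample_chain(P0, P1, T, rng=None):
    """Sample a state sequence s_0 -> s_1 -> ... -> s_T from a finite Markov chain.

    P0 : (m,) initial distribution P_0(s_0)
    P1 : (m, m) transition matrix with P1[i, j] = P_1(s_{t+1} = j | s_t = i)
    T  : number of transitions to simulate
    """
    rng = np.random.default_rng() if rng is None else rng
    m = len(P0)
    states = np.empty(T + 1, dtype=int)
    states[0] = rng.choice(m, p=P0)                  # s_0 ~ P_0
    for t in range(T):
        # first-order Markov property: s_{t+1} depends only on s_t
        states[t + 1] = rng.choice(m, p=P1[states[t]])
    return states

if __name__ == "__main__":
    # toy 3-state chain (think of the states as three "words"); numbers are made up
    P0 = np.array([1.0, 0.0, 0.0])
    P1 = np.array([[0.1, 0.8, 0.1],
                   [0.2, 0.1, 0.7],
                   [0.5, 0.3, 0.2]])
    print(sample_chain(P0, P1, T=10, rng=np.random.default_rng(0)))
```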
