Machine learning lecture 15 Tommi S Jaakkola MIT AI Lab tommi ai mit edu Topics Clustering Markov random walk and spectral clustering semi supervised clustering Structured probability models Markov models Hidden markov models next lecture Tommi Jaakkola MIT AI Lab 2 Spectral clustering review The spectral clustering method we define relies on a random walk representation over the points We construct this in three steps 1 a nearest neighbor graph 2 similarity weights on the edges Wij exp kxi xj k where Wii 1 and the weight is zero for non edges 6 5 4 3 2 1 3 transition probability matrix X Pij Wij Wij 0 j0 Tommi Jaakkola MIT AI Lab 0 1 2 4 2 0 2 4 6 3 Properties of the random walk If we start from i0 the distribution of points it that we end up in after t steps is given by i1 Pi0 i1 X i2 Pi0 i1 Pi1 i2 P 2 i0 i2 i1 i3 XX i1 Pi0 i1 Pi1 i2 Pi2 i3 P 3 i0 i3 i2 it P t i0 it where P t P P P t matrix products and ij denotes the i j component of the matrix Tommi Jaakkola MIT AI Lab 4 Random walk and clustering The distributions of points we end up in after t steps converge as t increases If the graph is connected the resulting distribution is independent of the starting point Even for large t the transition probabilities P t ij have a slightly higher probability of transitioning within clusters than across we want to recover this effect from eigenvalues vectors 6 5 4 3 2 1 0 1 2 4 Tommi Jaakkola MIT AI Lab 2 0 2 4 6 5 Eigenvalues vectors and spectral clustering Let W be the matrix with components Wij and D a diagonal P matrix such that Dii j Wij Then P D 1W To find out how P t behaves for large t it is useful to examine the eigen decomposition of the following symmetric matrix D 12 WD 21 1z1zT1 2z2zT2 nznzTn where the ordering is such that 1 2 n Tommi Jaakkola MIT AI Lab 6 Eigenvalues vectors cont d The symmetric matrix is related to P t since D 12 WD 21 D 12 WD 21 1 2 D P P D 12 This allows us to write the t step transition probability matrix in terms of the eigenvalues vectors of the symmetric matrix t 1 t 12 21 12 P D D WD D2 1 t T t T t T 12 D 1 z 1 z 1 2 z2 z2 n z n z n D 2 where 1 1 and P Tommi Jaakkola MIT AI Lab D 12 z1zT1 D 1 2 7 Eigenvalues vectors and spectral clustering We are interested in the largest correction to the asymptotic limit 1 t 12 t T 2 z2 z2 D 2 P P D Note z2zT2 ij z2iz2j and thus the largest correction term increases the probability of transitions between points that share the same sign of z2i and decreases transitions across points with different signs Binary spectral clustering we divide the points into clusters based on the sign of the elements of z2 z2j 0 cluster 1 otherwise cluster 0 Tommi Jaakkola MIT AI Lab 8 Spectral clustering example 6 6 5 5 4 4 3 3 2 2 1 1 0 0 1 1 2 3 2 1 0 Tommi Jaakkola MIT AI Lab 1 2 3 4 5 2 4 2 0 2 4 6 9 Spectral clustering example cont d 0 5 0 4 0 3 0 2 0 1 0 0 1 0 2 0 3 0 4 0 5 0 5 10 15 20 25 30 35 40 Components of the eigenvector corresponding to the second largest eigenvalue Tommi Jaakkola MIT AI Lab 10 Semi supervised clustering Let s assume we have some additional relevance information about the examples to be clustered xi Training example e g a text document y Relevance variable e g a word P y xi Relevance information e g word distribution where i 1 n We wish to cluster the documents into larger groups without loosing information about words contained in the documents documents with similar word frequencies should be merged into a single cluster Tommi Jaakkola MIT AI Lab 11 Semi supervised clustering cont d We cluster the examples xi on the basis of P y xi the predictive distributions For any cluster C we define the predictive word distribution based on randomly picking a document in the cluster 1 X P y j C P y j xi C i C C P C n Tommi Jaakkola MIT AI Lab 12 Semi supervised clustering cont d The distance between the clusters measures how much information we loose about the words if the clusters are merged Cl Ck I y cluster identity d Cl Ck n P y C1 C2 P y C2 P y C1 P y x1 Tommi Jaakkola MIT AI Lab 13 Semi supervised clustering cont d The distance between the clusters measures how much information we loose about the words if the clusters are merged Cl Ck d Cl Ck I y cluster identity n where I y cluster identity m X 1 P y j Cl P Cl P y j Cl log P Cl P Ck P y j Cl Ck j 1 m X P y j Ck P Ck P y j Ck log P y j Cl Ck j 1 Tommi Jaakkola MIT AI Lab 14 Semi supervised clustering example Suppose we have x1 y1 xn yn a set of labeled examples We can take the label as the relevance variable P y xi 1 if y yi and zero otherwise Tommi Jaakkola MIT AI Lab 15 Topics Structured probability models Markov models Hidden markov models next lecture Tommi Jaakkola MIT AI Lab 16 Markov models Often we want to model dynamical systems not just individual examples 1 2 3 4 Speech language processing Human behavior e g user modeling Modeling physical biological processes Stock market etc We need to define a class of probability models that we can estimate from observed behavior of the dynamical system Tommi Jaakkola MIT AI Lab 17 Markov chain definition We define here a finite state Markov chain stochastic finite state machine 1 States s 1 m where m is finite 2 Starting state The starting state s0 may be fixed or drawn from some a priori distribution P0 s0 3 Transitions dynamics we define how the system transitions from the current state st to the next state st 1 The transitions satisfy the first order Markov property P st 1 st s t 1 z s 0 P1 st 1 st past Tommi Jaakkola MIT AI Lab 18 Markov chain cont d The resulting stochastic system generates a sequence of states s0 s1 s2 where s0 is drawn from P0 s0 and each st 1 from one step transition probabilities P1 st 1 st We can represent the Markov chain as a state transition diagram P1 st st 1 P0 s Tommi …
View Full Document
Unlocking...