Review
• Supervised vs. unsupervised vs. "other"
• Clustering (for understanding, for compression, or as input to another task)
  – break into "similar" groups
  – what is "similar"?
  – use of spectral embedding
  – mapping back to clusters in original space
[Figure: data, its spectral embedding, and the clusters mapped back to the original space]

Review
• k-means clustering
  – alternating optimization; convergence
  – initialization; multiple restarts; split / merge
• soft k-means
  – mixture of Gaussians model
  – E-step, M-step
  – connection to hard k-means
  – connection to naïve Bayes
  – (un)biasedness
[Figure: fig. 9.5a from Bishop]

Review
• EM algorithm
  – general strategy for MLE or MAP with hidden variables (in our case, Zij)
  – we were in the middle of deriving soft k-means as an EM algorithm

Review: "soft" k-means
• Find soft assignments: qij ∝ pj N(Xi | μj, Σj)
• Update means: μj = Σi qij Xi / Σi qij
• Possibly: update covariances
• Possibly: update cluster weights
• Repeat

Deriving soft k-means
• P(Xi | Zij = 1, θ) = Gaussian(μj, Σj)
• P(Zij = 1 | θ) = pj
• P(Xi, Zi· | θ) = Πj [pj N(Xi | μj, Σj)]^Zij
• L = ln P(X | θ) = Σi ln Σj pj N(Xi | μj, Σj)

f(x, y) = ln(e^x + e^y)
[Figure: surface plot of f(x, y) = ln(e^x + e^y)]

In general
• f(x1, x2, …) = ln(Σi exp(xi))
• for any probability distribution q:
  f(x1, x2, …) ≥ Σi qi xi − Σi qi ln qi
  (equality when qi ∝ exp(xi))

Optimizing a bound
• L(θ) = ln Σz exp(ln P(X, Z = z | θ))
  ≥ Σz qz ln P(X, Z = z | θ) − Σz qz ln qz = L(θ, q), for any distribution q = (qz)
• Maximizing L(θ) is hard
• So, maximize L(θ, q) instead
  – start w/ arbitrary q, max wrt θ
  – then max wrt q to get a tighter bound
  – repeat

The distribution q
• One element qz for each setting of z = (z1, …, zn)
• How many? k^n (one per joint assignment of n points to k clusters)
• Can we work with q efficiently?

Example
[Figure: a sequence of plots of f(x, y) = ln(e^x + e^y) with the linear lower bound for several choices of q, the bound value tightening toward f]

Optimizing: q
• L(θ, q) = Σz qz ln P(X, Z = z | θ) − Σz qz ln qz

For soft k-means
• qz = P(Z = z | X, θ)

Simplifying the bound
• L(θ, q) = Σz qz ln P(X, Z = z | θ) − Σz qz ln qz

Optimizing: μj
• L(θ, q) = Σi Σj qij [ln pj + ln N(Xi | μj, Σj)] + H(q)

Optimizing: pj, Σj
• Won't derive here, but:
  – maxima wrt p and Σ look just like ordinary MLE
  – count each Xi w/ weight Eq(Zij)

The EM algorithm
• Want to maximize L(θ) = ln P(X | θ)
• Hidden variables Z, so that L(θ) = ln Σz P(X, Z = z | θ)
• Use bound: for any distribution q,
  ln Σz exp(ln P(X, Z = z | θ)) ≥ Σz qz ln P(X, Z = z | θ) − Σz qz ln qz

The EM algorithm
• Alternating optimization of L(θ, q) = E_{Z~q}[ln P(X, Z | θ)] + H(q)
  – E-step: q ← P(Z | X, θ) (the tightest bound at the current θ)
  – M-step: θ ← argmax_θ E_{Z~q}[ln P(X, Z | θ)]
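The log-sum-exp bound above is easy to check numerically. A minimal sketch, assuming NumPy (the function and variable names are mine, not from the slides):

    import numpy as np

    def log_sum_exp(x):
        # ln(sum_i exp(x_i)), shifted by max(x) for numerical stability
        m = np.max(x)
        return m + np.log(np.sum(np.exp(x - m)))

    def lower_bound(x, q):
        # E_q[x] + H(q); terms with q_i = 0 contribute nothing
        nz = q > 0
        return np.dot(q[nz], x[nz]) - np.dot(q[nz], np.log(q[nz]))

    x = np.array([1.0, -0.5, 2.0])
    q_arbitrary = np.array([0.2, 0.5, 0.3])
    q_optimal = np.exp(x - log_sum_exp(x))   # q_i proportional to exp(x_i)

    print(log_sum_exp(x))                # ~2.372
    print(lower_bound(x, q_arbitrary))   # ~1.580, strictly below
    print(lower_bound(x, q_optimal))     # matches log_sum_exp(x)

The gap is exactly KL(q ‖ q*), where q* ∝ exp(x), so the bound is tight only at the posterior-like choice of q — the same fact the E-step exploits.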
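Putting the E-step and M-step together for soft k-means, here is a minimal sketch. It assumes NumPy, spherical Gaussians with a fixed shared variance sigma2, and helper names of my own; none of this code comes from the slides:

    import numpy as np

    def soft_kmeans(X, k, sigma2=1.0, n_iters=50, seed=0):
        # EM for a mixture of spherical Gaussians with fixed variance sigma2
        rng = np.random.default_rng(seed)
        n, d = X.shape
        mu = X[rng.choice(n, size=k, replace=False)]   # init means at k random points
        p = np.full(k, 1.0 / k)                        # cluster weights p_j

        for _ in range(n_iters):
            # E-step: q_ij = P(Z_ij = 1 | X_i, theta); constants cancel in the
            # normalization, so only ln p_j - ||X_i - mu_j||^2 / (2 sigma2) matters
            sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # n x k
            log_w = np.log(p)[None, :] - sq / (2.0 * sigma2)
            log_w -= log_w.max(axis=1, keepdims=True)   # log-sum-exp shift
            q = np.exp(log_w)
            q /= q.sum(axis=1, keepdims=True)

            # M-step: weighted MLE -- count each X_i with weight q_ij
            Nj = q.sum(axis=0)               # effective cluster sizes
            mu = (q.T @ X) / Nj[:, None]     # weighted means
            p = Nj / n                       # updated cluster weights
        return mu, p, q

As sigma2 → 0 the posteriors q harden to 0/1 assignments and the updates reduce to ordinary (hard) k-means — the "connection to hard k-means" noted in the review.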
Example: failure times
• You're GE, testing light bulbs to estimate failure rate / lifetime
  – run torture test on 1000 bulbs for 1000 hrs
  – data: 503 bulbs fail at times X1, X2, …, X503
  – 497 bulbs are still going after 1000 hrs
• Or, you're an MD running a 5-year study, estimating the mortality rate due to Emacsitis
  – of 1000 patients, 214 die at times X1, …, X214
  – the remaining 786 are alive at the end of the study

EM for survival analysis
• Hidden data: when would the remaining samples have failed, if we had been patient enough to watch that long?
  – Zi = Xi for failed samples
  – Zi ≥ Xi for remaining (censored) samples
• P(Xi = x | θ) = θe^{−θx} for x ≥ 0

Exponential distribution
• P(X = x | θ) = θe^{−θx} (for x ≥ 0)
[Figure: densities P(Z) for theta = 1 and theta = 3]

Properties of exponential distribution
• E(X | θ) = 1/θ
• P(Z = z | θ, Z ≥ X) = θe^{−θ(z − X)} for z ≥ X (memorylessness)
• E(Z | θ, Z ≥ X) = X + 1/θ

EM algorithm for survival analysis
• E-step: for each censored point, compute E(Zi | Xi) = Xi + 1/θ
• M-step: compute MLE
  – with fully-observed data, MLE is θ = n / Σi Zi
  – with censored data: θ ← n / (Σ_failed Xi + Σ_censored (Xi + 1/θ))

Fixed point
• If there are K censored observations (out of n), EM converges to θ* = (n − K) / Σi Xi, where the sum runs over all n recorded times (failure and censoring times alike)
• Note: it's unusual to have a closed-form expression for the fixed point
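This fixed point is easy to confirm numerically. A minimal sketch, assuming NumPy; the data are simulated for illustration (true rate 1/500) and all names are mine:

    import numpy as np

    def em_censored_exponential(x_fail, x_cens, n_iters=100):
        # EM for exponential failure times with right-censored observations
        x_fail = np.asarray(x_fail, float)
        x_cens = np.asarray(x_cens, float)
        n = len(x_fail) + len(x_cens)
        theta = 1.0                                   # arbitrary starting point
        for _ in range(n_iters):
            # E-step: by memorylessness, E(Z_i | Z_i >= X_i) = X_i + 1/theta
            z_cens = x_cens + 1.0 / theta
            # M-step: MLE on the filled-in data, theta = n / sum_i Z_i
            theta = n / (x_fail.sum() + z_cens.sum())
        return theta

    rng = np.random.default_rng(0)
    lifetimes = rng.exponential(scale=500.0, size=1000)        # true theta = 0.002
    x_fail = lifetimes[lifetimes <= 1000.0]                    # failures seen in test
    x_cens = np.full(int((lifetimes > 1000.0).sum()), 1000.0)  # censored at 1000 hrs

    theta_em = em_censored_exponential(x_fail, x_cens)
    theta_closed = len(x_fail) / (x_fail.sum() + x_cens.sum())  # (n - K) / sum X_i
    print(theta_em, theta_closed)    # the two estimates agree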
More examples of EM
• Regression / classification with missing input values
• Learning parameters of Kalman filters
• Learning params of hidden Markov models
  – "forward-backward", "Baum-Welch"
• Learning parameters of NL…