Review
• Supervised vs. unsupervised vs. “other”
• Clustering (for understanding, for compression, or as input to another task)
  ‣ break into “similar” groups
  ‣ what is “similar”?
  ‣ use of spectral embedding
  ‣ mapping back to clusters in original space
[Figure: a data set and its spectral embedding]

Review
• k-means clustering
  ‣ alternating optimization; convergence
  ‣ initialization; multiple restarts; split / merge
• soft k-means
  ‣ mixture of Gaussians model
  ‣ E-step, M-step
  ‣ relation to hard k-means
  ‣ connection to naïve Bayes
  ‣ (un)biasedness
[Figure: fig 9.5(a) from Bishop]

Review
• EM algorithm
  ‣ general strategy for MLE or MAP with hidden variables (in our case, Zij)
  ‣ we were in the middle of deriving soft k-means as an EM algorithm

Review: “soft” k-means
• Find soft assignments:
  ‣ qij = P(Zij = 1 | Xi, θ) = pj N(Xi | μj, Σj) / Σk pk N(Xi | μk, Σk)
• Update means:
  ‣ μj = Σi qij Xi / Σi qij
• Possibly: update covariances
• Repeat

Deriving soft k-means
‣ P(Xi | Zij = 1, θ) = Gaussian(μj, Σj)
‣ P(Zij = 1 | θ) = pj
‣ P(Xi, Zi· | θ) = Πj [pj N(Xi | μj, Σj)]^Zij
‣ L = ln P(X | θ) = Σi ln Σj pj N(Xi | μj, Σj)

f(x, y) = ln(e^x + e^y)
[Figure: surface and contour plots of f, shown on two consecutive slides]

In general
• f(x1, x2, …) = ln(Σi exp(xi)) ≥ Σi qi xi + H(q), for any probability distribution q
  ‣ here H(q) = −Σi qi ln qi; equality holds at the softmax weights qi = exp(xi) / Σk exp(xk)

Optimizing a bound
• L(θ) = ln Σz exp(ln P(X, Z=z | θ)) ≥ Σz qz ln P(X, Z=z | θ) + H(q) = L(θ, q)
  ‣ for any distribution q = ⟨qz⟩
• Maximizing L(θ) is hard
• So, maximize L(θ, q) instead
  ‣ start w/ arbitrary q, max wrt θ
  ‣ then max wrt q to get a tighter bound
  ‣ repeat

Example
[Figure sequence, seven frames: contours of a two-component objective with its linear lower bound tightening over successive alternating updates; the per-frame legend values did not survive extraction. A numerical check of the bound follows.]
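The check promised above: a minimal NumPy sketch (mine, not from the slides) verifying that ln(e^x + e^y) ≥ q·x + (1−q)·y + H(q) for any distribution (q, 1−q), with equality exactly at the softmax weights q* = e^x / (e^x + e^y).

```python
import numpy as np

def f(x, y):
    # the slides' objective: f(x, y) = ln(e^x + e^y)
    return np.logaddexp(x, y)

def bound(x, y, q):
    # lower bound under the distribution (q, 1-q):
    # q*x + (1-q)*y + H(q), with H the entropy of (q, 1-q)
    q = np.clip(q, 1e-12, 1 - 1e-12)  # keep the logs finite
    H = -(q * np.log(q) + (1 - q) * np.log(1 - q))
    return q * x + (1 - q) * y + H

rng = np.random.default_rng(0)
for x, y in rng.normal(size=(5, 2)):
    q_star = np.exp(x - np.logaddexp(x, y))   # softmax weight on x
    assert np.isclose(bound(x, y, q_star), f(x, y))   # tight at q*
    for q in (0.1, 0.5, 0.9):
        assert bound(x, y, q) <= f(x, y) + 1e-12      # a bound elsewhere
print("ln(e^x + e^y) >= q x + (1-q) y + H(q), tight at the softmax q*")
```

Choosing q* is exactly what the E-step does: it makes the bound touch L(θ) at the current θ, so the next M-step can only improve L(θ).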
Optimizing: q
• L(θ, q) = Σz qz ln P(X, Z=z | θ) − Σz qz ln qz
• maximizing over q (subject to Σz qz = 1) gives qz ∝ P(X, Z=z | θ), i.e. the posterior

For soft k-means
• qz = P(Z=z | X, θ)
  ‣ which factors over examples: qij = P(Zij = 1 | Xi, θ), the soft assignments from before

Simplifying the bound
• L(θ, q) = Σz qz ln P(X, Z=z | θ) − Σz qz ln qz
• substituting the mixture model’s joint P(Xi, Zi· | θ), the first term decomposes over examples i and components j

Optimizing: μ
• L(θ, q) = Σi Σj qij [ln pj + ln N(Xi | μj, Σj)] + H(q)
• setting ∂L/∂μj = 0 gives the weighted-mean update μj = Σi qij Xi / Σi qij

The EM algorithm
• Want to maximize L(θ) = log P(X | θ)
• Hidden variables Z, so that
  ‣ L(θ) = log Σz P(X, Z = z | θ)
• Use bound: for any distribution q,
  ‣ log(Σz exp(ln P(X, Z = z | θ))) ≥ EZ~q[ln P(X, Z | θ)] + H(q) = L(θ, q)

The EM algorithm
• Alternating optimization of L(θ, q) = EZ~q[ln P(X, Z | θ)] + H(q)
  ‣ E-step: q ← P(Z | X, θ), which makes the bound tight at the current θ
  ‣ M-step: θ ← argmaxθ EZ~q[ln P(X, Z | θ)], since H(q) does not depend on θ
  ‣ (a runnable sketch of both steps for soft k-means appears after these notes)

Example: failure times
• You’re GE, testing light bulbs to estimate failure rate / lifetime
  ‣ run torture test on 1000 bulbs for 1000 hrs
  ‣ data: 503 bulbs fail at times X1, X2, …, X503
  ‣ 497 bulbs are still going after 1000 hrs
• Or, you’re an MD running a 5-year study, estimating mortality rate due to Emacsitis
  ‣ of 1000 patients, 214 die at times X1, …, X214
  ‣ remaining 786 are alive at end of study

EM for survival analysis
• Hidden data: when would the remaining samples have failed, if we had been patient enough to watch that long?
  ‣ Zi = Xi for failed samples
  ‣ Zi ≥ Xi for remaining (censored) samples
• P(Xi = x | θ) = θ e^(−θx) for x ≥ 0

Exponential distribution
• P(X = x | θ) = θ e^(−θx) (for x ≥ 0)
[Figure: densities P(Z) vs. Z for θ = 1 and θ = 3]

Properties of exponential distribution
• E(X | θ) = 1/θ
• P(Z = z | θ, Z ≥ X) = θ e^(−θ(z − X)) for z ≥ X (memorylessness: the distribution restarts at X)
• E(Z | θ, Z ≥ X) = X + 1/θ

EM algorithm for survival analysis
• E-step: for each censored point, compute
  ‣ E(Zi | Xi) = Xi + 1/θ
• M-step: compute MLE
  ‣ with fully-observed data, MLE is: θ = N / Σi Zi
  ‣ with censored data: θnew = N / [Σfailed Xi + Σcensored (Xi + 1/θold)]
  ‣ (a code sketch of this loop follows these notes)

Fixed point
• If there are K censored observations out of N, EM converges to θ = (N − K) / Σi Xi, where the sum runs over all N recorded times (failure times plus censoring times): observed failures divided by total time on test
• Note: it’s unusual to have a closed-form expression for the fixed point

More examples of EM
• Regression / classification with missing input values
• Learning parameters of Kalman filters
• Learning params of hidden Markov models
  ‣ “forward-backward”, “Baum-Welch”
• Learning parameters of NL
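The sketch promised on the EM-algorithm slide: soft k-means as alternating E- and M-steps, assuming spherical Gaussian components with a fixed precision β (the slides leave covariance updates optional). The function name, the β parameter, and the initialization scheme are my choices, not the slides’.

```python
import numpy as np

def soft_kmeans(X, k, n_iter=50, beta=1.0, seed=0):
    """EM for soft k-means: a mixture of k spherical Gaussians with fixed
    precision beta, learned mixing weights p_j and means mu_j.
    (Hypothetical helper, not from the slides.)"""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X[rng.choice(n, size=k, replace=False)].copy()  # init at data points
    p = np.full(k, 1.0 / k)                              # mixing weights p_j
    for _ in range(n_iter):
        # E-step: q_ij = P(Z_ij = 1 | X_i, theta), the q that makes
        # the bound L(theta, q) tight at the current theta
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)  # (n, k)
        log_q = np.log(p) - 0.5 * beta * sq   # normalizing constants cancel in j
        log_q -= log_q.max(axis=1, keepdims=True)
        q = np.exp(log_q)
        q /= q.sum(axis=1, keepdims=True)
        # M-step: maximize E_{Z~q}[ln P(X, Z | theta)] wrt p and mu
        Nj = q.sum(axis=0)                    # effective counts per cluster
        p = Nj / n
        mu = (q.T @ X) / Nj[:, None]          # weighted means
    return mu, p, q
```

Letting β → ∞ collapses each row of q to an indicator for the nearest mean, recovering hard k-means; re-estimating per-component covariances in the M-step would turn this into the full mixture-of-Gaussians EM.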
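And the censored-data EM from the survival-analysis slides, as a minimal sketch: the E-step fills in E(Zi | Zi ≥ Xi) = Xi + 1/θ, and the M-step applies the complete-data MLE θ = N / Σi Zi. The function name and the synthetic light-bulb data (shaped like the GE example) are illustrative assumptions, not from the slides.

```python
import numpy as np

def em_censored_exponential(t, censored, n_iter=100):
    """EM for exponential failure times: t[i] is a failure time if
    censored[i] is False, else the time at which observation stopped."""
    t = np.asarray(t, dtype=float)
    censored = np.asarray(censored, dtype=bool)
    n = len(t)
    theta = 1.0 / t.mean()                  # crude initialization
    for _ in range(n_iter):
        # E-step: by memorylessness, E(Z_i | Z_i >= t_i) = t_i + 1/theta
        z = np.where(censored, t + 1.0 / theta, t)
        # M-step: complete-data MLE, theta = n / sum(z)
        theta = n / z.sum()
    return theta

# Synthetic data shaped like the GE example: 1000 bulbs, test ends at 1000 hrs.
rng = np.random.default_rng(1)
true_theta = 1.0 / 1200
lifetimes = rng.exponential(1.0 / true_theta, size=1000)
censored = lifetimes > 1000
t = np.minimum(lifetimes, 1000)
theta_em = em_censored_exponential(t, censored)
# Closed-form fixed point from the slides: (N - K) / sum of all recorded times.
theta_closed = (len(t) - censored.sum()) / t.sum()
print(theta_em, theta_closed)
```

The EM estimate matches the closed-form fixed point (N − K) / Σi Xi, so EM is overkill here; the same fill-in-then-reestimate pattern is what carries over to models where no closed form exists.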