Review
• Multiclass logistic regression
• Priors, conditional MAP logistic regression
• Bayesian logistic regression
  ‣ MAP is not always typical of posterior
  ‣ posterior predictive can avoid overfitting

Review
• Finding posterior predictive distribution often requires numerical integration
  ‣ uniform sampling
  ‣ importance sampling
  ‣ parallel importance sampling
• These are all Monte-Carlo algorithms
  ‣ another well-known MC algorithm coming up

Application: SLAM
Eliazar and Parr, IJCAI-03

Parallel IS
• Pick N samples Xi from proposal Q(X)
• If we knew Wi = P(Xi)/Q(Xi), could do IS
• Instead, set λi = f(Xi)/Q(Xi) and W̄ = (1/N) Σi λi
• Then: Wi ≈ λi / W̄

Parallel IS
• Final estimate: Ê_P(g(X)) = (1/N) Σi (λi/W̄) g(Xi) = Σi λi g(Xi) / Σj λj

Parallel IS is biased
[Figure: 1/mean(weights) and E(mean(weights)) plotted against mean(weights)]
• E(W̄) = Z, but E(1/W̄) ≠ 1/Z in general

[Figure: scatter of (X, Y) samples from the proposal]
Q: (X, Y) ∼ N(1, 1), θ ∼ U(−π, π)
f(x, y, θ) = Q(x, y, θ) P(o = 0.8 | x, y, θ) / Z

Posterior
[Figure: scatter of (X, Y) samples]
E(X, Y, θ) = (0.496, 0.350, 0.084)

SLAM revisited
• Uses a recursive version of parallel importance sampling: particle filter
  ‣ each sample (particle) = trajectory over time
  ‣ sampling extends trajectory by one step
  ‣ recursively update importance weights and renormalize
  ‣ resampling trick to avoid keeping lots of particles with low weights

Particle filter example

Monte-Carlo revisited
• Recall: wanted E_P(g(X)) = ∫ g(x) P(x) dx = ∫ f(x) dx
• Would like to search for areas of high P(x)
• But searching could bias our estimates

Markov-Chain Monte Carlo
• Randomized search procedure
• Produces sequence of RVs X1, X2, …
  ‣ Markov chain: satisfies Markov property
• If P(Xt) small, P(Xt+1) tends to be larger
• As t → ∞, Xt ~ P(X)
• As Δ → ∞, Xt+Δ ⊥ Xt

Markov chain

Stationary distribution

Markov-Chain Monte Carlo
• As t → ∞, Xt ~ P(X); as Δ → ∞, Xt+Δ ⊥ Xt
• For big enough t and Δ, an approximately i.i.d. sample from P(X) is
  ‣ { Xt, Xt+Δ, Xt+2Δ, Xt+3Δ, … }
• Can use i.i.d. sample to estimate E_P(g(X))
• Actually, don't need independence

Metropolis-Hastings
• Way to design chain w/ stationary dist'n P(X)
• Basic strategy: start from arbitrary X
• Repeatedly tweak X to get X'
  ‣ If P(X') ≥ P(X), move to X'
  ‣ If P(X') ≪ P(X), stay at X
  ‣ In intermediate cases, randomize

Proposal distribution
• Left open: what does "tweak" mean?
• Parameter of MH: Q(X' | X)
• Good proposals explore quickly, but remain in regions of high P(X)
• Optimal proposal?

MH algorithm
• Initialize X1 arbitrarily
• For t = 1, 2, …:
  ‣ Sample X' ~ Q(X' | Xt)
  ‣ Compute p = P(X') Q(Xt | X') / [P(Xt) Q(X' | Xt)]
  ‣ With probability min(1, p), set Xt+1 := X'
  ‣ else Xt+1 := Xt
• Note: sequence X1, X2, … will usually contain duplicates

Acceptance rate
• Want acceptance rate (avg p) to be large, so we don't get big runs of the same X
• Want Q(X' | X) to move long distances (to explore quickly)
• Tension between long moves and P(accept)

Mixing rate, mixing time
• If we pick a good proposal, we will move rapidly around domain of P(X)
• After a short time, won't be able to tell where we started
• This is short mixing time = # steps until we can't tell which starting point we used
• Mixing rate = 1 / (mixing time)

MH example
[Figure: f(X, Y) plotted as a surface over X and Y]

MH example
[Figure: scatter of MH samples over (X, Y)]

In example
• g(x) = x²
• True E(g(X)) = 0.28…
• Proposal: Q(x' | x) = N(x' | x, 0.25² I)
• Acceptance rate 55–60%
• After 1000 samples, minus burn-in of 100:
  final estimate 0.282361
  final estimate 0.271167
  final estimate 0.322270
  final estimate 0.306541
  final estimate 0.308716
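To make the "MH algorithm" slide concrete, here is a minimal Python sketch of random-walk MH using the proposal scale, sample count, and burn-in from the example above. The 2-D target density is only a stand-in (the slides show f(X, Y) only as a plot), so the printed estimate will not match the 0.28 figure; all names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(z):
    """Unnormalized 2-D target density.  Stand-in only: the lecture's f(X, Y)
    is shown just as a surface plot, so this ring-shaped density is an assumption."""
    x, y = z
    return np.exp(-10.0 * (x**2 + y**2 - 0.5)**2)

def metropolis_hastings(f, n_samples=1000, burn_in=100, step=0.25):
    """Random-walk MH with Q(x' | x) = N(x' | x, step^2 I), as on the slides."""
    z = np.zeros(2)                      # initialize X1 arbitrarily
    chain = []
    for _ in range(burn_in + n_samples):
        z_prop = z + step * rng.standard_normal(2)   # X' ~ Q(X' | Xt)
        p = f(z_prop) / f(z)             # Q is symmetric, so its terms cancel
        if rng.random() < min(1.0, p):   # accept with probability min(1, p)
            z = z_prop
        chain.append(z)                  # rejected steps repeat Xt (duplicates)
    return np.array(chain[burn_in:])

samples = metropolis_hastings(f)
print("estimate of E(g(X)) for g(x) = x^2:", np.mean(samples[:, 0] ** 2))
```

Because Q is symmetric, its terms in the acceptance ratio cancel, which is why only the ratio of unnormalized densities appears in the loop.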
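Similarly, the parallel importance sampling estimator from the review slides at the top can be sketched on a 1-D toy problem. The target, proposal, and g below are assumptions chosen so the answer is easy to check, not the lecture's (X, Y, θ) example; the mean unnormalized weight W̄ estimates Z, and the self-normalized sum gives the final estimate.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Toy setup (assumed): target P = N(1, 0.5^2), but we only get to evaluate f = Z * P
Z_true = 2.0
P = norm(loc=1.0, scale=0.5)
f = lambda x: Z_true * P.pdf(x)
Q = norm(loc=0.0, scale=2.0)                 # proposal we can sample from
g = lambda x: x ** 2                         # quantity of interest; E_P(g) = 1.25 here

N = 10_000
x = rng.normal(loc=0.0, scale=2.0, size=N)   # N samples Xi ~ Q
lam = f(x) / Q.pdf(x)                        # unnormalized weights: lam_i = f(Xi)/Q(Xi)

W_bar = lam.mean()                           # estimates Z, since E(W̄) = Z
estimate = np.sum(lam * g(x)) / np.sum(lam)  # final estimate: sum(lam_i g(Xi)) / sum(lam_j)

print("W̄ (should be near Z = 2.0):", W_bar)
print("parallel-IS estimate of E_P(g(X)) (should be near 1.25):", estimate)
```

The bias point from the slides shows up here: W̄ is an unbiased estimate of Z, but E(1/W̄) ≠ 1/Z, so the self-normalized estimate is biased for finite N (though it is consistent as N grows).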
Gibbs sampler
• Special case of MH
• Divide X into blocks of r.v.s B(1), B(2), …
• Proposal Q:
  ‣ pick a block i uniformly
  ‣ sample XB(i) ~ P(XB(i) | X¬B(i))
• Useful property: acceptance rate p = 1

Gibbs example
[Figure: scatter of Gibbs samples]

Gibbs example
[Figure: scatter of Gibbs samples]

Gibbs failure example
[Figure: scatter of Gibbs samples]

Relational learning
• Linear regression, logistic regression: attribute-value learning
  ‣ set of i.i.d. samples from P(X, Y)
• Not all data is like this
  ‣ an attribute is a property of a single entity
  ‣ what about properties of sets of entities?

Application: document clustering

Application: recommendations

Latent-variable models

Best-known LVM: PCA
• Suppose Xij, Uik, Vjk all ~ Gaussian
  ‣ yields principal components analysis
  ‣ or probabilistic PCA
  ‣ or Bayesian PCA
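A minimal sketch of the Gibbs proposal from the "Gibbs sampler" slide above, for a target whose block conditionals are available in closed form: a 2-D Gaussian with correlation ρ, one coordinate per block. The target and ρ = 0.9 are assumptions for illustration, not taken from the lecture's figures.

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_2d_gaussian(rho=0.9, n_samples=2000):
    """Gibbs sampling for (X1, X2) ~ N(0, [[1, rho], [rho, 1]]).
    Here P(X1 | X2 = x2) = N(rho * x2, 1 - rho^2), and symmetrically for X2,
    so each block update is an exact draw from its conditional."""
    x = np.zeros(2)
    cond_std = np.sqrt(1.0 - rho**2)
    chain = np.empty((n_samples, 2))
    for t in range(n_samples):
        i = rng.integers(2)                      # pick a block uniformly
        j = 1 - i                                # the other block
        x[i] = rng.normal(rho * x[j], cond_std)  # XB(i) ~ P(XB(i) | X¬B(i))
        chain[t] = x
    return chain

samples = gibbs_2d_gaussian()
print("sample correlation:", np.corrcoef(samples.T)[0, 1])  # ≈ 0.9 for long runs
```

Each update is an exact draw from the conditional, so every proposal is accepted (p = 1); as ρ approaches 1 the per-coordinate moves shrink and mixing slows, which is one way a Gibbs sampler can fail in practice.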
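Finally, a sketch of the matrix-factorization reading of the PCA slide above: Xij ≈ Σk Uik Vjk with Gaussian noise on X and Gaussian priors on U and V. The Gaussian priors act as L2 penalties, so one simple MAP-style fit is alternating ridge regressions; the sizes, noise level, and penalty below are illustrative assumptions, not the lecture's choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from the model: X = U V^T + Gaussian noise (dimensions assumed)
n, m, k = 100, 30, 3
U_true = rng.normal(size=(n, k))
V_true = rng.normal(size=(m, k))
X = U_true @ V_true.T + 0.1 * rng.normal(size=(n, m))

# Gaussian priors on U, V become ridge penalties; alternate closed-form updates.
lam = 0.1                                   # ridge strength (assumed prior scale)
U = rng.normal(size=(n, k))
V = rng.normal(size=(m, k))
I = np.eye(k)
for _ in range(50):
    U = X @ V @ np.linalg.inv(V.T @ V + lam * I)    # ridge regression for U given V
    V = X.T @ U @ np.linalg.inv(U.T @ U + lam * I)  # ridge regression for V given U

print("reconstruction RMSE:", np.sqrt(np.mean((X - U @ V.T) ** 2)))
```

With lam → 0 and Gaussian noise this recovers the same subspace as classical PCA; keeping the penalty corresponds to the probabilistic / MAP variants named on the slide.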