Review
• Linear separability (and use of features)
• Class probabilities for linear discriminants
  ‣ sigmoid (logistic) function
• Applications: USPS, fMRI
[figure from book: class probability over the feature space (φ1, φ2), shaded on a 0–1 scale]

Review
• Generative vs. discriminative
  ‣ maximum conditional likelihood
• Logistic regression
• Weight space
  ‣ each example adds a penalty to all weight vectors that misclassify it
  ‣ the penalty is approximately piecewise linear
[figure]

Example
[figure]

−log P(Y1..3 | X1..3, W)
[figure]

Generalization: multiple classes
• One weight vector per class: Y ∈ {1, 2, …, C}
  ‣ P(Y = k | x) = Z_k / (Z_1 + … + Z_C)
  ‣ Z_k = exp(w_k · x)
• In the 2-class case this reduces to the sigmoid: P(Y = 1 | x) = 1 / (1 + exp(−(w_1 − w_2) · x))

Multiclass example
[figure from book]

Priors and conditional MAP
• P(Y = k | X = x, W) = exp(w_k · x) / Z
  ‣ Z = Σ_j exp(w_j · x)
• As in linear regression, we can put a prior on W
  ‣ common priors: L2 (ridge), L1 (sparsity)
• Conditional MAP: max_w P(W = w | X, Y)

Software
• Logistic regression software is easily available: most stats packages provide it
  ‣ e.g., the glm function in R
  ‣ or http://www.cs.cmu.edu/~ggordon/IRLS-example/
• Most common algorithm: Newton’s method on the log-likelihood (or its L2-penalized version)
  ‣ called “iteratively reweighted least squares” (IRLS)
  ‣ for L1, slightly harder (less software is available)

Historical application: Fisher iris data
[figure: P(I. virginica) as a function of petal length]

Bayesian regression
• In linear and logistic regression, we’ve looked at
  ‣ conditional MLE: max_w P(Y | X, w)
  ‣ conditional MAP: max_w P(W = w | X, Y)
• But of course, a true Bayesian would turn up their nose at both
  ‣ why?

Sample from posterior
[figure]

Predictive distribution
[figure]

Overfitting
• Overfit: training likelihood ≫ test likelihood
  ‣ often a result of overconfidence
• Overfitting is an indicator that the MLE or MAP approximation is a bad one
• Bayesian inference rarely overfits
  ‣ it may still lead to bad results for other reasons!
  ‣ e.g., not enough data, a bad model class, …

So, we want the predictive distribution
• Most of the time…
  ‣ the graphical model is big and highly connected
  ‣ variables are high-arity or continuous
• We can’t afford exact inference
  ‣ inference reduces to numerical integration (and/or summation)
  ‣ we’ll look at randomized algorithms

Numerical integration
[figure: surface plot of f(X, Y)]

2D is 2 easy!
• We care about high-D problems
• Often, much of the mass is hidden in a tiny fraction of the volume
  ‣ we must simultaneously try to discover it and estimate its amount

Application: SLAM

Integrals in multi-million-D
Eliazar and Parr, IJCAI-03

Simple 1D problem
[figure: f(x) on [−1, 1]]

Uniform sampling
[figure]

Uniform sampling
• For X uniform on a region of volume V, E(f(X)) = (1/V) ∫ f(x) dx
• So, V times the sample average of f is the desired integral
• But the standard deviation can be big
• We can reduce it by averaging many samples
• But only at rate 1/√N

Importance sampling
• Instead of X ~ uniform, use X ~ Q(x)
• Q = a proposal distribution that we choose
• Should have Q(x) large where f(x) is large
• Problem: E_Q(f(X)) = ∫ Q(x) f(x) dx, which is no longer the integral we want

Importance sampling
• Define h(x) ≡ f(x) / Q(x); then
  E_Q(h(X)) = ∫ Q(x) h(x) dx = ∫ Q(x) f(x)/Q(x) dx = ∫ f(x) dx

Importance sampling
• So, take samples of h(X) instead of f(X)
• W_i = 1/Q(X_i) is the importance weight
• Q = 1/V yields uniform sampling

Importance sampling
[figure]

Variance
• How does this help us control variance?
• Suppose f is big where Q is small
• Then h = f/Q is huge there
• So the variance of each weighted sample is large
• Optimal Q? (Q ∝ |f| minimizes the variance)
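To make the uniform-sampling and importance-sampling estimators above concrete, here is a minimal Python sketch (not from the slides; the peaked integrand and the Gaussian proposal are invented for illustration). It estimates the same integral both ways and compares the spread of the two estimators.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented peaked integrand supported on [-1, 1]; true integral ~ 70 * sqrt(pi/200).
def f(x):
    return 70.0 * np.exp(-200.0 * (x - 0.5) ** 2) * (np.abs(x) <= 1.0)

def uniform_estimate(n):
    # X ~ U[-1, 1], volume V = 2: V * mean(f(X)) estimates the integral.
    x = rng.uniform(-1.0, 1.0, size=n)
    return 2.0 * f(x).mean()

def importance_estimate(n, mu=0.5, sd=0.1):
    # X ~ Q = N(mu, sd^2), concentrated where f is large;
    # weight each sample by 1/Q(X), i.e., average h(X) = f(X)/Q(X).
    x = rng.normal(mu, sd, size=n)
    q = np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))
    return (f(x) / q).mean()

truth = 70.0 * np.sqrt(np.pi / 200.0)
unif = [uniform_estimate(1000) for _ in range(200)]
imp = [importance_estimate(1000) for _ in range(200)]
print("true integral      :", truth)
print("uniform sampling   : mean %.3f, std %.3f" % (np.mean(unif), np.std(unif)))
print("importance sampling: mean %.3f, std %.3f" % (np.mean(imp), np.std(imp)))
```

With the same number of samples, the importance-sampling estimate has a noticeably smaller standard deviation because the proposal puts its samples where f is large.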
Importance sampling, part II
• Suppose we want ∫ f(x) dx = ∫ P(x) g(x) dx = E_P(g(X))
• Pick N samples X_i from a proposal Q(X)
• Average W_i g(X_i), where the importance weight is
  ‣ W_i = P(X_i) / Q(X_i)
• This works because E_Q(W g(X)) = ∫ Q(x) [P(x)/Q(x)] g(x) dx = ∫ P(x) g(x) dx

Two variants of IS
• Same algorithm, different terminology
  ‣ want ∫ f(x) dx vs. E_P(f(X))
  ‣ W = 1/Q vs. W = P/Q

Parallel importance sampling
• Suppose we want ∫ f(x) dx = ∫ P(x) g(x) dx = E_P(g(X))
• But P(x) is unnormalized (e.g., represented by a factor graph); we know only Z P(x)

Parallel IS
• Pick N samples X_i from the proposal Q(X)
• If we knew W_i = P(X_i) / Q(X_i), we could do IS
• Instead, set W_i = Z P(X_i) / Q(X_i), which we can compute from the unnormalized distribution
  ‣ and W̄ = (1/N) Σ_i W_i
• Then: E_Q(W̄) = Z

Parallel IS
• Final estimate: Σ_i W_i g(X_i) / Σ_i W_i (normalize by the sum of the weights instead of by the unknown Z)

Parallel IS is biased
[figure: mean(weights) vs. 1 / mean(weights)]
• E(W̄) = Z, but E(1/W̄) ≠ 1/Z in general

[figure]
• Q: (X, Y) ∼ N(1, 1), θ ∼ U(−π, π)
• f(x, y, θ) = Q(x, y, θ) P(o = 0.8 | x, y, θ) / Z

Posterior
[figure]
• E(X, Y, θ) = (0.496, 0.350, 0.084)

SLAM revisited
• Uses a recursive version of parallel importance sampling: the particle filter
  ‣ each sample (particle) = a trajectory over time
  ‣ sampling extends the trajectory by one step
  ‣ recursively update the importance weights and renormalize
  ‣ a resampling trick avoids keeping lots of particles with low weights

Particle filter
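Before the particle filter itself, here is a minimal Python sketch of the parallel (self-normalized) importance-sampling estimator described above; the unnormalized target, the proposal, and the test function g are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Unnormalized target Z * P(x), with P = N(3, 1). We pretend we can evaluate
# only this product (as with a factor graph), not P or Z separately.
def p_unnorm(x):
    return 5.0 * np.exp(-0.5 * (x - 3.0) ** 2)      # here Z = 5 * sqrt(2*pi)

def g(x):                                            # we want E_P(g(X)) = 3
    return x

# Proposal Q = N(0, 3^2): easy to sample from and to evaluate.
mu_q, sd_q = 0.0, 3.0
def q_pdf(x):
    return np.exp(-0.5 * ((x - mu_q) / sd_q) ** 2) / (sd_q * np.sqrt(2.0 * np.pi))

N = 100_000
x = rng.normal(mu_q, sd_q, size=N)

w = p_unnorm(x) / q_pdf(x)                 # computable weights: Z * P(x_i) / Q(x_i)
estimate = np.sum(w * g(x)) / np.sum(w)    # self-normalized ("parallel IS") estimate

print("E_P(g(X)) ~", estimate)                               # close to 3
print("mean(weights) ~", w.mean(), "vs Z =", 5.0 * np.sqrt(2.0 * np.pi))
```

Dividing by the sum of the weights plays the role of the unknown Z, which is exactly why the estimator is biased for finite N even though mean(weights) is an unbiased estimate of Z.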
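Finally, a minimal sketch of one particle-filter update, matching the bullets on the SLAM slide: extend each particle by one step, reweight by the observation likelihood, renormalize, and resample. The 1-D random-walk motion model, the Gaussian observation model, and the observation sequence are all invented for illustration, and each particle here carries only its current state rather than a full trajectory.

```python
import numpy as np

rng = np.random.default_rng(2)

def particle_filter_step(particles, observation, motion_sd=0.1, obs_sd=0.5):
    """One bootstrap-particle-filter update: propose, weight, resample."""
    # 1. Sampling extends each trajectory by one step (proposal = motion model).
    particles = particles + rng.normal(0.0, motion_sd, size=particles.shape)

    # 2. Importance weights from the observation likelihood, then renormalize;
    #    this is the parallel-IS step, since the posterior is known only up to Z.
    w = np.exp(-0.5 * ((observation - particles) / obs_sd) ** 2)
    w /= w.sum()

    # 3. Resampling trick: redraw particles in proportion to their weights so we
    #    do not carry along many particles with negligible weight.
    idx = rng.choice(len(particles), size=len(particles), p=w)
    return particles[idx]

# Toy run: 1-D robot position with noisy position observations.
particles = rng.normal(0.0, 1.0, size=1000)      # initial belief
for z in [0.2, 0.4, 0.7, 1.0]:                   # made-up observation sequence
    particles = particle_filter_step(particles, z)
print("posterior mean position ~", particles.mean())
```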