CMU CS 10601 - review



Review
• Linear separability (and use of features)
• Class probabilities for linear discriminants
  ‣ sigmoid (logistic) function
• Applications: USPS, fMRI
[Figure: data in feature space (φ1, φ2); figure from book]

Review
• Generative vs. discriminative
  ‣ maximum conditional likelihood
• Logistic regression
• Weight space
  ‣ each example adds a penalty to all weight vectors that misclassify it
  ‣ the penalty is approximately piecewise linear
[Figure]

Example
[Figure: contour plot of –log P(Y1..3 | X1..3, W)]

Generalization: multiple classes
• One weight vector per class: Y ∈ {1, 2, …, C}
  ‣ P(Y = k | x) = Z_k / Σ_j Z_j
  ‣ Z_k = exp(w_k · x)
• In the 2-class case this reduces to the sigmoid (logistic) function

Multiclass example
[Figure: multiclass decision regions; figure from book]

Priors and conditional MAP
• P(Y | X, W) = ∏_i exp(w_{y_i} · x_i) / Z_i
  ‣ Z_i = Σ_k exp(w_k · x_i)
• As in linear regression, we can put a prior on W
  ‣ common priors: L2 (ridge), L1 (sparsity)
• Conditional MAP: max_w P(W = w | X, Y)

Software
• Logistic regression software is easily available: most stats packages provide it
  ‣ e.g., the glm function in R
  ‣ or http://www.cs.cmu.edu/~ggordon/IRLS-example/
• Most common algorithm: Newton’s method on the log-likelihood (or its L2-penalized version)
  ‣ called “iteratively reweighted least squares”
  ‣ for L1 it is slightly harder (less software available)

Historical application: Fisher iris data
[Figure: P(I. virginica) vs. petal length]

Bayesian regression
• In linear and logistic regression, we’ve looked at
  ‣ conditional MLE: max_w P(Y | X, w)
  ‣ conditional MAP: max_w P(W = w | X, Y)
• But of course, a true Bayesian would turn up a nose at both
  ‣ why?

Sample from posterior
[Figure]

Predictive distribution
[Figure]

Overfitting
• Overfit: training likelihood ≫ test likelihood
  ‣ often a result of overconfidence
• Overfitting is an indicator that the MLE or MAP approximation is a bad one
• Bayesian inference rarely overfits
  ‣ it may still lead to bad results for other reasons!
  ‣ e.g., not enough data, bad model class, …

So, we want the predictive distribution
• Most of the time…
  ‣ the graphical model is big and highly connected
  ‣ variables are high-arity or continuous
• We can’t afford exact inference
  ‣ inference reduces to numerical integration (and/or summation)
  ‣ we’ll look at randomized algorithms

Numerical integration
[Figure: surface plot of f(X, Y) over [−1, 1] × [−1, 1]]

2D is 2 easy!
• We care about high-D problems
• Often, much of the mass is hidden in a tiny fraction of the volume
  ‣ we must simultaneously try to discover it and estimate its amount

Application: SLAM

Integrals in multi-million-D
• Eliazar and Parr, IJCAI-03

Simple 1D problem
[Figure: a peaked 1D integrand on [−1, 1]]

Uniform sampling
[Figure: uniform samples of the 1D integrand]

Uniform sampling
• With X uniform over a region of volume V, E(f(X)) = (1/V) ∫ f(x) dx, so V times the sample average of f is the desired integral
• But the standard deviation can be big
• We can reduce it by averaging many samples
• But only at rate 1/√N

Importance sampling
• Instead of X ~ uniform, use X ~ Q(x), a proposal density we choose and can sample from
• Q(x) should be large where f(x) is large
• Problem: the plain sample average now estimates the wrong quantity,
  E_Q(f(X)) = ∫ Q(x) f(x) dx

Importance sampling
• Fix: average h(x) ≡ f(x)/Q(x) instead:
  E_Q(h(X)) = ∫ Q(x) h(x) dx = ∫ Q(x) f(x)/Q(x) dx = ∫ f(x) dx

Importance sampling
• So, take samples of h(X) instead of f(X)
• W_i = 1/Q(X_i) is the importance weight
• Q = 1/V yields uniform sampling
[Figure: samples drawn from a proposal concentrated where f is large]

Variance
• How does this help us control variance?
• Suppose f is big where Q is small
• Then h = f/Q is huge there
• The variance of each weighted sample can be correspondingly large
• Optimal Q?

Importance sampling, part II
• Suppose we want ∫ f(x) dx = ∫ P(x) g(x) dx = E_P(g(X))
• Pick N samples X_i from a proposal Q(X)
• Average W_i g(X_i), where the importance weight is W_i = P(X_i)/Q(X_i)
• This works because
  E_Q(W g(X)) = ∫ Q(x) [P(x)/Q(x)] g(x) dx = ∫ P(x) g(x) dx
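As a concrete sketch of the two estimators above (uniform sampling and basic importance sampling with weights W_i = 1/Q(X_i)), here is a short NumPy example. It is not from the slides: the peaked integrand f, the interval [−1, 1], and the Gaussian proposal parameters are made-up stand-ins for the 1D example in the figures.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

# Made-up peaked integrand on [-1, 1] (illustrative only).
def f(x):
    return np.exp(-200.0 * (x - 0.5) ** 2)

# Uniform sampling: X ~ U(-1, 1), volume V = 2; estimate is V * mean(f(X)).
V = 2.0
x_unif = rng.uniform(-1.0, 1.0, N)
est_uniform = V * f(x_unif).mean()

# Importance sampling: X ~ Q, a Gaussian placed where f is large (assumed parameters).
mu, sigma = 0.5, 0.1
x_is = rng.normal(mu, sigma, N)
q = np.exp(-((x_is - mu) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
w = 1.0 / q                     # importance weight W_i = 1 / Q(X_i)
est_is = np.mean(w * f(x_is))   # sample mean of h(X) = f(X) / Q(X)

exact = np.sqrt(np.pi / 200.0)  # closed form for this Gaussian bump
print(f"exact {exact:.5f}  uniform {est_uniform:.5f}  importance {est_is:.5f}")
```

With the same N, the importance-sampling estimate typically has much lower variance here because Q puts most of its samples where f has its mass, which is the point of the "Variance" slide above.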
Two variants of IS
• Same algorithm, different terminology
  ‣ want ∫ f(x) dx vs. E_P(f(X))
  ‣ W = 1/Q vs. W = P/Q

Parallel importance sampling
• Suppose we want ∫ f(x) dx = ∫ P(x) g(x) dx = E_P(g(X))
• But P(x) is unnormalized (e.g., represented by a factor graph): we know only Z P(x)

Parallel IS
• Pick N samples X_i from a proposal Q(X)
• If we knew W_i = P(X_i)/Q(X_i), we could do ordinary IS
• Instead, set W̃_i = Z P(X_i)/Q(X_i), which we can compute
  ‣ and W̄ = (1/N) Σ_i W̃_i
• Then E(W̄) = Z

Parallel IS
• Final estimate: Σ_i W̃_i g(X_i) / Σ_i W̃_i, a weighted average of the g(X_i)

Parallel IS is biased
[Figure: distributions of mean(weights) and 1/mean(weights), with E(mean(weights)) marked]
• E(W̄) = Z, but E(1/W̄) ≠ 1/Z in general

[Figure: samples of (X, Y) from the proposal]
• Q: (X, Y) ∼ N(1, 1), θ ∼ U(−π, π)
• f(x, y, θ) = Q(x, y, θ) P(o = 0.8 | x, y, θ) / Z

Posterior
• E(X, Y, θ) = (0.496, 0.350, 0.084)
[Figure: posterior over (X, Y)]

SLAM revisited
• Uses a recursive version of parallel importance sampling: the particle filter
  ‣ each sample (particle) = a trajectory over time
  ‣ sampling extends the trajectory by one step
  ‣ recursively update importance weights and renormalize
  ‣ a resampling trick avoids keeping lots of particles with low weights

Particle filter
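To make the parallel (self-normalized) importance-sampling estimate above concrete, here is a minimal NumPy sketch, again not from the slides: the unnormalized target Z·P(x), the proposal Q, and the test function g are invented for illustration. The final estimate is Σ_i W̃_i g(X_i) / Σ_i W̃_i, and mean(weights) estimates Z, while 1/mean(weights) is a biased estimate of 1/Z, as noted above.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200_000

# Unnormalized target Z*P(x): a Gaussian bump with an "unknown" scale.
# Here P = N(1, 1), so Z = 3.7 * sqrt(2*pi) and E_P[g(X)] = E_P[X] = 1.
def unnorm_p(x):
    return 3.7 * np.exp(-0.5 * (x - 1.0) ** 2)

def g(x):
    return x

# Proposal Q = N(0, 2^2): easy to sample from and to evaluate.
mu_q, sigma_q = 0.0, 2.0
x = rng.normal(mu_q, sigma_q, N)
q = np.exp(-0.5 * ((x - mu_q) / sigma_q) ** 2) / (sigma_q * np.sqrt(2 * np.pi))

w = unnorm_p(x) / q                   # unnormalized weights Z*P(X_i)/Q(X_i)
est = np.sum(w * g(x)) / np.sum(w)    # self-normalized ("parallel IS") estimate

z_true = 3.7 * np.sqrt(2 * np.pi)
print(f"E_P[g(X)] ~ {est:.4f} (true 1.0)")
print(f"mean(weights) ~ {w.mean():.4f} (true Z = {z_true:.4f})")
```

The ratio estimate is biased for finite N (E(1/W̄) ≠ 1/Z) but consistent, which is the trade-off the "Parallel IS is biased" slide illustrates; a particle filter applies this recipe recursively over time.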

