CMU CS 10601 - Estimating Parameters

Machine Learning 10-601
Tom M. Mitchell, Machine Learning Department, Carnegie Mellon University
September 29, 2011

Today:
• Gaussian Naïve Bayes
  – real-valued Xi's
  – brain image classification
• Logistic regression

Readings:
Required:
• Mitchell: "Naïve Bayes and Logistic Regression" (available on class website)
Optional:
• Bishop 1.2.4
• Bishop 4.2

Estimating Parameters
• Maximum Likelihood Estimate (MLE): choose θ that maximizes the probability of the observed data
• Maximum a Posteriori (MAP) estimate: choose θ that is most probable given the prior probability and the data

Recently:
• Bayes classifiers to learn P(Y|X)
• MLE and MAP estimates for parameters of P
• Conditional independence
• Naïve Bayes → make Bayesian learning practical
• Text classification

Today:
• Naïve Bayes and continuous variables Xi:
  – Gaussian Naïve Bayes classifier
• Learn P(Y|X) directly:
  – Logistic regression, regularization, gradient ascent
• Naïve Bayes or Logistic Regression?
  – Generative vs. discriminative classifiers

What if we have continuous Xi?
E.g., image classification: Xi is the real-valued ith pixel.
• Naïve Bayes requires P(Xi | Y = yk), but Xi is real (continuous).
• Common approach: assume P(Xi | Y = yk) follows a Normal (Gaussian) distribution.

Gaussian Distribution (also called "Normal")
p(x) is a probability density function, whose integral (not sum) is 1:

  p(x) = (1 / (σ √(2π))) exp( −(x − µ)² / (2σ²) )

Gaussian Naïve Bayes (GNB): assume

  P(Xi | Y = yk) = N(µik, σik)

Sometimes we also assume the variance σ
• is independent of Y (i.e., σi),
• or independent of Xi (i.e., σk),
• or both (i.e., σ).

Gaussian Naïve Bayes Algorithm – continuous Xi (but still discrete Y)
• Train Naïve Bayes (from training examples): for each value yk,
  – estimate the class prior πk ≡ P(Y = yk)
  – for each attribute Xi, estimate the conditional mean µik and variance σik
• Classify (Xnew):

  Ynew ← argmax_yk P(Y = yk) ∏i P(Xi_new | Y = yk)

Q: how many parameters must we estimate?

Estimating Parameters: Y discrete, Xi continuous
Maximum likelihood estimates, where j indexes training examples, i indexes features, k indexes classes, and δ(Y^j = yk) = 1 if Y^j = yk, else 0:

  µik = Σj Xi^j δ(Y^j = yk)  /  Σj δ(Y^j = yk)
  σ²ik = Σj (Xi^j − µik)² δ(Y^j = yk)  /  Σj δ(Y^j = yk)

GNB Example: classify a person's cognitive state, based on a brain image
• reading a sentence or viewing a picture?
• reading the word describing a "Tool" or a "Building"?
• answering the question, or getting confused?

Y is the mental state (reading "house" or "bottle"); the Xi are the voxel activities.
[Figure: plot of the µ's defining P(Xi | Y = "bottle"): mean fMRI activations over all training examples for Y = "bottle", from below-average to high activation.]

Classification task: is the person viewing a "tool" or a "building"?
[Figure: classification accuracy for each of 12 participants (p1–p12), with statistical significance at p < 0.05 indicated.]

Where is information encoded in the brain?
[Figure: accuracies of cubical 27-voxel classifiers centered at each significant voxel, shown on a 0.7–0.8 scale.]

Naïve Bayes: What you should know
• Designing classifiers based on Bayes rule
• Conditional independence
  – What it is
  – Why it's important
• The Naïve Bayes assumption and its consequences
  – Which (and how many) parameters must be estimated under different generative models (different forms for P(X|Y)), and why this matters
• How to train Naïve Bayes classifiers
  – MLE and MAP estimates
  – with discrete and/or continuous inputs Xi

Questions to think about:
• Can you use Naïve Bayes for a combination of discrete and real-valued Xi?
• How can we easily model just 2 of n attributes as dependent?
• What does the decision surface of a Naïve Bayes classifier look like?
• How would you select a subset of the Xi's?
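As a concrete companion to the training and classification steps above, here is a minimal sketch of Gaussian Naïve Bayes in Python/NumPy. It is not from the lecture: the function and variable names (train_gnb, classify_gnb, X, y) are illustrative, and it estimates a separate σik for every (feature, class) pair rather than using any of the shared-variance simplifications mentioned above.

```python
import numpy as np

def train_gnb(X, y):
    """MLE for Gaussian Naive Bayes: class priors plus per-class,
    per-feature means and variances (the mu_ik and sigma^2_ik above).
    X: (num_examples, num_features) array; y: (num_examples,) labels."""
    priors, means, variances = {}, {}, {}
    for k in np.unique(y):
        Xk = X[y == k]                         # rows where delta(Y^j = y_k) = 1
        priors[k] = len(Xk) / len(X)           # estimate of P(Y = y_k)
        means[k] = Xk.mean(axis=0)             # mu_ik for every feature i
        variances[k] = Xk.var(axis=0) + 1e-9   # sigma^2_ik (small floor avoids /0)
    return priors, means, variances

def classify_gnb(x_new, priors, means, variances):
    """Return argmax_k P(Y=y_k) * prod_i N(x_new_i; mu_ik, sigma_ik),
    computed in log space for numerical stability."""
    best_k, best_score = None, -np.inf
    for k in priors:
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * variances[k])
                                + (x_new - means[k]) ** 2 / variances[k])
        score = np.log(priors[k]) + log_lik
        if score > best_score:
            best_k, best_score = k, score
    return best_k
```

Working in log space matters here: with thousands of features (e.g., the voxels in the fMRI example), a direct product of densities would underflow to zero.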
Logistic Regression
Machine Learning 10-601
Tom M. Mitchell, Machine Learning Department, Carnegie Mellon University
September 29, 2011

Required reading:
• Mitchell draft chapter (see course website)
Recommended reading:
• Ng and Jordan paper (see course website)

Logistic Regression Idea:
• Naïve Bayes allows computing P(Y|X) by learning P(Y) and P(X|Y)
• Why not learn P(Y|X) directly?
• Consider learning f: X → Y, where
  – X is a vector of real-valued features <X1 ... Xn>
  – Y is boolean
  – assume all Xi are conditionally independent given Y
  – model P(Xi | Y = yk) as Gaussian N(µik, σi)
  – model P(Y) as Bernoulli(π)
• What does that imply about the form of P(Y|X)?

Derive form for P(Y|X) for continuous Xi
Under these assumptions, P(Y|X) takes the logistic form

  P(Y=1 | X = <X1, ..., Xn>) = exp(w0 + Σi wi Xi) / (1 + exp(w0 + Σi wi Xi))

Very convenient!
• implies  P(Y=0|X) = 1 / (1 + exp(w0 + Σi wi Xi))
• implies  P(Y=1|X) / P(Y=0|X) = exp(w0 + Σi wi Xi)
• implies  ln [ P(Y=1|X) / P(Y=0|X) ] = w0 + Σi wi Xi
• so we classify Y=1 exactly when w0 + Σi wi Xi > 0: a linear classification rule!

Logistic function

  f(z) = 1 / (1 + e^(−z)),  so  P(Y=1|X) = f(w0 + Σi wi Xi)

Logistic regression more generally
• Logistic regression when Y is not boolean (but still discrete-valued).
• Now y ∈ {y1 ... yR}: learn R−1 sets of weights.

  for k < R:  P(Y = yk | X) = exp(wk0 + Σi wki Xi) / (1 + Σj=1..R−1 exp(wj0 + Σi wji Xi))
  for k = R:  P(Y = yR | X) = 1 / (1 + Σj=1..R−1 exp(wj0 + Σi wji Xi))

Training Logistic Regression: MCLE
• We have L training examples: {<X^1, Y^1>, ..., <X^L, Y^L>}
• Maximum likelihood estimate for parameters W:

  W_MLE = argmax_W ∏l P(X^l, Y^l | W)

• Maximum conditional likelihood estimate:

  W_MCLE = argmax_W ∏l P(Y^l | X^l, W)

Training Logistic Regression: MCLE
• Choose parameters W = <w0, ..., wn> to maximize the conditional likelihood of the training data
• Training data D = {<X^1, Y^1>, ..., <X^L, Y^L>}
• Data likelihood = ∏l P(X^l, Y^l | W)
• Data conditional likelihood = ∏l P(Y^l | X^l, W), where the superscript l indexes training examples

Expressing Conditional Log Likelihood

  l(W) ≡ ln ∏l P(Y^l | X^l, W)
       = Σl Y^l ln P(Y^l = 1 | X^l, W) + (1 − Y^l) ln P(Y^l = 0 | X^l, W)
       = Σl Y^l (w0 + Σi wi Xi^l) − ln(1 + exp(w0 + Σi wi Xi^l))

Maximizing Conditional Log Likelihood
• Good news: l(W) is a concave function of W!
• Bad news: there is no closed-form solution that maximizes l(W)!

Maximize Conditional Log Likelihood: Gradient Ascent

  ∂l(W)/∂wi = Σl Xi^l ( Y^l − P(Y^l = 1 | X^l, W) )

Gradient ascent algorithm: iterate until the change < ε; for all i, repeat

  wi ← wi + η Σl Xi^l ( Y^l − P(Y^l = 1 | X^l, W) )

where P(Y^l = 1 | X^l, W) is the model's current estimate and η is a small step size.

That's all for M(C)LE. How about MAP?
• One common approach is to define priors on W
  – Normal distribution, zero mean, identity covariance
• Helps avoid very large weights and overfitting
• MAP estimate, assuming a Gaussian prior W ~ N(0, σ):

  W_MAP = argmax_W ln [ P(W) ∏l P(Y^l | X^l, W) ]

MLE vs MAP
• Maximum conditional likelihood estimate:

  W_MCLE = argmax_W Σl ln P(Y^l | X^l, W)

• Maximum a posteriori estimate with prior W ~ N(0, σI):

  W_MAP = argmax_W ln P(W) + Σl ln P(Y^l | X^l, W)

MAP estimates and Regularization
• Maximum a posteriori estimate with prior W ~ N(0, σI):

  W_MAP = argmax_W Σl ln P(Y^l | X^l, W) − (1 / (2σ²)) Σi wi²

  The term (1 / (2σ²)) Σi wi² is called a "regularization" term.
• Helps reduce overfitting, especially when training data is sparse
• Keeps weights nearer to zero (if P(W) is a zero-mean Gaussian prior), or whatever the prior suggests
• Used very frequently in …
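To make the gradient-ascent and MAP slides concrete, here is a minimal sketch, not from the lecture, of training logistic regression by gradient ascent on the regularized conditional log likelihood. Names such as train_logistic_map, eta, and sigma are illustrative assumptions; the code uses the P(Y=1|X) = 1/(1 + exp(−(w0 + Σi wi Xi))) form from the derivation above, treats sigma as the standard deviation of the zero-mean Gaussian prior, and leaves w0 unpenalized, which is a common but optional choice.

```python
import numpy as np

def sigmoid(z):
    """Logistic function f(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_map(X, y, sigma=1.0, eta=0.01, eps=1e-6, max_iter=10000):
    """Gradient ascent on sum_l ln P(Y^l|X^l,W) - (1/(2 sigma^2)) sum_i w_i^2.
    X: (L, n) array of real-valued features; y: (L,) array of 0/1 labels."""
    L, n = X.shape
    Xb = np.hstack([np.ones((L, 1)), X])     # prepend a column of 1s so w[0] is w0
    w = np.zeros(n + 1)
    for _ in range(max_iter):
        p1 = sigmoid(Xb @ w)                 # current estimates P(Y^l = 1 | X^l, W)
        grad = Xb.T @ (y - p1)               # sum_l Xi^l (Y^l - P(Y^l=1|X^l,W))
        grad[1:] -= w[1:] / sigma**2         # gradient of the Gaussian log prior
        w_new = w + eta * grad               # ascend the (regularized) log likelihood
        if np.max(np.abs(w_new - w)) < eps:  # iterate until change < eps
            return w_new
        w = w_new
    return w

def classify(Xnew, w):
    """Linear classification rule: predict Y=1 exactly when w0 + sum_i wi Xi > 0."""
    Xb = np.hstack([np.ones((Xnew.shape[0], 1)), Xnew])
    return (Xb @ w > 0).astype(int)
```

Setting sigma large makes the prior nearly flat, so the weights approach the MCLE solution; a small sigma pulls them toward zero, which is the regularization effect described on the final slide above.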

