Machine Learning 10-601
Tom M. Mitchell, Machine Learning Department, Carnegie Mellon University
September 29, 2011

Today:
• Gaussian Naïve Bayes
• real-valued Xi's
• brain image classification
• Logistic regression

Readings:
Required:
• Mitchell: "Naïve Bayes and Logistic Regression" (available on class website)
Optional:
• Bishop 1.2.4
• Bishop 4.2

Estimating Parameters
• Maximum Likelihood Estimate (MLE): choose θ that maximizes the probability of the observed data
• Maximum a Posteriori (MAP) estimate: choose θ that is most probable given the prior probability and the data

Recently:
• Bayes classifiers to learn P(Y|X)
• MLE and MAP estimates for parameters of P
• Conditional independence
• Naïve Bayes → makes Bayesian learning practical
• Text classification

Today:
• Naïve Bayes and continuous variables Xi:
  – Gaussian Naïve Bayes classifier
• Learn P(Y|X) directly:
  – Logistic regression, regularization, gradient ascent
• Naïve Bayes or Logistic Regression?
  – Generative vs. discriminative classifiers

What if we have continuous Xi?
E.g., image classification: Xi is the real-valued ith pixel.
Naïve Bayes requires P(Xi | Y = yk), but Xi is real (continuous).
Common approach: assume P(Xi | Y = yk) follows a Normal (Gaussian) distribution.

Gaussian Distribution (also called "Normal")
p(x) is a probability density function, whose integral (not sum) is 1:
  p(x) = (1 / (σ √(2π))) exp( −(x − µ)² / (2σ²) )

Gaussian Naïve Bayes (GNB): assume
  P(Xi | Y = yk) = N(µik, σik)
Sometimes we additionally assume the variance
• is independent of Y (i.e., σi),
• or independent of Xi (i.e., σk),
• or both (i.e., σ).

Gaussian Naïve Bayes Algorithm – continuous Xi (but still discrete Y)
• Train Naïve Bayes (from the examples): for each value yk
  – estimate the class prior P(Y = yk)
  – for each attribute Xi, estimate the conditional mean µik and variance σik
• Classify (Xnew):
  Ynew ← argmax_{yk} P(Y = yk) ∏_i P(Xi^new | Y = yk)
Q: how many parameters must we estimate?
(A short code sketch of this train/classify procedure appears at the end of this section.)

Estimating Parameters: Y discrete, Xi continuous
Maximum likelihood estimates (j indexes training examples, i the feature, k the class):
  µ̂_ik = ( Σ_j X_i^j δ(Y^j = y_k) ) / ( Σ_j δ(Y^j = y_k) )
  σ̂²_ik = ( Σ_j (X_i^j − µ̂_ik)² δ(Y^j = y_k) ) / ( Σ_j δ(Y^j = y_k) )
where δ(Y^j = y_k) = 1 if Y^j = y_k, else 0.

GNB Example: Classify a person's cognitive state, based on a brain image
• reading a sentence or viewing a picture?
• reading the word describing a "Tool" or a "Building"?
• answering the question, or getting confused?

Y is the mental state (reading "house" or "bottle"); the Xi are the voxel activities.
[Figure: the µ's defining P(Xi | Y = "bottle") — mean fMRI activations over all training examples for Y = "bottle", colored from below average to high.]

Classification task: is the person viewing a "tool" or a "building"?
[Figure: classification accuracy for each of 12 participants (p1–p12); accuracies are statistically significant at p < 0.05.]

Where is information encoded in the brain?
[Figure: accuracies of cubical 27-voxel classifiers centered at each significant voxel, roughly in the range 0.7–0.8.]

Naïve Bayes: What you should know
• Designing classifiers based on Bayes rule
• Conditional independence
  – what it is
  – why it's important
• The Naïve Bayes assumption and its consequences
  – which (and how many) parameters must be estimated under different generative models (different forms for P(X|Y)), and why this matters
• How to train Naïve Bayes classifiers
  – MLE and MAP estimates
  – with discrete and/or continuous inputs Xi

Questions to think about:
• Can you use Naïve Bayes for a combination of discrete and real-valued Xi?
• How can we easily model just 2 of n attributes as dependent?
• What does the decision surface of a Naïve Bayes classifier look like?
• How would you select a subset of the Xi's?
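The train/classify procedure and the MLE formulas above fit in a few lines of code. Below is a minimal sketch (not from the lecture) of a Gaussian Naïve Bayes classifier in Python with NumPy; the class and attribute names and the small variance floor `eps` (added for numerical stability) are my own choices, not part of the slides.

```python
import numpy as np

class GaussianNaiveBayes:
    """Minimal Gaussian Naive Bayes sketch: one Gaussian N(mu_ik, sigma_ik) per
    feature i and class k, plus a class prior P(Y = yk), all estimated by MLE."""

    def __init__(self, eps=1e-9):
        self.eps = eps  # variance floor for numerical stability (not in the slides)

    def fit(self, X, y):
        # X: (n_examples, n_features) real-valued; y: (n_examples,) discrete labels
        self.classes_ = np.unique(y)
        n, d = X.shape
        k = len(self.classes_)
        self.prior_ = np.empty(k)
        self.mu_ = np.empty((k, d))
        self.var_ = np.empty((k, d))
        for idx, yk in enumerate(self.classes_):
            Xk = X[y == yk]                           # examples where delta(Y = yk) = 1
            self.prior_[idx] = len(Xk) / n            # estimate of P(Y = yk)
            self.mu_[idx] = Xk.mean(axis=0)           # MLE mean mu_ik
            self.var_[idx] = Xk.var(axis=0) + self.eps  # MLE variance sigma^2_ik
        return self

    def predict(self, X):
        # Work in log space: log P(Y = yk) + sum_i log N(x_i; mu_ik, sigma^2_ik)
        log_post = np.empty((X.shape[0], len(self.classes_)))
        for idx in range(len(self.classes_)):
            log_lik = -0.5 * (np.log(2 * np.pi * self.var_[idx])
                              + (X - self.mu_[idx]) ** 2 / self.var_[idx])
            log_post[:, idx] = np.log(self.prior_[idx]) + log_lik.sum(axis=1)
        return self.classes_[np.argmax(log_post, axis=1)]

# Toy usage on two Gaussian blobs (illustrative data only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(2, 1, (50, 3))])
y = np.array([0] * 50 + [1] * 50)
print(GaussianNaiveBayes().fit(X, y).predict(X[:5]))
```

Counting the parameters this sketch stores also answers the slide's question: with K classes and n features, there are K − 1 independent prior parameters plus 2·K·n means and variances (fewer if the variance is shared across classes or features).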
Logistic Regression
Machine Learning 10-601
Tom M. Mitchell, Machine Learning Department, Carnegie Mellon University
September 29, 2011

Required reading:
• Mitchell draft chapter (see course website)
Recommended reading:
• Ng and Jordan paper (see course website)

Logistic Regression Idea
• Naïve Bayes allows computing P(Y|X) by learning P(Y) and P(X|Y)
• Why not learn P(Y|X) directly?
• Consider learning f: X → Y, where
  – X is a vector of real-valued features <X1 ... Xn>
  – Y is boolean
  – assume all Xi are conditionally independent given Y
  – model P(Xi | Y = yk) as Gaussian N(µik, σi)
  – model P(Y) as Bernoulli(π)
• What does that imply about the form of P(Y|X)?

Derive form for P(Y|X) for continuous Xi
Under these assumptions the posterior takes a very convenient form:
  P(Y=1 | X) = 1 / (1 + exp(w0 + Σ_i wi Xi))
which implies
  P(Y=0 | X) = exp(w0 + Σ_i wi Xi) / (1 + exp(w0 + Σ_i wi Xi))
which implies
  P(Y=0 | X) / P(Y=1 | X) = exp(w0 + Σ_i wi Xi)
which implies
  ln [ P(Y=0 | X) / P(Y=1 | X) ] = w0 + Σ_i wi Xi
a linear classification rule!

Logistic function
[Figure: the S-shaped logistic (sigmoid) function.]

Logistic regression more generally
• Logistic regression when Y is not boolean (but still discrete-valued)
• Now y ∈ {y1 ... yR}: learn R − 1 sets of weights
  for k < R:
    P(Y = yk | X) = exp(wk0 + Σ_i wki Xi) / (1 + Σ_{j=1}^{R−1} exp(wj0 + Σ_i wji Xi))
  for k = R:
    P(Y = yR | X) = 1 / (1 + Σ_{j=1}^{R−1} exp(wj0 + Σ_i wji Xi))

Training Logistic Regression: MCLE
• We have L training examples: {<X^1, Y^1>, ..., <X^L, Y^L>}
• Maximum likelihood estimate for parameters W:
    W_MLE = argmax_W P(<X^1, Y^1>, ..., <X^L, Y^L> | W)
• Maximum conditional likelihood estimate:
    W_MCLE = argmax_W ∏_l P(Y^l | X^l, W)

Training Logistic Regression: MCLE
• Choose parameters W = <w0, ..., wn> to maximize the conditional likelihood of the training data
• Training data D = {<X^1, Y^1>, ..., <X^L, Y^L>}
• Data likelihood = ∏_l P(X^l, Y^l | W)
• Data conditional likelihood = ∏_l P(Y^l | X^l, W)
so we maximize the conditional log likelihood l(W) ≡ ln ∏_l P(Y^l | X^l, W).

Expressing Conditional Log Likelihood
Writing the boolean Y as 0/1:
  l(W) = Σ_l ln P(Y^l | X^l, W)
       = Σ_l [ Y^l ln P(Y^l=1 | X^l, W) + (1 − Y^l) ln P(Y^l=0 | X^l, W) ]
       = Σ_l [ (1 − Y^l)(w0 + Σ_i wi Xi^l) − ln(1 + exp(w0 + Σ_i wi Xi^l)) ]
using the form of P(Y|X) derived above.

Maximizing Conditional Log Likelihood
Good news: l(W) is a concave function of W.
Bad news: there is no closed-form solution that maximizes l(W).

Maximize Conditional Log Likelihood: Gradient Ascent
  ∂l(W)/∂wi = Σ_l Xi^l ( P̂(Y^l=1 | X^l, W) − Y^l )
(the sign reflects the convention above, in which a larger w0 + Σ_i wi Xi favors Y=0)

Gradient ascent algorithm: iterate until the change in l(W) is < ε:
  for all i, repeat
    wi ← wi + η Σ_l Xi^l ( P̂(Y^l=1 | X^l, W) − Y^l )
where η is a small step size (learning rate). (A code sketch of this procedure follows at the end of this section.)

That's all for M(C)LE. How about MAP?
• One common approach is to define priors on W
  – Normal distribution, zero mean, identity covariance
• Helps avoid very large weights and overfitting
• MAP estimate, assuming a Gaussian prior W ~ N(0, σI):
    W_MAP = argmax_W ln [ P(W) ∏_l P(Y^l | X^l, W) ]

MLE vs MAP
• Maximum conditional likelihood estimate:
    W_MCLE = argmax_W Σ_l ln P(Y^l | X^l, W)
• Maximum a posteriori estimate with prior W ~ N(0, σI):
    W_MAP = argmax_W [ ln P(W) + Σ_l ln P(Y^l | X^l, W) ]

MAP estimates and Regularization
• Maximum a posteriori estimate with prior W ~ N(0, σI):
    W_MAP = argmax_W [ ln P(W) + Σ_l ln P(Y^l | X^l, W) ]
  The zero-mean Gaussian prior contributes ln P(W) = −(1/(2σ²)) Σ_i wi² + const, called a "regularization" term.
• helps reduce overfitting, especially when training data is sparse
• keeps weights nearer to zero (if P(W) is a zero-mean Gaussian prior), or whatever the prior suggests
• used very frequently in logistic regression
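To make the gradient-ascent and MAP/regularization discussion concrete, here is a minimal sketch (mine, not from the lecture) of MAP training for binary logistic regression with a zero-mean Gaussian prior on the weights. It follows the sign convention used above, P(Y=1|X) = 1 / (1 + exp(w0 + Σ_i wi Xi)); the function names, the step size eta, the prior variance sigma2, and the iteration count are illustrative choices, and the intercept is left unregularized as is common practice.

```python
import numpy as np

def train_logistic_map(X, y, eta=0.01, sigma2=10.0, n_iters=2000):
    """Gradient ascent on l(W) + ln P(W) with prior w1..wn ~ N(0, sigma2 * I).

    Convention (as in the derivation above): P(Y=1 | x, W) = 1 / (1 + exp(w0 + w.x)).
    """
    n, d = X.shape
    w0, w = 0.0, np.zeros(d)
    for _ in range(n_iters):
        z = w0 + X @ w
        p1 = 1.0 / (1.0 + np.exp(z))        # P(Y=1 | x, W) under this convention
        # dl/dwi = sum_l Xi^l * (P(Y^l=1|X^l,W) - Y^l); the prior adds -wi / sigma2
        grad_w = X.T @ (p1 - y) - w / sigma2
        grad_w0 = np.sum(p1 - y)            # intercept term, not regularized here
        w += eta * grad_w                   # ascend the (concave) objective
        w0 += eta * grad_w0
    return w0, w

def predict_y1_prob(w0, w, X):
    return 1.0 / (1.0 + np.exp(w0 + X @ w))

# Toy usage on two overlapping Gaussian classes (illustrative data only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (100, 2)), rng.normal(1, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
w0, w = train_logistic_map(X, y)
print((predict_y1_prob(w0, w, X) > 0.5).astype(int)[:5], y[:5])
```

Dividing the prior's gradient contribution −wi/σ² by the step size recovers the usual view of regularization: a larger prior variance σ² means weaker shrinkage toward zero, and σ² → ∞ recovers the plain MCLE update.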