Machine Learning 10-701
Tom M. Mitchell
Machine Learning Department, Carnegie Mellon University
January 27, 2011

Today:
• Naïve Bayes – Big Picture
• Logistic regression
• Gradient ascent
• Generative – discriminative classifiers

Readings:
Required:
• Mitchell: "Naïve Bayes and Logistic Regression" (see class website)
Optional:
• Ng and Jordan paper (class website)

Gaussian Naïve Bayes – Big Picture
Consider boolean Y, continuous Xi. Assume P(Y=1) = 0.5.
What is the minimum possible error?
Best case:
• the conditional independence assumption is satisfied
• we know P(Y), P(X|Y) perfectly (e.g., infinite training data)

Logistic Regression
Idea:
• Naïve Bayes allows computing P(Y|X) by learning P(Y) and P(X|Y)
• Why not learn P(Y|X) directly?

• Consider learning f: X → Y, where
  • X is a vector of real-valued features, <X1 ... Xn>
  • Y is boolean
• assume all Xi are conditionally independent given Y
• model P(Xi | Y = yk) as Gaussian N(µik, σi)
• model P(Y) as Bernoulli(π)
• What does that imply about the form of P(Y|X)?

Derive form for P(Y|X) for continuous Xi
Applying Bayes rule under these assumptions, the feature terms collapse into a linear function of X:
  P(Y=1|X) = 1 / (1 + exp(w0 + Σi wi Xi))
Very convenient! This implies
  P(Y=0|X) = exp(w0 + Σi wi Xi) / (1 + exp(w0 + Σi wi Xi))
which implies
  ln [ P(Y=0|X) / P(Y=1|X) ] = w0 + Σi wi Xi
so thresholding this linear function of X at zero gives a linear classification rule!

Logistic function
The logistic (sigmoid) function 1 / (1 + e^(−z)): an S-shaped curve rising from 0 to 1.

Logistic regression more generally
• Logistic regression when Y is not boolean (but still discrete-valued).
• Now Y ∈ {y1 ... yR}: learn R−1 sets of weights
  for k < R:  P(Y=yk|X) = exp(wk0 + Σi wki Xi) / (1 + Σ_{j<R} exp(wj0 + Σi wji Xi))
  for k = R:  P(Y=yR|X) = 1 / (1 + Σ_{j<R} exp(wj0 + Σi wji Xi))

Training Logistic Regression: MCLE
• we have L training examples: {<X^1,Y^1>, ..., <X^L,Y^L>}
• maximum likelihood estimate for parameters W:
  W_MLE = arg max_W P(D | W)
• maximum conditional likelihood estimate:
  W_MCLE = arg max_W Π_l P(Y^l | X^l, W)

Training Logistic Regression: MCLE
• Choose parameters W = <w0, ..., wn> to maximize the conditional likelihood of the training data
• Training data D = {<X^1,Y^1>, ..., <X^L,Y^L>}
• Data likelihood = Π_l P(X^l, Y^l | W)
• Data conditional likelihood = Π_l P(Y^l | X^l, W)
  where P(Y^l | X^l, W) has the logistic form derived above

Expressing Conditional Log Likelihood
  l(W) ≡ ln Π_l P(Y^l | X^l, W)
       = Σ_l [ Y^l ln P(Y^l=1 | X^l, W) + (1 − Y^l) ln P(Y^l=0 | X^l, W) ]

Maximizing Conditional Log Likelihood
Good news: l(W) is a concave function of W!
Bad news: no closed-form solution to maximize l(W)!

Maximize Conditional Log Likelihood: Gradient Ascent
Gradient ascent algorithm: iterate until change < ε
For all i, repeat:
  wi ← wi + η Σ_l Xi^l ( Y^l − P̂(Y^l=1 | X^l, W) )

That's all for M(C)LE. How about MAP?
• One common approach is to define priors on W
  – Normal distribution, zero mean, identity covariance
• Helps avoid very large weights and overfitting
• MAP estimate:
  W_MAP = arg max_W ln [ P(W) Π_l P(Y^l | X^l, W) ]
• let's assume a Gaussian prior: W ~ N(0, σ)

MLE vs MAP
• Maximum conditional likelihood estimate:
  W_MCLE = arg max_W Σ_l ln P(Y^l | X^l, W)
• Maximum a posteriori estimate with prior W ~ N(0, σI):
  W_MAP = arg max_W [ ln P(W) + Σ_l ln P(Y^l | X^l, W) ]

MAP estimates and Regularization
• Maximum a posteriori estimate with prior W ~ N(0, σI):
  W_MAP = arg max_W [ Σ_l ln P(Y^l | X^l, W) − (1/(2σ²)) Σ_i wi² ]
  The Σ_i wi² penalty is called a "regularization" term
• helps reduce overfitting, especially when training data is sparse
• keeps weights nearer to zero (if P(W) is a zero-mean Gaussian prior), or whatever the prior suggests
• used very frequently in Logistic Regression

The Bottom Line
• Consider learning f: X → Y, where
  • X is a vector of real-valued features, <X1 ... Xn>
  • Y is boolean
• assume all Xi are conditionally independent given Y
• model P(Xi | Y = yk) as Gaussian N(µik, σi)
• model P(Y) as Bernoulli(π)
• Then P(Y|X) is of the logistic form above, and we can directly estimate W
• Furthermore, the same holds if the Xi are boolean
  • try proving that to yourself
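
The gradient ascent update and the MAP regularization above translate directly into code. The following is a minimal illustrative sketch (not from the lecture) in Python/NumPy; the names train_logistic_regression, eta, lam, etc. are my own, setting lam = 0 recovers plain MCLE, and lam > 0 corresponds to the zero-mean Gaussian prior on W.

import numpy as np

def sigmoid(z):
    # Logistic function: 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, eta=0.5, lam=0.0, n_iters=5000, eps=1e-6):
    """Batch gradient ascent on the conditional log likelihood l(W).

    X   : (L, n) array of real-valued features
    y   : (L,) array of 0/1 labels
    lam : regularization strength; 0 gives the MCLE, > 0 gives the MAP
          estimate under a zero-mean Gaussian prior on the weights
    """
    L, n = X.shape
    Xb = np.hstack([np.ones((L, 1)), X])     # prepend x0 = 1 to absorb the bias w0
    w = np.zeros(n + 1)
    for _ in range(n_iters):
        p = sigmoid(Xb @ w)                  # P(Y=1 | X^l, W) for every example
        # Gradient of l(W): sum_l X^l (Y^l - P_hat), averaged over examples
        # for a stable step size, minus the Gaussian-prior penalty lam * w.
        grad = Xb.T @ (y - p) / L - lam * w
        w_new = w + eta * grad               # ascend the (concave) objective
        if np.max(np.abs(w_new - w)) < eps:  # iterate until change < eps
            return w_new
        w = w_new
    return w

# Toy usage: two Gaussian blobs, one per class.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)), rng.normal(+1.0, 1.0, (50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])
print("learned weights:", train_logistic_regression(X, y, lam=0.1))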
Generative vs. Discriminative Classifiers
Training classifiers involves estimating f: X → Y, or P(Y|X)

Generative classifiers (e.g., Naïve Bayes)
• Assume some functional form for P(X|Y), P(Y)
• Estimate parameters of P(X|Y), P(Y) directly from training data
• Use Bayes rule to calculate P(Y|X = xi)

Discriminative classifiers (e.g., Logistic Regression)
• Assume some functional form for P(Y|X)
• Estimate parameters of P(Y|X) directly from training data

Use Naïve Bayes or Logistic Regression?
Consider:
• Restrictiveness of modeling assumptions
• Rate of convergence (in amount of training data) toward the asymptotic hypothesis – i.e., the learning curve

Naïve Bayes vs Logistic Regression
Consider Y boolean, Xi continuous, X = <X1 ... Xn>
Number of parameters to estimate:
• NB: 4n + 1
• LR: n + 1
Estimation method:
• NB parameter estimates are uncoupled
• LR parameter estimates are coupled

Gaussian Naïve Bayes vs. Logistic Regression [Ng & Jordan, 2002]
• Generative and Discriminative classifiers
• Asymptotic comparison (# training examples → infinity)
  • when the conditional independence assumptions are correct
    • GNB and LR produce identical classifiers
  • when the conditional independence assumptions are incorrect
    • LR is less biased – it does not assume conditional independence
    • therefore LR is expected to outperform GNB when both are given infinite training data

Naïve Bayes vs. Logistic Regression
• Generative and Discriminative classifiers
• Non-asymptotic analysis (see [Ng & Jordan, 2002])
  • convergence rate of parameter estimates – how many training examples are needed to assure good estimates?
    • GNB: order log n (where n = # of attributes in X)
    • LR: order n
• GNB converges more quickly to its (perhaps less accurate) asymptotic estimates
• Informally: because LR's parameter estimates are coupled, but GNB's are not

Some experiments from UCI data sets [Ng & Jordan, 2002]

Summary: Naïve Bayes and Logistic Regression
• Modeling assumptions
  – Naïve Bayes is more biased (conditional independence)
  – Both learn linear decision surfaces
• Convergence rate in training examples (n = # of attributes in X)
  – Naïve Bayes ~ O(log n)
  – Logistic regression ~ O(n)
• Bottom line
  – Naïve Bayes converges faster to its (potentially too restricted) final hypothesis

What you should know:
• Logistic regression
  – Functional form follows from Naïve Bayes assumptions
    • For Gaussian Naïve Bayes assuming variance σi,k = σi
    • For discrete-valued Naïve Bayes too
  – But the training procedure picks parameters without the conditional independence assumption
  – MLE training: pick W to maximize P(Y | X, W)
  – MAP training: pick W to maximize P(W | X, Y)
    • regularization: e.g., P(W) ~ N(0, σ)
    • helps reduce overfitting
• Gradient ascent/descent
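
To make the earlier "NB parameter estimates are uncoupled / LR parameter estimates are coupled" contrast concrete, here is a minimal illustrative sketch (not from the lecture) of Gaussian Naïve Bayes estimation: each of its 4n + 1 parameters has an independent closed-form estimate, whereas the logistic regression weights in the earlier sketch must be fit jointly by gradient ascent. The function names (fit_gaussian_naive_bayes, gnb_predict_proba) are my own.

import numpy as np

def fit_gaussian_naive_bayes(X, y):
    """Closed-form, uncoupled estimates of the 4n + 1 GNB parameters (boolean Y).

    Returns pi = P(Y=1) and per-class feature means/variances mu[k, i], var[k, i].
    Each parameter is estimated independently of all the others.
    """
    pi = y.mean()                                           # Bernoulli parameter P(Y=1)
    mu = np.stack([X[y == k].mean(axis=0) for k in (0, 1)])
    var = np.stack([X[y == k].var(axis=0) for k in (0, 1)])
    return pi, mu, var

def gnb_predict_proba(X, pi, mu, var):
    """P(Y=1 | X) via Bayes rule under the conditional independence assumption."""
    def log_joint(k, prior):
        # log P(Y=k) + sum_i log N(x_i; mu_ki, var_ki)
        return (np.log(prior)
                - 0.5 * np.sum(np.log(2.0 * np.pi * var[k])
                               + (X - mu[k]) ** 2 / var[k], axis=1))
    l0, l1 = log_joint(0, 1.0 - pi), log_joint(1, pi)
    return 1.0 / (1.0 + np.exp(l0 - l1))                    # normalize the two joint scores

# Toy usage: same kind of data as the logistic regression sketch above.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)), rng.normal(+1.0, 1.0, (50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])
pi, mu, var = fit_gaussian_naive_bayes(X, y)
print("P(Y=1 | x) for three points:", gnb_predict_proba(X[:3], pi, mu, var))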