Machine Learning 10-601
Tom M. Mitchell, Machine Learning Department, Carnegie Mellon University
October 4, 2011

Today:
• Logistic regression
• Generative/Discriminative classifiers

Readings:
Required:
• Mitchell: “Naïve Bayes and Logistic Regression” (available on class website)
• Bishop: Chapt. 3 through 3.2
Optional:
• Ng & Jordan (see class website)

• Consider learning f: X → Y, where
  – X is a vector of real-valued features, <X1 … Xn>
  – Y is boolean
• assume all Xi are conditionally independent given Y
• model P(Xi | Y = yk) as Gaussian N(µik, σi)
• model P(Y) as Bernoulli(π)
• What does that imply about the form of P(Y|X)?

Training Logistic Regression: MCLE
• Choose parameters W = <w0, … wn> to maximize the conditional likelihood of the training data
• Training data D = {<X^1, Y^1>, … <X^L, Y^L>}
• Data likelihood = Π_l P(X^l, Y^l | W)
• Data conditional likelihood = Π_l P(Y^l | X^l, W)
  where P(Y = 1 | X, W) = 1 / (1 + exp(−(w0 + Σ_i wi Xi)))

Expressing Conditional Log Likelihood
l(W) = ln Π_l P(Y^l | X^l, W)
     = Σ_l [ Y^l ln P(Y^l = 1 | X^l, W) + (1 − Y^l) ln P(Y^l = 0 | X^l, W) ]

Maximizing Conditional Log Likelihood
• Good news: l(W) is a concave function of W (equivalently, −l(W) is convex), so there are no suboptimal local maxima
• Bad news: no closed-form solution to maximize l(W)

Gradient Descent
Batch gradient: use error over the entire training set D
Do until satisfied:
1. Compute the gradient ∇_W E_D(W) over all of D
2. Update the vector of parameters: W ← W − η ∇_W E_D(W)

Stochastic gradient: use error over single examples
Do until satisfied:
1. Choose (with replacement) a random training example <X^l, Y^l>
2. Compute the gradient just for <X^l, Y^l>: ∇_W E_l(W)
3. Update the vector of parameters: W ← W − η ∇_W E_l(W)

Stochastic approximates Batch arbitrarily closely as η → 0
Stochastic can be much faster when D is very large
Intermediate approach: use error over subsets (mini-batches) of D

Maximize Conditional Log Likelihood: Gradient Ascent
∂l(W)/∂wi = Σ_l Xi^l (Y^l − P̂(Y^l = 1 | X^l, W))

Gradient ascent algorithm: iterate until change < ε
For all i, repeat
  wi ← wi + η Σ_l Xi^l (Y^l − P̂(Y^l = 1 | X^l, W))
(a short code sketch of these updates follows the parameter-count comparison below)

That’s all for M(C)LE. How about MAP?
• One common approach is to define a prior on the weights W = <w0, w1, … wn>
• Helps avoid very large weights and overfitting
• MAP estimate: W ← arg max_W ln [ P(W) Π_l P(Y^l | X^l, W) ]
• let’s assume a Gaussian prior: each wi ~ N(0, σ)

MLE vs MAP
• Maximum conditional likelihood estimate:
  W ← arg max_W Σ_l ln P(Y^l | X^l, W)
  wi ← wi + η Σ_l Xi^l (Y^l − P̂(Y^l = 1 | X^l, W))
• MAP estimate with Gaussian prior:
  W ← arg max_W [ ln P(W) + Σ_l ln P(Y^l | X^l, W) ]
  wi ← wi − η λ wi + η Σ_l Xi^l (Y^l − P̂(Y^l = 1 | X^l, W))
  where the −(λ/2) Σ_i wi² contribution of the prior (λ = 1/σ²) is called a “regularization” term

The Bottom Line
• Consider learning f: X → Y, where
  – X is a vector of real-valued features, <X1 … Xn>
  – Y is boolean
• assume all Xi are conditionally independent given Y
• model P(Xi | Y = yk) as Gaussian N(µik, σi)
• model P(Y) as Bernoulli(π)
• Then P(Y|X) is of the logistic form P(Y = 1 | X) = 1 / (1 + exp(−(w0 + Σ_i wi Xi))), and we can directly estimate W
• Furthermore, the same holds if the Xi are boolean
  – try proving that to yourself

Generative vs. Discriminative Classifiers
Training classifiers involves estimating f: X → Y, or P(Y|X)
Generative classifiers (e.g., Naïve Bayes)
• Assume some functional form for P(X|Y), P(Y)
• Estimate parameters of P(X|Y), P(Y) directly from training data
• Use Bayes rule to calculate P(Y | X = xi)
Discriminative classifiers (e.g., Logistic Regression)
• Assume some functional form for P(Y|X)
• Estimate parameters of P(Y|X) directly from training data

Use Naïve Bayes or Logistic Regression?
Consider
• Restrictiveness of modeling assumptions
• Rate of convergence (in amount of training data) toward the asymptotic hypothesis
  – How does increasing the number of features n influence the need for a larger training set?

Naïve Bayes vs Logistic Regression
Consider Y boolean, Xi continuous, X = <X1 … Xn>
Number of parameters to estimate:
• NB: 4n + 1
• LR: n + 1
Estimation method:
• NB parameter estimates are uncoupled
• LR parameter estimates are coupled
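Putting the MCLE/MAP updates above into code: the following is a minimal NumPy sketch, not code from the lecture. The function name train_logistic_regression, the step size eta, the iteration cap n_iters, and the knob lam are my own choices; the sketch assumes the parameterization P(Y=1|X) = 1/(1 + exp(−(w0 + Σ_i wi Xi))) used in the equations above, with lam = 1/σ² playing the role of the Gaussian-prior regularizer (lam = 0 gives the MCLE, lam > 0 the MAP estimate).

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, eta=0.1, lam=0.0, n_iters=10000, eps=1e-6):
    """Batch gradient ascent on the conditional log likelihood l(W).
    X: (L, n) array of real-valued features; y: (L,) array of 0/1 labels.
    lam = 1/sigma^2 for a N(0, sigma) prior on each wi; lam = 0 gives the MCLE."""
    L, n = X.shape
    Xb = np.hstack([np.ones((L, 1)), X])     # prepend a constant 1 so w[0] plays the role of w0
    w = np.zeros(n + 1)
    for _ in range(n_iters):
        p = sigmoid(Xb @ w)                  # P_hat(Y=1 | X^l, W) for every training example
        grad = Xb.T @ (y - p) - lam * w      # dl/dwi = sum_l Xi^l (Y^l - P_hat), minus the prior's pull
        w_new = w + eta * grad / L           # gradient ascent step (scaled by 1/L for stability)
        if np.max(np.abs(w_new - w)) < eps:  # iterate until change < epsilon
            return w_new
        w = w_new
    return w

Note that the only difference between MLE and MAP training here is the extra −η·lam·wi pull toward zero, which is exactly the “regularization” term in the MLE vs MAP slide.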
Gaussian Naïve Bayes vs. Logistic Regression
Recall the two assumptions used in deriving the form of LR from Gaussian Naïve Bayes:
1. Xi conditionally independent of Xk given Y
2. P(Xi | Y = yk) = N(µik, σi), i.e. not N(µik, σik): the variance does not depend on the class

Consider three learning methods:
• GNB (assumption 1 only) – decision surface can be non-linear
• GNB2 (assumptions 1 and 2) – decision surface linear
• LR – decision surface linear, trained differently

Which method works better if we have infinite training data, and...
• Both (1) and (2) are satisfied: LR = GNB2 = GNB
• Neither (1) nor (2) is satisfied: LR > GNB2, GNB > GNB2
• (1) is satisfied, but not (2): GNB > LR, LR > GNB2
[Ng & Jordan, 2002]

Gaussian Naïve Bayes vs. Logistic Regression
What if we have only finite training data?
They converge at different rates to their asymptotic (∞-data) error.
Let εA,n refer to the expected error of learning algorithm A after n training examples.
Let d be the number of features: <X1 … Xd>
So, GNB requires n = O(log d) examples to converge to its asymptotic error, but LR requires n = O(d).
[Ng & Jordan, 2002]

Some experiments from UCI data sets [Ng & Jordan, 2002]

Naïve Bayes vs. Logistic Regression
The bottom line:
• GNB2 and LR both use linear decision surfaces; GNB need not.
• Given infinite data, LR is better than or equal to GNB2, because its training procedure does not make assumptions (1) or (2) (though our derivation of the form of P(Y|X) did).
• But GNB2 converges more quickly to its perhaps-less-accurate asymptotic error.
• And GNB is both more biased (assumption 1) and less biased (no assumption 2) than LR, so either might beat the other.

What you should know:
• Logistic regression
  – Functional form follows from Naïve Bayes assumptions
    • For Gaussian Naïve Bayes assuming variance σi,k = σi
    • For discrete-valued Naïve Bayes too
  – But the training procedure picks parameters without making the conditional independence assumption
  – MLE training: pick W to maximize P(Y | X, W)
  – MAP training: pick W to maximize P(W | X, Y)
    • ‘regularization’
    • helps reduce overfitting
• Gradient ascent/descent
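To make the GNB2 vs LR contrast concrete, here is a minimal NumPy sketch of GNB2 (my own illustration, not from the slides; train_gnb2 and gnb2_posterior are hypothetical names). It shows the two points made above: the parameter estimates are uncoupled (each is a count, mean, or variance computed independently), and under the shared-variance assumption the posterior produced by Bayes rule has exactly the logistic, linear-decision-surface form that LR assumes but fits by coupled gradient ascent.

import numpy as np

def train_gnb2(X, y):
    """Gaussian Naive Bayes under assumptions (1) and (2): Xi independent given Y,
    P(Xi | Y=yk) = N(mu_ik, sigma_i) with the variance shared across classes.
    Every parameter is a simple count, mean, or pooled variance (uncoupled estimates)."""
    pi = y.mean()                             # Bernoulli parameter P(Y=1)
    mu0 = X[y == 0].mean(axis=0)              # mu_i0 for each feature i
    mu1 = X[y == 1].mean(axis=0)              # mu_i1 for each feature i
    # pooled within-class variance: one sigma_i^2 per feature, shared by both classes
    sigma2 = (((X[y == 0] - mu0) ** 2).sum(axis=0)
              + ((X[y == 1] - mu1) ** 2).sum(axis=0)) / len(y)
    return pi, mu0, mu1, sigma2

def gnb2_posterior(X, pi, mu0, mu1, sigma2):
    """P(Y=1 | X) via Bayes rule. Because sigma_i is shared across classes, the
    quadratic terms cancel and the log-odds is linear in X: the same functional
    form logistic regression assumes, but with W computed from the GNB parameters."""
    w = (mu1 - mu0) / sigma2                          # wi
    w0 = np.log(pi / (1 - pi)) + ((mu0 ** 2 - mu1 ** 2) / (2 * sigma2)).sum()
    return 1.0 / (1.0 + np.exp(-(w0 + X @ w)))        # sigmoid of a linear function of X

Dropping assumption (2), i.e. keeping a class-dependent σik, would stop the quadratic terms from cancelling, which is why plain GNB can have a non-linear decision surface.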
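The finite-data comparison above is easy to probe in miniature. The sketch below is entirely my own illustration (synthetic Gaussian data, off-the-shelf scikit-learn implementations standing in for GNB and regularized LR); it just prints test error versus training-set size, the kind of learning curves the Ng & Jordan UCI experiments report, without any claim about the exact numbers.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

def make_data(n, d):
    """Synthetic data satisfying the GNB assumptions: class-conditional Gaussians
    with per-feature means -0.15 / +0.15, unit variance, and balanced classes."""
    y = rng.permutation(np.arange(n) % 2)
    X = np.where(y[:, None] == 1, 0.15, -0.15) + rng.normal(size=(n, d))
    return X, y

d = 50
X_test, y_test = make_data(5000, d)
for n in [10, 30, 100, 300, 1000]:
    X_tr, y_tr = make_data(n, d)
    gnb = GaussianNB().fit(X_tr, y_tr)                       # generative: uncoupled parameter estimates
    lr = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)   # discriminative: L2-regularized (MAP-style)
    print(f"n={n:5d}  GNB test error={1 - gnb.score(X_test, y_test):.3f}"
          f"  LR test error={1 - lr.score(X_test, y_test):.3f}")

Because this data satisfies both assumptions, the expectation from the slides is that the two methods share the same asymptotic error, with GNB approaching it from fewer examples, matching the O(log d) vs O(d) picture.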