Machine Learning 10-701/15-781, Spring 2008
Logistic Regression -- generative versus discriminative classifiers
Eric Xing
Lecture 5, January 30, 2008
Reading: Chap. 3.1.3-4, CB

Generative vs. Discriminative Classifiers

- Goal: we wish to learn f: X -> Y, e.g., P(Y|X).
- Generative classifiers (e.g., Naïve Bayes):
  - Assume some functional form for P(X|Y) and P(Y). This is a 'generative' model of the data.
  - Estimate the parameters of P(X|Y) and P(Y) directly from the training data.
  - Use Bayes rule to calculate P(Y | X = x_i).
- Discriminative classifiers:
  - Directly assume some functional form for P(Y|X). This is a 'discriminative' model of the data.
  - Estimate the parameters of P(Y|X) directly from the training data.

Consider a Gaussian Generative Classifier

- We learn f: X -> Y, where X is a vector of real-valued features <X_1, ..., X_n> and Y is boolean.
- What does that imply about the form of P(Y|X)?
- The joint probability of a datum and its label is
  $$ p(x_n, y_n^k = 1 \mid \mu, \sigma) = p(y_n^k = 1)\, p(x_n \mid y_n^k = 1, \mu_k, \sigma)
     = \pi_k \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\!\left\{-\frac{(x_n - \mu_k)^2}{2\sigma^2}\right\}. $$
- Given a datum x_n, we predict its label using the conditional probability of the label given the datum:
  $$ p(y_n^k = 1 \mid x_n, \mu, \sigma)
     = \frac{\pi_k \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\!\left\{-\frac{(x_n - \mu_k)^2}{2\sigma^2}\right\}}
            {\sum_{k'} \pi_{k'} \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\!\left\{-\frac{(x_n - \mu_{k'})^2}{2\sigma^2}\right\}}. $$

Naïve Bayes Classifier

- When X is a multivariate Gaussian vector, the joint probability of a datum and its label is
  $$ p(\vec{x}_n, y_n^k = 1 \mid \vec{\mu}, \Sigma) = p(y_n^k = 1)\, p(\vec{x}_n \mid y_n^k = 1, \vec{\mu}_k, \Sigma)
     = \pi_k \frac{1}{(2\pi)^{m/2}|\Sigma|^{1/2}} \exp\!\left\{-\tfrac{1}{2}(\vec{x}_n - \vec{\mu}_k)^T \Sigma^{-1} (\vec{x}_n - \vec{\mu}_k)\right\}. $$
- The naïve Bayes simplification treats the features as conditionally independent given the class:
  $$ p(x_n, y_n^k = 1 \mid \mu, \sigma) = p(y_n^k = 1) \prod_j p(x_n^j \mid y_n^k = 1, \mu_{j,k}, \sigma_{j,k})
     = \pi_k \prod_j \frac{1}{(2\pi\sigma_{j,k}^2)^{1/2}} \exp\!\left\{-\frac{(x_n^j - \mu_{j,k})^2}{2\sigma_{j,k}^2}\right\}. $$
- More generally:
  $$ p(x_n, y_n \mid \eta, \pi) = p(y_n \mid \pi) \prod_{j=1}^{m} p(x_n^j \mid y_n, \eta), $$
  where p(. | .) is an arbitrary conditional (discrete or continuous) 1-D density.

The predictive distribution

- Understanding the predictive distribution:
  $$ p(y_n^1 = 1 \mid x_n, \mu, \Sigma, \pi)
     = \frac{p(y_n^1 = 1, x_n \mid \mu, \Sigma, \pi)}{p(x_n \mid \mu, \Sigma, \pi)}
     = \frac{\pi_1\, N(x_n \mid \mu_1, \Sigma)}{\sum_k \pi_k\, N(x_n \mid \mu_k, \Sigma)}. \quad (*) $$
- Under the naïve Bayes assumption:
  $$ p(y_n^k = 1 \mid x_n, \mu, \sigma, \pi)
     = \frac{\exp\!\left\{-\sum_j \left[\frac{(x_n^j - \mu_{j,k})^2}{2\sigma_{j,k}^2} + \log\sigma_{j,k}\right] + \log\pi_k\right\}}
            {\sum_{k'} \exp\!\left\{-\sum_j \left[\frac{(x_n^j - \mu_{j,k'})^2}{2\sigma_{j,k'}^2} + \log\sigma_{j,k'}\right] + \log\pi_{k'}\right\}}. \quad (**) $$
- For two classes (i.e., K = 2), when the two classes have the same variance, (**) turns out to be the logistic function:
  $$ p(y_n^1 = 1 \mid x_n)
     = \frac{1}{1 + \exp\!\left\{-\sum_j \frac{\mu_{j,1} - \mu_{j,2}}{\sigma_j^2}\, x_n^j + \sum_j \frac{\mu_{j,1}^2 - \mu_{j,2}^2}{2\sigma_j^2} - \log\frac{\pi_1}{1 - \pi_1}\right\}}
     = \frac{1}{1 + e^{-\theta^T x_n}}. $$

The decision boundary

- The predictive distribution for two classes:
  $$ p(y_n^1 = 1 \mid x_n) = \frac{1}{1 + \exp\!\left\{-\theta_0 - \sum_{j=1}^{M} \theta_j x_n^j\right\}} = \frac{1}{1 + e^{-\theta^T x_n}}. $$
- The Bayes decision rule:
  $$ \ln\frac{p(y_n^1 = 1 \mid x_n)}{p(y_n^2 = 1 \mid x_n)}
     = \ln\frac{\frac{1}{1 + e^{-\theta^T x_n}}}{\frac{e^{-\theta^T x_n}}{1 + e^{-\theta^T x_n}}}
     = \theta^T x_n, $$
  so we predict class 1 whenever θ^T x_n > 0.
- For multiple classes (i.e., K > 2), (*) corresponds to a softmax function:
  $$ p(y_n^k = 1 \mid x_n) = \frac{e^{-\theta_k^T x_n}}{\sum_j e^{-\theta_j^T x_n}}. $$

Discussion: Generative and discriminative classifiers

- Generative: modeling the joint distribution of all the data.
- Discriminative: modeling only the points at the boundary.
- How? Regression! (A small numerical check of the reduction of (**) to a logistic function follows below.)
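The reduction of (**) to a logistic function is easy to check numerically. Below is a minimal sketch (not from the lecture) that compares the Bayes-rule posterior of a two-class, shared-variance Gaussian naïve Bayes model with the equivalent logistic form; the parameter values mu1, mu2, sigma2, and pi1, and the helper names, are arbitrary assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary example parameters (assumptions for this sketch, not from the slides).
mu1 = np.array([1.0, -0.5])      # class-1 mean
mu2 = np.array([-1.0, 0.5])      # class-2 mean
sigma2 = np.array([0.8, 1.2])    # per-feature variances, shared by both classes
pi1 = 0.6                        # prior P(y = 1)

def posterior_bayes(x):
    """P(y=1 | x) via Bayes rule on the generative model, as in (**)."""
    def log_lik(mu):
        # The shared Gaussian normalizing constants cancel in the ratio, so they are omitted.
        return -0.5 * np.sum((x - mu) ** 2 / sigma2)
    a = np.log(pi1) + log_lik(mu1)
    b = np.log(1.0 - pi1) + log_lik(mu2)
    return 1.0 / (1.0 + np.exp(b - a))

# Coefficients of the equivalent logistic form 1 / (1 + exp(-(theta0 + theta^T x))).
theta = (mu1 - mu2) / sigma2
theta0 = np.log(pi1 / (1.0 - pi1)) - np.sum((mu1 ** 2 - mu2 ** 2) / (2.0 * sigma2))

def posterior_logistic(x):
    """P(y=1 | x) written directly as a logistic function of a linear score."""
    return 1.0 / (1.0 + np.exp(-(theta0 + x @ theta)))

x = rng.normal(size=2)
print(posterior_bayes(x), posterior_logistic(x))   # the two values agree up to float error
```

The point of the check is the one made on the slide: with equal class variances, the quadratic terms in x cancel, leaving a linear score inside a sigmoid.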
Linear regression

- The data: {(x_1, y_1), (x_2, y_2), (x_3, y_3), ..., (x_N, y_N)}.
- Both nodes are observed:
  - X is an input vector.
  - Y is a response vector (we first consider y as a generic continuous response vector, then we consider the special case of classification, where y is a discrete indicator).
- A regression scheme can be used to model p(y|x) directly, rather than p(x, y).

Logistic regression (sigmoid classifier)

- The conditional distribution is a Bernoulli:
  $$ p(y \mid x) = \mu(x)^{y} (1 - \mu(x))^{1-y}, $$
  where μ is a logistic function:
  $$ \mu(x) = \frac{1}{1 + e^{-\theta^T x}}. $$
- We can use the brute-force gradient method as in linear regression.
- But we can also apply generic results by observing that p(y|x) is an exponential-family function, more specifically a generalized linear model (see next lecture!).

Training Logistic Regression: MCLE

- Estimate the parameters θ = <θ_0, θ_1, ..., θ_n> to maximize the conditional likelihood of the training data.
- Training data: D = {(x_1, y_1), ..., (x_N, y_N)}.
- Data likelihood = ∏_n p(x_n, y_n | θ).
- Data conditional likelihood = ∏_n p(y_n | x_n, θ).

Expressing Conditional Log Likelihood

- Recall the logistic function μ(x) = 1/(1 + e^{-θ^T x}); the conditional log likelihood is
  $$ l(\theta) = \sum_n \left[ y_n \log \mu(x_n) + (1 - y_n) \log(1 - \mu(x_n)) \right]. $$

Maximizing Conditional Log Likelihood

- The objective: maximize l(θ) over θ.
- Good news: l(θ) is a concave function of θ.
- Bad news: there is no closed-form solution that maximizes l(θ).

Gradient Ascent

- Property of the sigmoid function: for μ(z) = 1/(1 + e^{-z}), dμ/dz = μ(1 - μ).
- The gradient:
  $$ \frac{\partial l(\theta)}{\partial \theta_i} = \sum_n \left( y_n - \mu(x_n) \right) x_n^i. $$
- The gradient ascent algorithm iterates until the change is < ε: for all i, repeat
  $$ \theta_i \leftarrow \theta_i + \eta \sum_n \left( y_n - \mu(x_n) \right) x_n^i. $$
  (A minimal NumPy sketch of this update appears after these notes.)

How about MAP?

- It is very common to use regularized maximum likelihood.
- One common approach is to define a prior on θ: a normal distribution with zero mean and identity covariance. This helps avoid very large weights and overfitting.
- The MAP estimate maximizes the log posterior rather than the conditional likelihood alone.

MLE vs MAP

- Maximum conditional likelihood estimate: θ_MLE = argmax_θ ∑_n ln p(y_n | x_n, θ).
- Maximum a posteriori estimate: θ_MAP = argmax_θ [ln p(θ) + ∑_n ln p(y_n | x_n, θ)].

Logistic regression: practical issues

- IRLS takes O(Nd^3) per iteration, where N = number of training cases and d = dimension of the input x.
- Quasi-Newton methods, which approximate the Hessian, work faster.
- Conjugate gradient takes O(Nd) per iteration and usually works best in practice.
- Stochastic gradient descent can also be used if N is large, cf. the perceptron rule.

Naïve Bayes vs Logistic Regression

- Consider Y boolean and X continuous, X = <X_1, ..., X_n>.
- Number of parameters to estimate: NB estimates class priors and per-class, per-feature means and variances, while LR estimates a single weight vector over the n features plus a bias.
- Estimation method:
  - NB parameter estimates are uncoupled.
  - LR parameter estimates are coupled.

Naïve Bayes vs Logistic Regression (asymptotic)

- Asymptotic comparison (# training examples -> infinity):
  - When the model assumptions are correct, NB and LR produce identical classifiers.
  - When the model assumptions are incorrect, LR is less biased (it does not assume conditional independence) and is therefore expected to outperform NB.

Naïve Bayes vs Logistic Regression (non-asymptotic)

- Non-asymptotic analysis (see [Ng & Jordan, 2002]): the convergence rate of the parameter estimates, i.e., how many training examples are needed to assure good estimates?
  - NB: order log n (where n = # of attributes in X).
  - LR: order n.
- NB converges more quickly to its (perhaps less helpful) asymptotic estimates. (A small empirical comparison along these lines is sketched at the end of these notes.)

Rate of convergence: logistic regression

- Let h_{Dis,m} be logistic regression trained on m examples in n dimensions. Then, with high probability, its error exceeds that of the asymptotic classifier h_{Dis,∞} by only a small additive term.
- Implication: if we want that excess error to be at most some small constant ε_0, it suffices to pick on the order of n examples.
- So logistic regression converges to its asymptotic classifier in order n examples.
- The result follows from Vapnik's structural risk bound, plus the fact that the "VC dimension" of an n-dimensional linear classifier is n + 1.
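As a concrete illustration of the MCLE/MAP training described above, here is a minimal NumPy sketch of the gradient-ascent update θ_i ← θ_i + η ∑_n (y_n − μ(x_n)) x_n^i, with an optional zero-mean Gaussian prior (the λθ term). The function name train_logistic_regression, the step size η, the regularization weight λ, and the synthetic data are assumptions made for this example; it is not the IRLS or conjugate-gradient code the practical-issues slide refers to.

```python
import numpy as np

def train_logistic_regression(X, y, eta=0.01, lam=0.0, n_iters=2000, eps=1e-6):
    """Gradient-ascent sketch of MCLE (lam = 0) or MAP with a zero-mean Gaussian
    prior on theta (lam > 0, i.e. L2 regularization). Hypothetical helper.

    X : (N, d) array of inputs; a constant-1 column is appended for the bias.
    y : (N,) array of 0/1 labels.
    """
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])        # augment with bias feature
    theta = np.zeros(Xb.shape[1])
    for _ in range(n_iters):
        mu = 1.0 / (1.0 + np.exp(-Xb @ theta))           # mu(x_n) = P(y_n = 1 | x_n, theta)
        grad = Xb.T @ (y - mu) - lam * theta             # grad of conditional log likelihood (+ log prior)
        theta_new = theta + eta * grad                    # gradient ascent step
        if np.max(np.abs(theta_new - theta)) < eps:       # stop when the change is small
            return theta_new
        theta = theta_new
    return theta

# Tiny usage example on synthetic data (illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] - 0.5 * X[:, 1] + 0.3 * rng.normal(size=200) > 0).astype(float)
theta_mle = train_logistic_regression(X, y)              # MCLE
theta_map = train_logistic_regression(X, y, lam=1.0)     # MAP / regularized
print(theta_mle, theta_map)
```

This is the batch form of the update; a stochastic variant would update θ from one example at a time, matching the perceptron-like rule mentioned under practical issues.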


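The Ng & Jordan comparison above can also be explored empirically. The sketch below is an illustration on assumed synthetic data, not an experiment from the lecture: it trains Gaussian naïve Bayes and (L2-regularized) logistic regression on increasing numbers of examples and reports test accuracy, using scikit-learn's GaussianNB and LogisticRegression for brevity.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

# Synthetic two-class Gaussian data (assumed for illustration): NB's assumptions hold here,
# so both classifiers approach the same asymptotic accuracy, with NB typically getting
# close on fewer examples.
rng = np.random.default_rng(0)
n_features = 20

def sample(n):
    y = rng.integers(0, 2, size=n)
    means = np.where(y[:, None] == 1, 0.3, -0.3)          # class-dependent mean per feature
    X = rng.normal(loc=means, scale=1.0, size=(n, n_features))
    return X, y

X_test, y_test = sample(5000)
for m in [20, 50, 200, 1000]:
    X_train, y_train = sample(m)
    nb = GaussianNB().fit(X_train, y_train)
    lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print(f"m={m:4d}  NB acc={nb.score(X_test, y_test):.3f}  LR acc={lr.score(X_test, y_test):.3f}")
```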