UB CSE 574 - Logistic Regression

Logistic Regression
Sargur N. Srihari
University at Buffalo, State University of New York, USA

Topics in Linear Classification using Probabilistic Discriminative Models
1. Generative vs Discriminative
2. Nonlinear basis functions in linear classification
3. Logistic Regression
   – Two-class, multi-class
   – Parameters determined using
     • Maximum Likelihood
     • Iterative Reweighted Least Squares
4. Probit Regression
5. Canonical Link Functions

Generative vs Discriminative
• Probabilistic generative models (linear)
  – Two-class: p(C1|x) is written as σ operating on a linear function of x, i.e., w^T x + w_0, for a wide choice of forms of p(x|C_k)
  – Multi-class: p(C_k|x) is given by a softmax of a linear function of x
  – MLE is used to obtain the parameters of p(x|C_k) and the priors p(C_k)
  – Can generate synthetic data from the marginal p(x)
• Probabilistic discriminative models
  – Direct approach
  – Maximize the likelihood function of the conditional distribution p(C_k|x)
• Advantages
  – Fewer adaptive parameters
  – Improved performance when the assumed forms of p(x|C_k) are poor approximations

Nonlinear Basis Functions in Linear Models
• Nonlinear transformation of the inputs using a vector of basis functions φ(x)
• [Figure: original input space (x1, x2), where the classes are not linearly separable, and feature space (φ1, φ2), where they are linearly separable]
• Although we use linear classification models, linear separability in feature space does not imply linear separability in input space

Logistic Regression
• Feature vector φ, two classes C1 and C2
• The posterior probability p(C1|φ) can be written as
  p(C1|φ) = y(φ) = σ(w^T φ)
  where φ is an M-dimensional feature vector and σ(.) is the logistic sigmoid function
• [Figure: the logistic sigmoid σ(a) plotted against a]
• Properties of the sigmoid:
  A. Symmetry: σ(-a) = 1 - σ(a)
  B. Inverse: a = ln(σ/(1-σ)), known as the logit; also known as the log odds, since it is the ratio ln[p(C1|x)/p(C2|x)]
  C. Derivative: dσ/da = σ(1 - σ)
• The goal is to determine the M parameters of w
• Known as logistic regression in statistics, although it is a model for classification rather than regression
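To make the definition above concrete, here is a minimal NumPy sketch of the two-class posterior p(C1|φ) = σ(w^T φ), with quick numerical checks of the symmetry and logit properties listed above. This is my own illustration, not code from the lecture; the function names sigmoid and posterior_c1 are assumptions of the sketch.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid: σ(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def posterior_c1(w, Phi):
    """Two-class posterior p(C1|φ) = σ(w^T φ) for each row φ of the design matrix Phi."""
    return sigmoid(Phi @ w)

# Quick numerical checks of the listed sigmoid properties
a = np.linspace(-5.0, 5.0, 11)
s = sigmoid(a)
assert np.allclose(sigmoid(-a), 1.0 - s)        # symmetry: σ(-a) = 1 - σ(a)
assert np.allclose(np.log(s / (1.0 - s)), a)    # logit (inverse): a = ln(σ / (1 - σ))
```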
Fewer Parameters in the Linear Discriminative Model
• Discriminative approach (logistic regression)
  – For an M-dimensional feature space: M adjustable parameters
• Generative approach based on Gaussians (Bayes / naïve Bayes)
  – 2M parameters for the means
  – M(M+1)/2 parameters for the shared covariance matrix
  – Priors for the two classes
  – Total of M(M+5)/2 + 1 parameters, which grows quadratically with M
• If the features are assumed independent (naïve Bayes), it still needs M+3 parameters

Determining the Logistic Regression Parameters
• Maximum likelihood approach for two classes
• Data set {φ_n, t_n}, where t_n ∈ {0, 1} and φ_n = φ(x_n), n = 1,..,N
• Since t_n is binary we can use the Bernoulli distribution
• Let y_n be the probability that t_n = 1
• The likelihood function associated with the N observations is
  p(t|w) = \prod_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1 - t_n}
  where t = (t_1,..,t_N)^T and y_n = p(C1|φ_n)

Error Function for Logistic Regression
• Taking the negative logarithm of the likelihood gives the cross-entropy error function
  E(w) = -ln p(t|w) = -\sum_{n=1}^{N} { t_n ln y_n + (1 - t_n) ln(1 - y_n) }
• We need to minimize E(w)
• At its minimum the derivative of E(w) is zero, so we need to solve ∇E(w) = 0 for w

Gradient of the Error Function
• Error function as above, with y_n = σ(w^T φ_n)
• Using the derivative of the logistic sigmoid, dσ/da = σ(1 - σ), and the rule d/dx ln f(x) = f'(x)/f(x):
  let z = z_1 + z_2, where z_1 = t ln σ(wφ) and z_2 = (1 - t) ln[1 - σ(wφ)];
  then dz_1/dw = t σ(wφ)[1 - σ(wφ)]φ / σ(wφ) and dz_2/dw = -(1 - t) σ(wφ)[1 - σ(wφ)]φ / [1 - σ(wφ)],
  so dz/dw = (t - σ(wφ))φ, and the per-point error E_n = -z has gradient (σ(wφ) - t)φ
• Gradient of the error function:
  ∇E(w) = \sum_{n=1}^{N} (y_n - t_n) φ_n
• "Error × feature vector": the contribution to the gradient from data point n is the error between the target t_n and the prediction y_n = σ(w^T φ_n), times the basis vector φ_n

Simple Sequential Algorithm
• There is no closed-form maximum likelihood solution for determining w
• Given the gradient of the error function, solve using an iterative approach (a minimal sketch follows this slide):
  w^{(τ+1)} = w^{(τ)} - η ∇E_n, where ∇E_n = (y_n - t_n) φ_n
• The maximum-likelihood solution has severe over-fitting problems for linearly separable data
• So use the IRLS algorithm
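Below is a minimal NumPy sketch of the sequential update above, run on a small toy problem. It is my own illustration, not code from the lecture; the function name, the learning rate, and the toy data are assumptions of the sketch.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sequential_logistic_fit(Phi, t, eta=0.1, n_epochs=200):
    """Sequential (stochastic) gradient descent on the cross-entropy error.

    Each step applies the per-point update from the slides:
        w <- w - eta * (y_n - t_n) * phi_n,  with  y_n = sigmoid(w^T phi_n).
    """
    n_samples, n_features = Phi.shape
    w = np.zeros(n_features)
    for _ in range(n_epochs):
        for n in range(n_samples):
            y_n = sigmoid(Phi[n] @ w)
            w -= eta * (y_n - t[n]) * Phi[n]
    return w

# Toy two-class problem with basis φ(x) = (1, x)
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-1.0, 1.0, 50), rng.normal(1.0, 1.0, 50)])
t = np.concatenate([np.zeros(50), np.ones(50)])
Phi = np.column_stack([np.ones_like(x), x])
w = sequential_logistic_fit(Phi, t)
print("learned weights:", w)
```

Because the two toy classes overlap, the learned weights stay finite; on perfectly separable data this maximum-likelihood fit would keep growing the weights, which is the over-fitting problem noted above.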
More Efficient Iterative Algorithm
• Based on second derivatives
• Called the Newton-Raphson method
• The derivative of a function at a point x is the slope of its tangent at that point
• Since we are solving for the derivative of E(w), we need the second derivative: Newton's method
• [Figure: Newton's method illustrated on the derivatives of a Gaussian p(x) ~ N(0, σ)]

Iterative Reweighted Least Squares (IRLS)
• Efficient approximation using Newton-Raphson iterative optimization:
  w^{(new)} = w^{(old)} - H^{-1} ∇E(w)
  where H is the Hessian matrix, whose elements are the second derivatives of E(w) with respect to the components of w, and ∇E(w) is the first derivative of E(w)
• The second term plays the role of f(w)/f'(w) in Newton-Raphson root finding, with f(w) = ∇E(w)

Two Applications of IRLS
• IRLS is applicable to both linear regression and logistic regression
• We discuss both; for each we need
  1. the error function E(w)
     • linear regression: sum of squared errors
     • logistic regression: Bernoulli likelihood function
  2. the gradient ∇E(w)
  3. the Hessian H = ∇∇E(w)
  4. the Newton-Raphson update w^{(new)} = w^{(old)} - H^{-1} ∇E(w)

IRLS for Linear Regression
• Model with a linear combination of the input variables:
  y(x, w) = \sum_{j=0}^{M-1} w_j φ_j(x) = w^T φ(x)
1. Error function: sum of squared errors for the data set X = {x_n, t_n}, n = 1,..,N:
  E(w) = (1/2) \sum_{n=1}^{N} { t_n - w^T φ(x_n) }^2
2. Gradient of the error function:
  ∇E(w) = \sum_{n=1}^{N} (w^T φ_n - t_n) φ_n = Φ^T Φ w - Φ^T t
3. Hessian:
  H = ∇∇E(w) = \sum_{n=1}^{N} φ_n φ_n^T = Φ^T Φ
  where Φ is the N × M design matrix whose nth row is φ_n^T:
  Φ = ( φ_0(x_1)  φ_1(x_1)  …  φ_{M-1}(x_1)
           ⋮          ⋮               ⋮
        φ_0(x_N)  φ_1(x_N)  …  φ_{M-1}(x_N) )
4. Newton-Raphson for Linear …
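As a sanity check on the quantities above, here is a minimal NumPy sketch of a single Newton-Raphson step for the sum-of-squares error, using the gradient Φ^T Φ w - Φ^T t and Hessian Φ^T Φ given above. It is my own illustration, not code from the lecture; the function name and the random data are assumptions of the sketch. Since E(w) is quadratic, one step should land exactly on the least-squares solution.

```python
import numpy as np

def newton_step_sum_of_squares(Phi, t, w):
    """One Newton-Raphson step w_new = w - H^{-1} grad for the sum-of-squares error.

    grad = Phi^T Phi w - Phi^T t  and  H = Phi^T Phi.
    """
    grad = Phi.T @ (Phi @ w) - Phi.T @ t
    H = Phi.T @ Phi
    return w - np.linalg.solve(H, grad)

# Illustrative data: because the error is quadratic, one step reaches the least-squares solution
rng = np.random.default_rng(1)
Phi = rng.normal(size=(20, 3))    # N x M design matrix
t = rng.normal(size=20)           # targets
w_new = newton_step_sum_of_squares(Phi, t, np.zeros(3))
w_ls, *_ = np.linalg.lstsq(Phi, t, rcond=None)
assert np.allclose(w_new, w_ls)
```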

