
# LSU EXST 7034 - Logistic regression


EXST7034: Regression Techniques, Geaghan, Logistic regression

## Simple linear regression on an indicator variable: a precursor to logistic regression

This is a simple linear regression in which the dependent variable takes a value of either 0 or 1.

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \quad \text{where } Y_i = 0, 1$$

This is called a binary response, and the interpretation of $E(Y_i)$ differs from that of the usual response variable. Given that $E(\varepsilon_i) = 0$, then $E(Y_i) = \beta_0 + \beta_1 X_i$.

If $Y_i$ is a Bernoulli random variable, its probability distribution is $P(Y_i = 1) = \pi_i$ when $Y_i = 1$ and $P(Y_i = 0) = 1 - \pi_i$ when $Y_i = 0$. The expected value of the distribution is then given by

$$E(Y_i) = \beta_0 + \beta_1 X_i = 1(\pi_i) + 0(1 - \pi_i) = \pi_i$$

so the mean response at $X_i$ is the probability that $Y_i = 1$.

Issues when the response variable is binary.

1) The residuals are not normally distributed. The residuals, defined as $Y_i - (\beta_0 + \beta_1 X_i)$, take on only two values at any given $X_i$. When $Y_i = 1$ the value is $1 - \beta_0 - \beta_1 X_i$, and when $Y_i = 0$ the value is $-\beta_0 - \beta_1 X_i$.

2) The residuals are not homogeneous. The variance of $Y_i$ is given by

$$\sigma^2\{Y_i\} = E\{(Y_i - E(Y_i))^2\} = (1 - \pi_i)^2 \pi_i + (0 - \pi_i)^2 (1 - \pi_i) = \pi_i (1 - \pi_i) = (E\{Y_i\})(1 - E\{Y_i\})$$

The variance of the residual is the same as the variance of $Y_i$, because $\varepsilon_i = Y_i - \pi_i$ and $\pi_i$ is a constant:

$$\sigma^2\{\varepsilon_i\} = \sigma^2\{Y_i\} = \pi_i (1 - \pi_i)$$

Because $\pi_i$ depends on $X_i$, the variance changes with $X_i$.

3) Since the response ranges between 0 and 1, there should be a constraint on the mean response such that $0 \le E\{Y_i\} = \pi_i \le 1$.

Any solution should address these issues.

## First model: simple linear regression on an indicator variable

[Figure: straight-line fit to the 0/1 indicator response]

This is a "primitive" version of regression on an indicator variable. The predicted value ($\hat{Y}$) is interpreted as “the probability of getting a 1”. However, this fitted line does not address any of the issues stated above. It does nothing to address the lack of normality or the heterogeneity of variance, and it does not keep the line from going below 0 or above 1. This solution does not have particularly desirable properties.
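The behavior of this first model can be illustrated with a minimal sketch. The data below are a hypothetical toy set, not taken from the course's SAS example, and the helper names `fitted` and `bernoulli_variance` are illustrative only. The sketch fits an ordinary least squares line to a 0/1 response and shows that the predicted "probabilities" leave the 0 to 1 range, and that the Bernoulli variance $\pi_i(1 - \pi_i)$ is not constant.

```python
import numpy as np

# Hypothetical toy data (an assumption for illustration):
# X is a quantitative predictor, Y is a 0/1 indicator response.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1], dtype=float)

# Ordinary least squares fit of Y = b0 + b1*X.
# np.polyfit returns coefficients from highest degree down: slope, then intercept.
b1, b0 = np.polyfit(x, y, deg=1)


def fitted(x_new):
    """Predicted 'probability of a 1' from the straight-line fit."""
    return b0 + b1 * np.asarray(x_new, dtype=float)


def bernoulli_variance(pi):
    """Variance pi*(1 - pi) of a Bernoulli response with success probability pi."""
    pi = np.asarray(pi, dtype=float)
    return pi * (1.0 - pi)


# The straight line is unbounded, so predictions escape the 0-1 range.
print("fitted values at X = 0, 5.5, 12:", fitted([0.0, 5.5, 12.0]))

# The variance depends on pi, so it cannot be homogeneous across X.
print("variance at pi = 0.1, 0.5, 0.9:", bernoulli_variance([0.1, 0.5, 0.9]))
```

With these toy data the fitted value is negative at X = 0 and greater than 1 at X = 12, which is the third issue in concrete form.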
See SAS example: SLR on p and on indicator variable.

## Second model: a sigmoid response variable

The probit analysis, based on the standard normal cumulative distribution, will be discussed later. However, it is similar to the logistic function: the probit is built on the normal density function, and the logistic function has a very similar shape.

The full version of the logistic model was discussed in the section on nonlinear models as a common growth model. This three-parameter model is not linear and cannot be fitted with PROC REG or PROC GLM or even PROC LOGISTIC.

$$E(Y_i) = \frac{\beta_2}{1 + e^{-\beta_0 - \beta_1 X_i}}$$

The model fitted by PROC LOGISTIC is much simplified, since the upper bound ($\beta_2$) is known to be 1. The logistic mean response function is

$$E\{Y_i\} = \frac{\exp(\beta_0 + \beta_1 X_i)}{1 + \exp(\beta_0 + \beta_1 X_i)} = [1 + \exp(-\beta_0 - \beta_1 X_i)]^{-1}$$

The logistic model can be derived from the logit transformation of the probability $\pi_i$, which is $\log_e\left(\dfrac{\pi_i}{1 - \pi_i}\right)$. The final logistic model is then

$$\log_e\left(\frac{\pi_i}{1 - \pi_i}\right) = \beta_0 + \beta_1 X_i$$

The ratio $\dfrac{\pi_i}{1 - \pi_i}$ is called the “odds”, and the natural log of this is called the logit response function.

What are odds? Odds express the likelihood that some event happens compared to the likelihood that it does not happen. If the odds on a horse in a race are 30 to 1, is that horse likely to win? Or to lose? If the odds of an event happening are 50:50, what does that mean? What if the odds are 1:1? How is that different?

The odds are simply the ratio of the probability of the occurrence of an event to the probability of that event not occurring. The values 50/50 and 1/1 both produce odds equal to “1”, so they are the same odds. If the odds are 1, then the likelihood of something happening is equal to the likelihood that it will not happen. To simplify our concepts we will think of the odds as the ratio of two probabilities.
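Before going further with odds, the relationship between the logit and the logistic mean response function can be checked numerically. This is a minimal sketch: the coefficient values `b0` and `b1` are hypothetical, and the function names are illustrative rather than part of any SAS procedure.

```python
import math


def odds(p):
    """Odds of an event with probability p: p / (1 - p)."""
    return p / (1.0 - p)


def logit(p):
    """Natural log of the odds (the logit response function)."""
    return math.log(odds(p))


def logistic_mean(x, b0, b1):
    """Logistic mean response: [1 + exp(-b0 - b1*x)]**-1."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))


# Hypothetical coefficients, chosen only for illustration.
b0, b1 = -3.0, 0.5

# The logit undoes the logistic: logit(E{Y}) returns the linear predictor.
p = logistic_mean(4.0, b0, b1)
print("probability at X = 4:", p)
print("logit of that probability:", logit(p), "linear predictor:", b0 + b1 * 4.0)

# 50:50 and 1:1 describe the same odds.
print("odds for 50/100 and 1/2:", odds(50 / 100), odds(1 / 2))
```

The logit of the predicted probability recovers the linear predictor $\beta_0 + \beta_1 X_i$, which is why the logit scale is the one on which the model is linear.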
The probability that some event happens (success) will be equal to p. The probability of failure will be 1 - p. The odds are given by p/(1 - p).

What if the odds are 2? This means that p is twice as large as (1 - p), so success is twice as likely as failure. If the odds are 10, success is 10 times more likely than failure. Odds are also commonly expressed as percents, so odds of 1.5 mean the probability of success is 50% greater than the probability of failure. For odds of 2 the probability of success is 100% greater than the probability of failure. If the odds are 0.5, then failure is twice as likely as success. Odds of 0.1 mean failure is ten times more likely than success.

Detransforming odds: the logistic analysis produces “log odds” as predicted values of the dependent variable. Odds are obtained by taking the antilog, $\exp(\hat{Y}_i) = \text{odds}_i$. The probability can then be obtained by calculating $p_i = \text{odds}_i / (1 + \text{odds}_i)$.

## Odds ratios

Although the odds are a ratio, they are usually referred to as just “odds”. The “odds ratio” is a different value. In analysis of variance the tests of interest are often differences in means, and in regression the tests of interest often involve the change in Y per unit X, which is the difference between the mean of Y at $X_i$ and the mean of Y at $X_i + 1$. When log odds are used as the dependent variable, the difference in “means” is the difference in the estimated log odds. For analysis of variance the difference would be

$$\log_e\left(\frac{\pi_i}{1 - \pi_i}\right) - \log_e\left(\frac{\pi_j}{1 - \pi_j}\right) = \log_e\left(\frac{\pi_i / (1 - \pi_i)}{\pi_j / (1 - \pi_j)}\right)$$

The ratio inside the logarithm on the right is referred to as the “odds ratio”. Likewise for regression, where the slope is the change in Y per unit X, the slope would be given as the log of an odds ratio,

$$\log_e\left(\frac{\pi_{X+1} / (1 - \pi_{X+1})}{\pi_X / (1 - \pi_X)}\right)$$

Recall the three issues previously mentioned for working with 0, 1 indicator variables.
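Before returning to those three issues, the detransformation and odds-ratio calculations described above can be sketched in a few lines. The coefficients `b0` and `b1` are hypothetical values chosen for illustration, and the function names are not from any particular package.

```python
import math


def odds_from_log_odds(log_odds):
    """Antilog of a predicted log odds value: exp(Y-hat)."""
    return math.exp(log_odds)


def prob_from_odds(odds):
    """Probability recovered from odds: odds / (1 + odds)."""
    return odds / (1.0 + odds)


def odds_ratio(p_i, p_j):
    """Ratio of the odds at probability p_i to the odds at probability p_j."""
    return (p_i / (1.0 - p_i)) / (p_j / (1.0 - p_j))


# Hypothetical fitted coefficients on the log odds scale (an assumption).
b0, b1 = -3.0, 0.5


def prob_at(x):
    """Predicted probability at x, obtained by detransforming the log odds."""
    return prob_from_odds(odds_from_log_odds(b0 + b1 * x))


# Odds of 2 correspond to p = 2/3; odds of 0.5 correspond to p = 1/3.
print("p for odds 2 and 0.5:", prob_from_odds(2.0), prob_from_odds(0.5))

# Odds ratio for a one-unit increase in X; its natural log equals the slope b1.
or_unit = odds_ratio(prob_at(5.0), prob_at(4.0))
print("odds ratio per unit X:", or_unit, "log of it:", math.log(or_unit))
```

Under the logistic model the odds ratio for a one-unit increase in X is the same at every X, which is why the slope can be reported as a single log odds ratio.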
The use of logistic or probit analyses addresses the third issue mentioned above, the constraint that the response ranges between 0 and 1. These sigmoid curves can be limited to this range. Also note that the odds are not restricted to the 0, 1 range. The probit function is based on a Z distribution transformation, and has a standard normal density
