Count Data

1. Estimating & testing proportions

Ten customers, 2 purchase a product. We estimate the probability p of purchase as $\hat{p} = 2/10 = 0.20$ for all customers. Could p really be 0.50 in the population?

Binomial:
n independent trials.
Each results in success (1) or failure (0).
p = probability of success, constant on all trials.
X = observed number of successes in n trials.

$\Pr\{X=r\} = \frac{n!}{r!\,(n-r)!}\, p^{r} (1-p)^{n-r}$, where $n! = n(n-1)(n-2)\cdots(1)$ and $0! = 1! = 1$.

p known: Pr{X} is the probability function.
p to be estimated and r known: Pr{X} is now a function of p and is known as the likelihood function L(p). Its logarithm is ln(L(p)).

Ex (r=2, n=10): $L(p) = 45\,p^{2}(1-p)^{8}$, maximum at p = 0.20.

2. Contingency Tables

Observed
              Purchase   No Purchase
Coupon            86          14
No Coupon         24          76

Expected (under H0: no coupon effect)
              Purchase   No Purchase
Coupon            55          45
No Coupon         55          45

Pearson Chi-square (k = 1 df): $\chi^2_k = \sum_{\text{all cells}} \frac{(O-E)^2}{E} = 77.65$
Compare to the 95% point of Chi-square with 1 df, $1.96^2 = 3.84$ (significant).

Likelihood
              Purchase     No Purchase
Coupon        $p_1^{86}$   $(1-p_1)^{14}$
No Coupon     $p_2^{24}$   $(1-p_2)^{76}$

Likelihood is $C\,p_1^{86}(1-p_1)^{14}\,p_2^{24}(1-p_2)^{76}$.
Max at p1 = 0.86, p2 = 0.24.
Max ln(L), apart from the constant ln(C), is 86ln(.86)+…+76ln(.76) = -95.6043.
Under H0: p1 = p2: Max at p1 = p2 = 0.55, (1-p1) = (1-p2) = 0.45.
Max ln(L) is 86ln(.55)+…+76ln(.45) = -137.628.
Likelihood ratio test (difference between -2 ln(L) values) = 2(137.628-95.6043) = 84.0468.
Close, but not the same as Pearson Chi-square (77.65).

Run Logistic_A.sas demo.

3. Logistic Regression

X = food storage temperature (degrees C)
Y = 1 if spoilage after 2 months, 0 otherwise

X:  -10    0    1    4    6
Y:    0    1    0    1    1

Regress Y on X? Problem: predicted probabilities can be > 1 or < 0.
Idea: convert p to a logit.
Logit = ln(p/(1-p)) = ln(odds)
Model: Logit = $\beta_0 + \beta_1 X$
$p = \exp(\text{Logit})/(1+\exp(\text{Logit})) = \exp(\beta_0 + \beta_1 X)/(1+\exp(\beta_0 + \beta_1 X))$

So, use $\exp(\beta_0 + \beta_1 X)/(1+\exp(\beta_0 + \beta_1 X))$ for p in the likelihood function (you know X), then find the betas that maximize this function. Equivalently, minimize -2 ln(likelihood).
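The maximization described above can be sketched in a few lines. The notes run the fit in SAS; what follows is a Python sketch instead, using a hand-written Newton-Raphson iteration on the two betas. The function names (inv_logit, neg2_log_lik, fit_logistic), the starting values of zero, and the stopping rule are illustrative assumptions, not something taken from the SAS demo. On the five data points it should land close to the estimates these notes report for the SAS fit.

```python
import math

# Data from the notes: storage temperature (degrees C) and spoilage indicator.
X = [-10, 0, 1, 4, 6]
Y = [0, 1, 0, 1, 1]


def inv_logit(z):
    """p = exp(z) / (1 + exp(z)), written in an equivalent form."""
    return 1.0 / (1.0 + math.exp(-z))


def neg2_log_lik(b0, b1, xs=X, ys=Y):
    """-2 ln(likelihood) for the model logit = b0 + b1 * x."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = inv_logit(b0 + b1 * x)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -2.0 * total


def fit_logistic(xs=X, ys=Y, max_iter=50, tol=1e-12):
    """Newton-Raphson on (b0, b1); start at zero (an assumed choice)."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        g0 = g1 = 0.0            # gradient of ln(likelihood)
        h00 = h01 = h11 = 0.0    # negative Hessian (information matrix)
        for x, y in zip(xs, ys):
            p = inv_logit(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p
            g1 += x * (y - p)
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Solve the 2x2 system (information) * step = gradient by hand.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1


b0_hat, b1_hat = fit_logistic()
print("b0 =", round(b0_hat, 4), " b1 =", round(b1_hat, 4))
print("-2 ln(L) at the maximum:", neg2_log_lik(b0_hat, b1_hat))
```

Evaluating neg2_log_lik over a grid of (b0, b1) values is one way to draw the kind of surface from which a confidence region can be cut.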
Any betas whose -2 ln(likelihood) differs from that of the maximum likelihood betas by an amount exceeding the Chi-square 95% point would be rejected in a 5% hypothesis test. Therefore, if we truncate our plot at the right point, we cut off the rejected set of betas and are left with an approximate 95% confidence region for the pair of betas.

Estimates: b1 = 0.5597, b0 = -0.0167

[Plot: horizontal axis marked at X = -10, 0, 1, 4, 6]

Run demo: Logistic_B.sas

    proc logistic data=logistic;
      model spoiled(event="1") = temperature / itprint ctable pprob=0.5;
    run;

Pairs: one 0 and one 1.
Concordant pairs: the actual 1 has a higher predicted probability than the actual 0 (i.e. the 1 is to the right of the 0, since the slope is positive).
Discordant pairs: the actual 0 has a higher predicted probability of being 1 than does the actual 1.
We have 2 0's and 3 1's, so 2x3 = 6 pairs. One of those 6 (circled: the 1 at X=0 and the 0 at X=1) is discordant and there are no ties, so 5/6 are concordant.

Association of Predicted Probabilities and Observed Responses
Percent Concordant    83.3     Somers' D    0.667
Percent Discordant    16.7     Gamma        0.667
Percent Tied           0.0     Tau-a        0.400
Pairs                    6     c            0.833

Cutoff probability 0.5 (pprob=0.5): classify any point with predicted probability higher than 0.5 as 1, others as 0. You will have some misclassifications.

Classification Table
              Correct          Incorrect                   Percentages
 Prob              Non-             Non-             Sensi-  Speci-  False  False
 Level   Event    Event   Event    Event   Correct   tivity  ficity    POS    NEG
 0.500       2        1       1        1      60.0     66.7    50.0   33.3   50.0

The split point falls between X=0 and X=1.
2 correct events, at X=4 and X=6.
1 correct non-event, at X=-10.
One incorrect event at X=1, one incorrect non-event at X=0.

Sensitivity: probability of saying a 1 is a 1. There are 3 1's and we got 2 of them, so 2/3.
Specificity: probability of calling a non-event a non-event: 1/2.
(denominators = numbers of actuals)
False positives: 1/3 of our classified events were non-events.
False negatives: 1/2 of our classified non-events were events.
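The classification counts and percentages can be re-derived by hand from the reported estimates. The Python sketch below (Python, not SAS) thresholds the fitted probabilities at 0.5, following the hand calculation in these notes; it makes no attempt to mirror whatever PROC LOGISTIC does internally. The names predicted_prob and classification_table and the dictionary keys are illustrative.

```python
import math

# Data and reported estimates from the notes.
X = [-10, 0, 1, 4, 6]
Y = [0, 1, 0, 1, 1]
B0, B1 = -0.0167, 0.5597


def predicted_prob(x, b0=B0, b1=B1):
    """p = exp(b0 + b1*x) / (1 + exp(b0 + b1*x))."""
    z = b0 + b1 * x
    return math.exp(z) / (1.0 + math.exp(z))


def classification_table(xs=X, ys=Y, cutoff=0.5):
    """Count correct/incorrect calls and the five percentages.

    Assumes at least one point is classified each way and at least
    one actual event and non-event exist (no zero denominators).
    """
    correct_event = correct_nonevent = 0
    incorrect_event = incorrect_nonevent = 0
    for x, y in zip(xs, ys):
        called_event = predicted_prob(x) > cutoff
        if called_event and y == 1:
            correct_event += 1
        elif called_event and y == 0:
            incorrect_event += 1      # called 1, actually 0
        elif y == 0:
            correct_nonevent += 1
        else:
            incorrect_nonevent += 1   # called 0, actually 1
    n = len(ys)
    return {
        "correct_event": correct_event,
        "correct_nonevent": correct_nonevent,
        "incorrect_event": incorrect_event,
        "incorrect_nonevent": incorrect_nonevent,
        "pct_correct": 100.0 * (correct_event + correct_nonevent) / n,
        # Denominators = numbers of actuals.
        "sensitivity": 100.0 * correct_event / (correct_event + incorrect_nonevent),
        "specificity": 100.0 * correct_nonevent / (correct_nonevent + incorrect_event),
        # Denominators = numbers of decisions.
        "false_pos": 100.0 * incorrect_event / (correct_event + incorrect_event),
        "false_neg": 100.0 * incorrect_nonevent / (correct_nonevent + incorrect_nonevent),
    }


table = classification_table()
for key, value in table.items():
    print(key, round(value, 1))
```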
(For false positives and false negatives, denominators = numbers of decisions.)

Odds Ratio:
Old Logit = $\beta_0 + \beta_1 X$
New Logit = $\beta_0 + \beta_1 (X+1)$
$\ln(p_{new}/(1-p_{new})) - \ln(p_{old}/(1-p_{old})) = \beta_1$
ln(new odds) - ln(old odds) = $\beta_1$
ln( (new odds)/(old odds) ) = $\beta_1$
odds ratio = $\exp(\beta_1)$

$e^{\beta} = 1 + \beta + \beta^2/2! + \beta^3/3! + \cdots$ (Taylor)
$e^{\beta}$ is approximately $1 + \beta$ when $\beta$ is small.

Other Stats (source: SAS online help)

The following statistics are all rank-based correlation statistics for assessing the predictive ability of a model
(nc = # concordant, nd = # discordant, N points, t pairs with different responses):

C (area under the ROC curve):  (nc + ½(# ties))/t
Somers' D:                     (nc-nd)/t
Kendall's Tau-a:               (nc-nd)/(½N(N-1))
Goodman-Kruskal Gamma:         (nc-nd)/
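As a check on the pair counting and on the rank statistics whose formulas are given in full above (c, Somers' D, Tau-a), here is a small Python sketch, again Python rather than SAS. It pairs every actual 1 with every actual 0, compares predicted probabilities computed from the reported estimates, and plugs the counts into those three formulas. Gamma is left out because its formula is cut off in these notes. The function names pair_counts and rank_statistics are illustrative.

```python
import math

# Data and reported estimates from the notes.
X = [-10, 0, 1, 4, 6]
Y = [0, 1, 0, 1, 1]
B0, B1 = -0.0167, 0.5597


def predicted_prob(x, b0=B0, b1=B1):
    z = b0 + b1 * x
    return math.exp(z) / (1.0 + math.exp(z))


def pair_counts(xs=X, ys=Y):
    """Return (nc, nd, ties) over all pairs of one actual 1 and one actual 0."""
    event_probs = [predicted_prob(x) for x, y in zip(xs, ys) if y == 1]
    nonevent_probs = [predicted_prob(x) for x, y in zip(xs, ys) if y == 0]
    nc = nd = ties = 0
    for p_event in event_probs:
        for p_nonevent in nonevent_probs:
            if p_event > p_nonevent:
                nc += 1      # concordant: the actual 1 ranks higher
            elif p_event < p_nonevent:
                nd += 1      # discordant: the actual 0 ranks higher
            else:
                ties += 1
    return nc, nd, ties


def rank_statistics(nc, nd, ties, n_points):
    """c, Somers' D and Tau-a from the formulas listed in the notes."""
    t = nc + nd + ties   # pairs with different responses
    return {
        "c": (nc + 0.5 * ties) / t,
        "somers_d": (nc - nd) / t,
        "tau_a": (nc - nd) / (0.5 * n_points * (n_points - 1)),
    }


nc, nd, ties = pair_counts()
stats = rank_statistics(nc, nd, ties, n_points=len(X))
print("nc, nd, ties:", nc, nd, ties)
for name, value in stats.items():
    print(name, round(value, 3))
```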


NCSU ST 610 - Count Data
