CS6375 Machine Learning
Simple Linear Regression, Logistic Regression
Instructor: Yang Liu
Spring 2015
Slides modified from Tom Mitchell and Paul Resnick

Slide 2: Regression Models
- Answer the question "what is the relationship between the variables?"
- One numerical dependent (response) variable: the quantity to be predicted.
- One or more numerical or categorical independent (explanatory) variables.
- Find a simple, convenient mathematical function to fit the data samples.

Slide 3: Types of Regression Models
- Regression models are categorized by the number of explanatory variables and the nature of the relationship:
  - Simple (1 explanatory variable) vs. Multiple (2+ explanatory variables)
  - Linear vs. Non-Linear

Slide 4: Linear Regression Model
- The relationship between the variables is a linear function, e.g., the relationship between income and education:
  $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$
  where $Y_i$ is the dependent (response) variable (e.g., income), $X_i$ is the independent (explanatory) variable (e.g., education), and $\varepsilon_i$ is the random error.

Slide 5: Scattergram
- Plot of all $(X_i, Y_i)$ pairs.
- Suggests how well the model will fit. [Scatter plot omitted]

Slide 6: Thinking Challenge
- How would you draw a line through the points?
- How do you determine which line "fits best"?

Slide 7: Least Squares
- "Best fit" means the difference between the actual Y values and the predicted Y values is minimized.
- LS minimizes the sum of the squared differences: $\sum_{i=1}^{n} \hat{\varepsilon}_i^2 = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$

Slide 8: Least Squares Graphically
- Each observation, e.g. the second: $Y_2 = \hat{\beta}_0 + \hat{\beta}_1 X_2 + \hat{\varepsilon}_2$
- Fitted line: $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$
- LS minimizes $\sum_{i=1}^{4} \hat{\varepsilon}_i^2 = \hat{\varepsilon}_1^2 + \hat{\varepsilon}_2^2 + \hat{\varepsilon}_3^2 + \hat{\varepsilon}_4^2$ (for the four points shown).

Slide 9: Derivation of Parameter Equations
- Goal: minimize the squared error.
- Set the partial derivative with respect to $\hat{\beta}_0$ to zero:
  $0 = \frac{\partial \sum \hat{\varepsilon}_i^2}{\partial \hat{\beta}_0} = \frac{\partial \sum (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2}{\partial \hat{\beta}_0} = \sum -2\,(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = -2\Big(\sum y_i - n\hat{\beta}_0 - \hat{\beta}_1 \sum x_i\Big)$
- Solving for $\hat{\beta}_0$: $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$

Slide 10: Derivation of Parameter Equations (cont.)
- Set the partial derivative with respect to $\hat{\beta}_1$ to zero:
  $0 = \frac{\partial \sum \hat{\varepsilon}_i^2}{\partial \hat{\beta}_1} = \frac{\partial \sum (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2}{\partial \hat{\beta}_1} = \sum -2\,x_i\,(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)$
- Substituting $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$ gives $\sum (x_i - \bar{x})(y_i - \bar{y}) = \hat{\beta}_1 \sum (x_i - \bar{x})^2$, so
  $\hat{\beta}_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \frac{SS_{xy}}{SS_{xx}}$

Slide 11: Coefficient Equations
- Sample slope: $\hat{\beta}_1 = \frac{SS_{xy}}{SS_{xx}} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$
- Sample Y-intercept: $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$
- Prediction equation: $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$

Slide 12: Interpretation of Coefficients
- Slope ($\hat{\beta}_1$): the estimated Y changes by $\hat{\beta}_1$ for each 1-unit increase in X. If $\hat{\beta}_1 = 2$, then sales (Y) is expected to increase by 2 for each 1-unit increase in advertising (X).
- Y-intercept ($\hat{\beta}_0$): the average value of Y when X = 0. If $\hat{\beta}_0 = 4$, then average sales (Y) is expected to be 4 when advertising (X) is 0.

Slide 13: Example: R&D and New Products
- How does investment in R&D affect the number of new products developed?
- We can postulate the following relation: # of new products = α + β · (investment in R&D) + u
- [Scatter plot of NEWPROD vs. RD omitted]

Slide 14: Example: R&D and New Products (cont.)
- The estimate is β̂ = 0.049.
- This tells us that in order to increase the number of new products by one unit, we need to invest a little more than 20 monetary units in R&D (1/0.049 ≈ 20.4).
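The coefficient equations above translate directly into code. Below is a minimal sketch in plain Python (not from the slides); the advertising/sales data is made up so that it lies exactly on the line y = 4 + 2x, matching the slope-2, intercept-4 example used to interpret the coefficients:

```python
def fit_simple_ols(x, y):
    """Least-squares estimates via the slide formulas:
    beta1_hat = SS_xy / SS_xx,  beta0_hat = y_bar - beta1_hat * x_bar."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta1 = ss_xy / ss_xx
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1

# Hypothetical advertising/sales data lying exactly on y = 4 + 2x,
# so the fit should recover intercept 4 and slope 2:
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [4.0, 6.0, 8.0, 10.0, 12.0]
beta0, beta1 = fit_simple_ols(x, y)
```

With noisy data the recovered coefficients would only approximate the true ones, but the formulas are identical.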
- If a company invests 1000 in R&D, we would predict this company to develop around 49 new products. [Scatter plot with fitted line omitted]

Slide 15: Logistic Regression
- It is actually a binary classifier.

Slide 16: Another Example: Failing or Passing an Exam
- Define a variable "Outcome": Outcome = 0 if the individual fails the exam, and Outcome = 1 if the individual passes.
- We can reasonably assume that failing or passing an exam depends on the number of hours spent studying.
- Note that in this case the dependent variable takes only two possible values; we call it a "dichotomous" variable.

Slide 17: Regression Analysis with Dichotomous Dependent Variables
- We are interested in inference about the probability of passing the exam.
- Were we to use linear regression, we would postulate: Prob(Outcome = 1) = α + β · (hours of study) + u
- We call this model a "Linear Probability Model" (LPM).

Slide 18: Linear Probability Models (LPM)
- Our dataset contains information about 14 students. Our statistical software will happily perform a linear regression of Outcome on the number of study hours.

  Student id | Outcome | Study hours
  ---------- | ------- | -----------
  1          | 0       | 3
  2          | 1       | 34
  3          | 0       | 17
  4          | 0       | 6
  5          | 0       | 12
  6          | 1       | 15
  7          | 1       | 26
  8          | 1       | 29
  9          | 0       | 14
  10         | 1       | 58
  11         | 0       | 2
  12         | 1       | 31
  13         | 1       | 26
  14         | 0       | 11

Slide 19: LPM: What Is Wrong with Them?
- A scatter plot with the regression line shows the problem: a straight line predicts values between negative and positive infinity, outside the [0, 1] interval! [Plot of OUTCOME vs. HSTUDY omitted]

Slide 20: Non-Linear Probability Models
- Goal: model the probability of the event occurring as a function of an explanatory variable X.
- The predicted probabilities need to lie in [0, 1].
- There is a threshold above which the probability hardly increases in response to changes in the explanatory variable.
- Many functions meet these requirements (non-linearity and being bounded within [0, 1]).
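One function that meets both requirements is the logistic (sigmoid) function. A quick numerical check, using arbitrary illustrative coefficients (not fitted to the student data):

```python
import math

def sigmoid(z):
    """Logistic function: maps a real z into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative (not fitted) intercept and slope for P(pass) vs. study hours:
alpha, beta = -3.0, 0.2

probs = [sigmoid(alpha + beta * h) for h in [0, 5, 15, 30, 100]]
# Unlike the linear probability model, predictions stay inside (0, 1)
# across the whole range of hours, and flatten out at both ends
# (the threshold behavior noted above).
assert all(0.0 < p < 1.0 for p in probs)
assert probs == sorted(probs)  # monotonically increasing in hours
```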
- We will focus on the logistic function.

Slide 21: Logistic Regression (Starting from Naïve Bayes)
- Consider learning f: X -> Y, where X is a vector of real-valued features <x1, ..., xn> and Y is boolean.
- We could use a Gaussian naïve Bayes classifier:
  - Assume all $x_i$ are conditionally independent given Y.
  - Model $P(x_i \mid Y = y_k)$ as Gaussian $N(\mu_{ik}, \sigma_i)$.
  - Model $P(Y)$ as Bernoulli($\pi$).
- What does that imply about the form of $P(Y \mid X)$?

Slides 22-24: [derivation of the form of P(Y|X); equations not recoverable from this extraction]

Slide 25: Training Logistic Regression: MCLE
- Choose parameters W = <w0, ..., wn> to maximize the conditional likelihood of the training data.
- Training data $D = \{\langle X^1, Y^1 \rangle, \ldots, \langle X^L, Y^L \rangle\}$
- Data likelihood $= \prod_{l} P(\langle X^l, Y^l \rangle \mid W)$
- Data conditional likelihood $= \prod_{l} P(Y^l \mid X^l, W)$

Slide 26: [equations not recoverable from this extraction]

Slide 27: Gradient Descent
- There is no closed-form solution that maximizes $l(W)$, so use gradient descent (gradient ascent on the log-likelihood).

Slide 28: [equations not recoverable from this extraction]

Slide 29: Logistic Regression vs. Naïve Bayes
- The functional form follows from the naïve Bayes assumptions.
- However, the training procedure picks the parameters without the conditional-independence assumption.
- Pick W to maximize $P(Y \mid X, W)$.

Slide 30: Generative vs. Discriminative Classifiers
- Generative (e.g., naïve Bayes):
  - Assume some functional form for P(X|Y) and P(Y); this is the "generative" model.
  - Estimate the parameters of P(X|Y) and P(Y) directly from the training data.
  - Use Bayes rule to calculate $P(Y \mid X = x_i)$.
- Discriminative (e.g., logistic regression):
  - Assume some functional form for P(Y|X).
  - Estimate the parameters of P(Y|X) directly from the training data.
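The MCLE training procedure described above (maximize the conditional log-likelihood by gradient steps) can be sketched for the single-feature study-hours example. This is an illustrative implementation, not the slides' code: the learning rate, iteration count, and the rescaling of hours by 1/10 (to keep the batch gradient steps well behaved) are all arbitrary choices; the data is the 14-student table from the LPM slide.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(xs, ys, lr=0.05, iters=10000):
    """Batch gradient ascent on the conditional log-likelihood
    l(w) = sum_l ln P(y_l | x_l, w), for one feature plus a bias w0."""
    w0, w1 = 0.0, 0.0
    for _ in range(iters):
        g0, g1 = 0.0, 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(w0 + w1 * x)  # residual y - P(Y=1|x,w)
            g0 += err                       # d l / d w0
            g1 += err * x                   # d l / d w1
        w0 += lr * g0
        w1 += lr * g1
    return w0, w1

# Study hours and pass/fail outcomes from the 14-student table;
# hours are rescaled by 1/10 before fitting.
hours = [3, 34, 17, 6, 12, 15, 26, 29, 14, 58, 2, 31, 26, 11]
passed = [0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0]
w0, w1 = train_logistic([h / 10.0 for h in hours], passed)

p_pass_34h = sigmoid(w0 + w1 * 3.4)  # predicted P(pass | 34 hours)
p_pass_3h = sigmoid(w0 + w1 * 0.3)   # predicted P(pass | 3 hours)
```

Unlike the LPM fit on the same data, both predictions are guaranteed to lie strictly inside (0, 1).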