Multiple Linear Regression
Dr. İbrahim Çapar, Assistant Professor
DATA MINING

Learning objectives
- Multiple linear regression
- Explanatory vs. predictive modeling with regression
- Assessing predictive accuracy
- Selecting a subset of predictors

Explanatory Modeling
- Goal: explain the relationship between the predictors (explanatory variables) and the target
- The familiar use of regression in data analysis
- Model goal: fit the data well and understand the contribution of the explanatory variables to the model
- "Goodness of fit": R², residual analysis, p-values

Predictive Modeling
- Goal: predict target values in other data where we have predictor values but not target values
- The classic data mining context
- Model goal: optimize predictive accuracy
- Train the model on training data
- Assess performance on validation (hold-out) data
- Explaining the role of the predictors is not the primary purpose (but it is still useful)

An Example: Toyota Corolla
- Price: sales price in euros
- Age_08_04: age in months as of 8/04
- KM: odometer reading in kilometers
- Fuel_Type: diesel, petrol, or compressed natural gas (CNG)
- HP: horsepower
- Met_color: metallic color (1 = yes, 0 = no)
- Automatic: automatic transmission (1 = yes, 0 = no)
- CC: cylinder volume
- Doors: number of doors
- Quarterly_Tax: road tax amount
- Weight: weight of the car in kg

Dummy (Indicator) Variables
- A dummy (indicator) variable indicates whether a characteristic is present or not:
  D = 1 if the observation has the attribute, D = 0 if it does not.
- In general, if a categorical variable has m levels, use m - 1 dummy (indicator) variables.
- Example: Toyota Corolla, Fuel_Type with levels Diesel, Petrol, and CNG:

  Fuel_Type   Fuel_Type_Diesel   Fuel_Type_Petrol
  Diesel      1                  0
  Petrol      0                  1
  CNG         0                  0

Result: Toyota Corolla
- Explanatory modeling result:
  R² = 88.3%, adjusted R² = 88.1%
  Selected coefficients: Age_08_04 = -120.98, Fuel_Type_Diesel = 2700.58
- Predictive modeling result, by dataset:

  Metric   Training     Validation
  MSE      1,612,793    5,567,692
  R²       86.72%       58.84%

Selecting Subsets of Predictors
- Goal: find a parsimonious model (the simplest model that performs sufficiently well)
  - More robust
  - Higher predictive accuracy
- Approaches: exhaustive search; partial search algorithms (forward, backward, stepwise)

Exhaustive Search
- All possible subsets of predictors are assessed (singles, pairs, triplets, etc.)
- Computationally intensive
- Judge subsets by adjusted R²:
  R²_adj = 1 - ((n - 1) / (n - p - 1)) * (1 - R²),
  where n is the number of observations and p is the number of predictors.

Forward Selection
- Start with no predictors
- Add predictors one by one (at each step, add the one with the largest contribution)
- Stop when the next addition is not statistically significant

Backward Elimination
- Start with all predictors
- Successively eliminate the least useful predictor, one at a time
- Stop when all remaining predictors have statistically significant contributions

Stepwise
- Like forward selection, except at each step also consider dropping predictors that are no longer significant

Python Implementation
- The term "feature" is commonly used instead of predictor or independent variable.
- Scikit-learn (one of the most widely used machine learning packages) does not support the stepwise selection method.
- Instead of "train and validation" datasets, scikit-learn uses the terms "train and test" datasets. From our point of view, the test dataset is identical to the validation dataset.
- Illustrative code sketches for these steps follow below.
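The sketches below are illustrative only; the data frames, file names, and parameter values are assumptions based on the variable list above, not code from the lecture. First, a minimal sketch of producing m - 1 dummy variables for Fuel_Type with pandas, where drop_first makes CNG the reference level, matching the table above:

```python
import pandas as pd

# Hypothetical toy data containing the three Fuel_Type levels from the example
df = pd.DataFrame({"Fuel_Type": ["Diesel", "Petrol", "CNG", "Petrol"]})

# drop_first=True keeps m - 1 = 2 dummies; the dropped level (CNG) is the reference
dummies = pd.get_dummies(df["Fuel_Type"], prefix="Fuel_Type", drop_first=True)
print(dummies)  # columns: Fuel_Type_Diesel, Fuel_Type_Petrol
```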
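Next, a sketch of the predictive-modeling workflow: split the data into training and validation sets, fit a linear regression, and report MSE and R² on both. The file name ToyotaCorolla.csv, the split proportion, and the random seed are assumptions, so the exact figures in the table above will generally not be reproduced:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Hypothetical file; column names follow the slide and may differ in the actual file
df = pd.read_csv("ToyotaCorolla.csv")
predictors = ["Age_08_04", "KM", "Fuel_Type", "HP", "Met_color",
              "Automatic", "CC", "Doors", "Quarterly_Tax", "Weight"]
X = pd.get_dummies(df[predictors], drop_first=True)  # m - 1 dummies for Fuel_Type
y = df["Price"]

# Scikit-learn calls the hold-out set "test"; here it plays the role of validation
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.4, random_state=1)

model = LinearRegression().fit(X_train, y_train)

for label, X_part, y_part in [("Training", X_train, y_train),
                              ("Validation", X_valid, y_valid)]:
    pred = model.predict(X_part)
    print(f"{label}: MSE = {mean_squared_error(y_part, pred):,.0f}, "
          f"R2 = {r2_score(y_part, pred):.4f}")
```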
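The adjusted R² formula used to judge subsets in exhaustive search can be written as a small helper; the example values passed in below are hypothetical:

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared: 1 - (n - 1) / (n - p - 1) * (1 - R^2),
    with n observations and p predictors."""
    return 1 - (n - 1) / (n - p - 1) * (1 - r2)

# Hypothetical example: R^2 of 0.88 with 1,000 observations and 11 predictors
print(adjusted_r2(0.88, n=1000, p=11))
```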
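Finally, although scikit-learn has no stepwise procedure, its SequentialFeatureSelector provides forward and backward search driven by a cross-validated score rather than the p-value stopping rules described above, so it is a related but not identical procedure. A sketch, continuing with X_train and y_train from the split above and an assumed target of 5 features:

```python
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Greedy forward search: at each step, add the feature that most improves the CV score;
# direction="backward" would instead start from all features and remove them one by one
sfs = SequentialFeatureSelector(LinearRegression(), n_features_to_select=5,
                                direction="forward", cv=5)
sfs.fit(X_train, y_train)
print("Selected features:", list(X_train.columns[sfs.get_support()]))
```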