Linear Regression
Aarti Singh
Machine Learning 10-701/15-781
Sept 27, 2010

Discrete to Continuous Labels
- Classification: X = Document, Y = Topic (Sports / Science / News); X = Cell Image, Y = Diagnosis (Anemic cell / Healthy cell)
- Regression: Stock Market Prediction, e.g. X = Feb 01, Y = ? (a real-valued price)

Regression Tasks
- Weather Prediction: X = 7 pm measurement, Y = Temperature
- Estimating Contamination: X = new location, Y = sensor reading

Supervised Learning
- Classification goal: minimize the Probability of Error
- Regression goal: minimize the Mean Squared Error (MSE)

Regression: Optimal Predictor
Under squared loss the optimal predictor is the conditional mean:
  f^*(x) = E[Y | X = x]
Intuition: signal plus (zero-mean) noise model, Y = f(X) + \epsilon with E[\epsilon | X] = 0, so the conditional mean recovers the signal f.
Proof strategy (dropping subscripts for notational convenience): for any predictor f,
  E[(Y - f(X))^2] = E[(Y - E[Y|X])^2] + E[(E[Y|X] - f(X))^2] \ge E[(Y - E[Y|X])^2],
since the cross term vanishes after conditioning on X. Hence f^*(X) = E[Y|X].
However, the optimal predictor depends on the unknown distribution P(X, Y), so it must be learned from data.

Regression Algorithms
A learning algorithm maps training data to a predictor. Examples:
- Linear Regression
- Lasso, Ridge regression (Regularized Linear Regression)
- Nonlinear Regression
- Kernel Regression
- Regression Trees, Splines, Wavelet estimators, ...
Many of these are instances of Empirical Risk Minimization (ERM). More later...

Empirical Risk Minimization (ERM)
Optimal predictor: f^* = \arg\min_f E[(Y - f(X))^2]
Empirical Risk Minimizer: \hat{f}_n = \arg\min_{f \in F} \frac{1}{n} \sum_{i=1}^n (Y_i - f(X_i))^2
Here F is the class of predictors, and the empirical mean approximates the true risk by the Law of Large Numbers.

ERM: you saw it before!
Learning distributions by maximum likelihood is ERM: max likelihood = min negative log likelihood, which is an empirical risk. What is the class F? A class of parametric distributions, e.g. Bernoulli(\theta) or Gaussian(\mu, \sigma^2).

Linear Regression
F = class of linear functions.
Univariate case: f(X) = \beta_1 + \beta_2 X, where \beta_1 is the intercept and \beta_2 is the slope.
Multivariate case: f(X) = X\beta, where X = [1, X^{(1)}, ..., X^{(p-1)}] and \beta = [\beta_1, ..., \beta_p]^T (the leading 1 absorbs the intercept).

Least Squares Estimator
  \hat{\beta} = \arg\min_\beta \sum_{i=1}^n (Y_i - X_i\beta)^2 = \arg\min_\beta (Y - A\beta)^T (Y - A\beta),
where A is the n x p matrix whose i-th row is X_i and Y is the n x 1 vector of labels.

Normal Equations
Setting the gradient of J(\beta) = (Y - A\beta)^T (Y - A\beta) to zero gives the normal equations
  (A^T A)\,\beta = A^T Y        with dimensions (p x p)(p x 1) = (p x 1).
If A^T A is invertible,
  \hat{\beta} = (A^T A)^{-1} A^T Y.
When is A^T A invertible? Recall: full-rank matrices are invertible. What is the rank of A^T A? It equals the rank of A, so A must have rank p. What if A^T A is not invertible? Regularization (later).

Geometric Interpretation
A\hat{\beta} is the orthogonal projection of Y onto the linear subspace spanned by the columns of A. The difference in prediction on the training set is orthogonal to that subspace:
  A^T (Y - A\hat{\beta}) = 0.

Revisiting Gradient Descent
Even when A^T A is invertible, solving the normal equations can be computationally expensive if A is huge.
- Initialize: \beta^0 (e.g. 0)
- Update: \beta^{t+1} = \beta^t - \alpha \nabla J(\beta^t) = \beta^t + 2\alpha A^T (Y - A\beta^t)
  (the update is 0 exactly when \nabla J(\beta^t) = 0, i.e. at the least squares solution)
- Stop: when some criterion is met, e.g. a fixed number of iterations, or \|\nabla J(\beta^t)\| < \epsilon.
Gradient descent reaches the global minimum since J(\beta) is convex.

Effect of step-size \alpha
- Large \alpha: fast convergence but larger residual error; oscillations are also possible.
- Small \alpha: slow convergence but small residual error.

Least Squares and MLE
Intuition: signal plus (zero-mean Gaussian) noise model, Y = X\beta + \epsilon with \epsilon ~ N(0, \sigma^2). The log likelihood is
  \log P(Y_1, ..., Y_n | X_1, ..., X_n, \beta) = const - \frac{1}{2\sigma^2} \sum_i (Y_i - X_i\beta)^2,
so the Least Squares Estimate is the same as the Maximum Likelihood Estimate under a Gaussian noise model!

Regularized Least Squares and MAP
What if A^T A is not invertible? Take a MAP point of view and maximize log likelihood + log prior.
I) Gaussian Prior: a prior belief that \beta is Gaussian with zero mean biases the solution toward "small" \beta. This gives Ridge Regression:
  \hat{\beta}_{MAP} = \arg\min_\beta (Y - A\beta)^T (Y - A\beta) + \lambda \|\beta\|_2^2,
with closed form (derivation left as HW)
  \hat{\beta}_{MAP} = (A^T A + \lambda I)^{-1} A^T Y.
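To make the normal equations, the gradient-descent update, and the ridge closed form concrete, here is a minimal numpy sketch. It is not from the slides: the function names, step size, and toy data are my own choices, and it assumes the design matrix A already contains a leading column of 1s for the intercept, as in the multivariate setup above.

```python
import numpy as np

def least_squares(A, y):
    """Least squares via the normal equations: solve (A^T A) beta = A^T y."""
    return np.linalg.solve(A.T @ A, A.T @ y)

def ridge(A, y, lam):
    """Ridge regression closed form: (A^T A + lam * I)^{-1} A^T y."""
    return np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ y)

def least_squares_gd(A, y, alpha=1e-3, iters=10_000, tol=1e-8):
    """Gradient descent on J(beta) = ||y - A beta||^2."""
    beta = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = -2 * A.T @ (y - A @ beta)   # gradient of J at the current beta
        if np.linalg.norm(grad) < tol:     # stop when the gradient is (nearly) zero
            break
        beta -= alpha * grad
    return beta

# Toy usage: recover beta = [1, 2] from noisy univariate data with an intercept column.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
A = np.column_stack([np.ones_like(x), x])            # leading column of 1s = intercept
y = 1.0 + 2.0 * x + 0.1 * rng.standard_normal(100)   # signal plus zero-mean noise
print(least_squares(A, y), ridge(A, y, lam=0.1), least_squares_gd(A, y), sep="\n")
```

All three estimates should nearly coincide here; ridge shrinks the coefficients slightly toward zero, and gradient descent trades the matrix solve for many cheap updates.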
Regularized Least Squares and MAP (continued)
What if A^T A is not invertible? Again maximize log likelihood + log prior.
II) Laplace Prior: a prior belief that \beta is Laplace with zero mean biases the solution toward "small" \beta. This gives the Lasso:
  \hat{\beta}_{MAP} = \arg\min_\beta (Y - A\beta)^T (Y - A\beta) + \lambda \|\beta\|_1.

Ridge Regression vs Lasso
- Ridge Regression: l2 penalty \lambda \|\beta\|_2^2.
- Lasso: l1 penalty \lambda \|\beta\|_1.
The Lasso (l1 penalty) results in sparse solutions, i.e. a vector with more zero coordinates. This is good for high-dimensional problems: you don't have to store all coordinates!
[Figure: level sets of J(\beta) in the (\beta_1, \beta_2) plane, shown against the sets of \beta with constant l1, l2, and l0 norm; the level sets typically touch the l1 ball at a corner, where a coordinate is exactly zero, which is why the Lasso solution is sparse.]
Ideally we would use an l0 penalty (the number of nonzero coordinates), but the optimization becomes non-convex; the l1 penalty is its convex surrogate and a HOT research topic. A coordinate-descent sketch for the Lasso appears at the end of these notes.

Beyond Linear Regression
- Polynomial regression
- Regression with nonlinear features / basis functions
- Kernel regression: local / weighted regression
- Regression trees: spatially adaptive regression

Polynomial Regression
Univariate (1-d) case:
  f(X) = \beta_1 + \beta_2 X + \beta_3 X^2 + ... + \beta_m X^{m-1},
where the nonlinear features are [1, X, X^2, ..., X^{m-1}] and each \beta_i is the weight of its feature. The model is still linear in \beta, so the least squares machinery above applies unchanged (see the feature-construction sketch at the end of these notes).
Interactive demo: http://mste.illinois.edu/users/exner/java.f/leastsquares/

Nonlinear Regression
More generally, regress onto nonlinear features / basis functions with basis coefficients \beta:
- Fourier basis: a good representation for oscillatory functions.
- Wavelet basis: a good representation for functions localized at multiple scales.

Local Regression
Globally supported basis functions (polynomial, Fourier) will not yield a good representation for functions with local structure; this motivates local regression with locally supported basis functions and their basis coefficients.

What you should know
- Linear Regression
  - Least Squares Estimator
  - Normal Equations
  - Gradient Descent
  - Geometric and Probabilistic Interpretation (connection to MLE)
- Regularized Linear Regression (connection to MAP)
  - Ridge Regression, Lasso
- Polynomial Regression, Basis (Fourier, Wavelet) Estimators

Next time
- Kernel Regression (Localized)
- Regression
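The Lasso has no closed-form solution, but coordinate descent with soft-thresholding is a standard way to solve it. The sketch below is my own illustration of that technique, not material from the slides (function names, the choice of \lambda, and the toy data are assumptions); it minimizes the objective ||Y - A\beta||^2 + \lambda ||\beta||_1 written above.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(A, y, lam, iters=200):
    """Coordinate descent for the Lasso objective ||y - A beta||^2 + lam * ||beta||_1."""
    n, p = A.shape
    beta = np.zeros(p)
    col_sq = (A ** 2).sum(axis=0)                # ||A_j||^2 for each column
    for _ in range(iters):
        for j in range(p):
            # Partial residual: leave coordinate j out of the current fit.
            r = y - A @ beta + A[:, j] * beta[j]
            # Exact minimizer over beta_j alone: soft-thresholded correlation.
            beta[j] = soft_threshold(A[:, j] @ r, lam / 2.0) / col_sq[j]
    return beta

# Toy usage: 10 features, only 2 truly nonzero; the Lasso zeroes out most of the rest.
rng = np.random.default_rng(1)
A = rng.standard_normal((100, 10))
beta_true = np.zeros(10)
beta_true[0], beta_true[3] = 3.0, -2.0
y = A @ beta_true + 0.1 * rng.standard_normal(100)
print(np.round(lasso_cd(A, y, lam=10.0), 2))
```

The printed coefficient vector should be sparse, with the two true coordinates recovered (slightly shrunk toward zero) and most others exactly zero, illustrating the corner-touching picture above.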
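Finally, the feature-construction sketch referenced from the Polynomial Regression section: polynomial regression is just least squares on expanded nonlinear features. This is my own illustration (degree, data, and names are assumptions, not from the slides).

```python
import numpy as np

def poly_features(x, m):
    """Map scalar inputs to the nonlinear features [1, X, X^2, ..., X^(m-1)]."""
    return np.vander(x, N=m, increasing=True)    # n x m design matrix

# Toy usage: fit a cubic to noisy samples of a nonlinear function by ordinary
# least squares on the expanded features (the model is still linear in beta).
rng = np.random.default_rng(2)
x = np.linspace(-1, 1, 50)
y = np.sin(3 * x) + 0.1 * rng.standard_normal(50)
A = poly_features(x, m=4)                        # features 1, X, X^2, X^3
beta, *_ = np.linalg.lstsq(A, y, rcond=None)     # weight of each feature
print(np.round(beta, 3))
```

The same idea carries over to Fourier or wavelet features: only the feature map changes, while the least squares (or ridge/Lasso) fit in \beta stays the same.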