**Unformatted text preview:**

'&$%Let’s talk about class syllabus!1'&$%CHAPTER 1 INTRODUCTION2CHAPTER 1 INTRODUCTION 3'&$%Statistical Learning• What is statistical learning?– machine learning, data mining– supervised vs unsupervisedCHAPTER 1 INTRODUCTION 4'&$%• How different from traditional inference?– different objectives– different statistical procedures– supervised learning < − − − > regression– unsupervised learning < −− > density estimationCHAPTER 1 INTRODUCTION 5'&$%High dimensional data• What does “high-dimension” mean?– relative to sample sizes– curse of dimensionality– possibly ultra-high: p = exp{O(n)}CHAPTER 1 INTRODUCTION 6'&$%• What can we do with “high-dimension” data?– two-stage procedure with dimension reduction– regularized procedureCHAPTER 1 INTRODUCTION 7'&$%Overview of this course– Learning methods– Learning theory– Methods for high-dimensional dataCHAPTER 1 INTRODUCTION 8'&$%Acknowledgments– data examples, figures from HTF book– learning theories extracted from DGL book– part for high-dimensional data relies on publishedreferences– errors are mineCHAPTER 1 INTRODUCTION 9'&$%CHAPTER 2 STATISTICAL DECISIONTHEORYCHAPTER 2 STATISTICAL DECISION THEORY 10'&$%Set-up in decision theory– X: feature variables– Y : outcome variable (continuous, categorical, ordinal)– (X, Y ) follows some distribution– goal: determine f : X → Y to minimize some lossE[L(Y, f(X))].CHAPTER 2 STATISTICAL DECISION THEORY 11'&$%Loss function L(y, x)– squared loss: L(y, x) = (y − x)2– absolute deviation loss: L(y, x) = |y − x|– Huber loss: L(y, x) = (y − x)2I(|y − x| <δ) + (2δ|y − x| − δ2)I(|y − x| ≥ δ)– zero-one loss: L(y, x) = I(y 6= x)– preference loss:L(y1, y2, x1, x2) = 1 − I(y1< y2, x1< x2)CHAPTER 2 STATISTICAL DECISION THEORY 12'&$%−2 −1 0 1 20 1 2 3 4xloss functionsCHAPTER 2 STATISTICAL DECISION THEORY 13'&$%Optimal f(x)– squared loss: f(X) = E[Y |X]– absolute deviation loss: f(X) = med(Y |X)– Huber loss: ???– zero-one loss: f(X) = argmaxkP (Y = k|X)– preference loss: ???– not all loss functions have explicit solutionsCHAPTER 2 STATISTICAL DECISION THEORY 14'&$%Bayes Error/Risk– Y is binary (0,1)– f(X) = argmaxkP (Y = k|X), the category with> 1/2 probability– the optimal lossE[I(Y 6= f(X))] = E [min(η(X), 1 − η(X))]=12−12E [|2η(X) − 1|] ,where η( X) = E[Y = 1 |X]CHAPTER 2 STATISTICAL DECISION THEORY 15'&$%Direct learning to find optimal decision rule– Empirical data(Xi, Yi), i = 1, ..., n– Direct learning estimates f directly via parametric,semi-parametric, or nonparametric methods– useful if we know the explicit solution of fCHAPTER 2 STATISTICAL DECISION THEORY 16'&$%Indirect learning to find optimal decision rule– Indirect learning estimates f by minimizing(empirical risk)nXi=1L(Yi, f(Xi))– called empirical risk minimization or M-estimation– necessary when we don’t know the explicit solution offCHAPTER 2 STATISTICAL DECISION THEORY 17'&$%Candidate sets for f(x)– if too small: underfit data (lead to bias)– if too large: overfit data (inflated variability)CHAPTER 2 STATISTICAL DECISION THEORY 18'&$%High-dimensional issue– data are sparse (see HTF book 22-25)– local approximation is infeasible– increasing bias and variability with dimensionality– curse of dimensionalityCHAPTER 2 STATISTICAL DECISION THEORY 19'&$%Common considerations for f(x)– Structured estimationlinear functions or local linear functions– Sieve estimationlinear combination of basis function: polynomials,splines, wavelets– Regularized/Penalized estimationlet data choose f by penalizing f from roughnessCHAPTER 2 STATISTICAL DECISION THEORY 20'&$%CHAPTER 3 DIRECT LEARNING:PARAMETRIC APPROACHESCHAPTER 3 PARAMETRIC LEARNING 21'&$%Parametric learning– It is one of direct learning methods.– Estimate f(x) using parametric models.– Linear models are often used.CHAPTER 3 PARAMETRIC LEARNING 22'&$%Linear regression model– Target squared loss or zero-one loss.– Assume f(X) = E[Y |X] = XTβ.– The least squared estimationˆf(x) = xT(XTX)−1XTY.CHAPTER 3 PARAMETRIC LEARNING 23'&$%Shrinkage methods Why shrinkage?– Gain variability reduction by sacrificing predictionaccuracy.– Help to determine important features (variableselection) if any.– Include subset selection, ridge regression, LASSO andet.CHAPTER 3 PARAMETRIC LEARNING 24'&$%Subset selection– Search for the best subset of size k in terms of RSS.– Use leaps and bounds procedure.– Computationally intensive with large dimension.– The best choice of size k is based on Mallow’s CPDetailsCHAPTER 3 PARAMETRIC LEARNING 25'&$%Ridge regression– MinimizenXi=1(Yi− XTiβ)2+ λpXj=1β2j.– Equivalently, minimizenXi=1(Yi− XTiβ)2, subject topXj=1β2j≤ s.– The solutionˆβ = (XTX + λI)−1XTY.– Has Bayesian interpretation.– Shrinkage is uniform for all β’s.CHAPTER 3 PARAMETRIC LEARNING 26'&$%LASSO– MinimizenXi=1(Yi− XTiβ)2+ λpXj=1|βj|.– Equivalently, minimizenXi=1(Yi− XTiβ)2, subject topXj=1|βj| ≤ s.– This is a convex optimization.– Suppose X to have independent columns:ˆβj= sign(ˆβlse)(|ˆβlse| − λ/2)+.– Nonlinear shrinkage property.CHAPTER 3 PARAMETRIC LEARNING 27'&$%Summary– Subset selection is L0-penalty shrinkage butcomputationally intensive.– Ridge regression is L2-penalty shrinkage and shrinksall coefficients the same way.– LASSO is L1-penalty shrinkage and it is a nonlinearshrinkage.CHAPTER 3 PARAMETRIC LEARNING 28'&$%One data example– Data link:http://www-stat.stanford.edu/ hastie/Papers/LARS/diabetes.data– Compare subset selection, ridge regression andLASSOCHAPTER 3 PARAMETRIC LEARNING 29'&$%Other shrinkage methods– Lq-penalty with q ∈ [1, 2]:nXi=1(Yi− XTiβ)2+ λpXj=1|βj|q.– Weighted LASSO (aLASSO):nXi=1(Yi− XTiβ)2+ λpXj=1wj|βj|where wj= |ˆβlse|−q.– SCAD penaltyPpj=1Jλ(|βj|):J0λ(x) = λ(I(x ≤ λ) +(aλ − x)+(a − 1)λI(x > λ)).CHAPTER 3 PARAMETRIC LEARNING 30'&$%−10 −5 0 5 100 2 4 6 8 10(a) Hard thresholdBeta coeffectPenalized coeffHard−threshold−10 −5 0 5 100 2 4 6 8 10(b) Adaptive LASSOBeta coeffectPenalized coeffWeighted L_1 with alpha=3−10 −5 0 5 100 2 4 6 8 10(c) SCADBeta coeffectPenalized coeffSCADCHAPTER 3 PARAMETRIC LEARNING 31'&$%Compare different penalties– All penalties have shrinkage properties.– Some penalties give an oracle property as if the truezeros are known (aLASSO, SCAD).– But aLASSO needs

View Full Document