Let’s talk about class syllabus!

CHAPTER 1 INTRODUCTION

Statistical Learning
• What is statistical learning?
– machine learning, data mining
– supervised vs unsupervised
• How different from traditional inference?
– different objectives
– different statistical procedures
– supervised learning <-> regression
– unsupervised learning <-> density estimation

High dimensional data
• What does “high-dimension” mean?
– relative to sample sizes
– curse of dimensionality
– possibly ultra-high: $p = \exp\{O(n)\}$
• What can we do with “high-dimension” data?
– two-stage procedure with dimension reduction
– regularized procedure

Overview of this course
– Learning methods
– Learning theory
– Methods for high-dimensional data

Acknowledgments
– data examples, figures from HTF book
– learning theories extracted from DGL book
– part for high-dimensional data relies on published references
– errors are mine

CHAPTER 2 STATISTICAL DECISION THEORY

Set-up in decision theory
– $X$: feature variables
– $Y$: outcome variable (continuous, categorical, ordinal)
– $(X, Y)$ follows some distribution
– goal: determine $f: X \to Y$ to minimize some expected loss $E[L(Y, f(X))]$.

Loss function $L(y, x)$
– squared loss: $L(y, x) = (y - x)^2$
– absolute deviation loss: $L(y, x) = |y - x|$
– Huber loss: $L(y, x) = (y - x)^2 I(|y - x| < \delta) + (2\delta|y - x| - \delta^2) I(|y - x| \ge \delta)$
– zero-one loss: $L(y, x) = I(y \ne x)$
– preference loss: $L(y_1, y_2, x_1, x_2) = 1 - I(y_1 < y_2, x_1 < x_2)$

[Figure: the loss functions plotted against x]

Optimal $f(x)$
– squared loss: $f(X) = E[Y \mid X]$
– absolute deviation loss: $f(X) = \mathrm{med}(Y \mid X)$
– Huber loss: ???
– zero-one loss: $f(X) = \arg\max_k P(Y = k \mid X)$
– preference loss: ???
– not all loss functions have explicit solutions

Bayes Error/Risk
– $Y$ is binary (0, 1)
– $f(X) = \arg\max_k P(Y = k \mid X)$, the category with probability $> 1/2$
– the optimal loss is
$$E[I(Y \ne f(X))] = E[\min(\eta(X), 1 - \eta(X))] = \frac{1}{2} - \frac{1}{2} E[|2\eta(X) - 1|],$$
where $\eta(X) = P(Y = 1 \mid X)$

Direct learning to find optimal decision rule
– Empirical data $(X_i, Y_i)$, $i = 1, ..., n$
– Direct learning estimates $f$ directly via parametric, semi-parametric, or nonparametric methods
– useful if we know the explicit solution of $f$

Indirect learning to find optimal decision rule
– Indirect learning estimates $f$ by minimizing the empirical risk
$$\sum_{i=1}^{n} L(Y_i, f(X_i))$$
– called empirical risk minimization or M-estimation
– necessary when we don’t know the explicit solution of $f$

Candidate sets for $f(x)$
– if too small: underfit the data (leads to bias)
– if too large: overfit the data (inflated variability)

High-dimensional issue
– data are sparse (see HTF book 22-25)
– local approximation is infeasible
– increasing bias and variability with dimensionality
– curse of dimensionality

Common considerations for $f(x)$
– Structured estimation: linear functions or local linear functions
– Sieve estimation: linear combination of basis functions (polynomials, splines, wavelets)
– Regularized/Penalized estimation: let the data choose $f$ by penalizing the roughness of $f$

CHAPTER 3 DIRECT LEARNING: PARAMETRIC APPROACHES

Parametric learning
– It is one of the direct learning methods.
– Estimate $f(x)$ using parametric models.
– Linear models are often used.

Linear regression model
– Target squared loss or zero-one loss.
– Assume $f(X) = E[Y \mid X] = X^T\beta$.
– The least squares estimate: $\hat f(x) = x^T (X^T X)^{-1} X^T Y$.

Shrinkage methods
Why shrinkage?
– Reduce variability at the cost of some bias, which can improve overall prediction accuracy.
– Help to determine important features (variable selection), if any.
– Include subset selection, ridge regression, LASSO, etc.

Subset selection
– Search for the best subset of size $k$ in terms of RSS.
– Use the leaps and bounds procedure.
– Computationally intensive with large dimension.
– The best choice of size $k$ is based on Mallows’ $C_p$.

Ridge regression
– Minimize
$$\sum_{i=1}^{n} (Y_i - X_i^T\beta)^2 + \lambda \sum_{j=1}^{p} \beta_j^2.$$
– Equivalently, minimize $\sum_{i=1}^{n} (Y_i - X_i^T\beta)^2$ subject to $\sum_{j=1}^{p} \beta_j^2 \le s$.
– The solution: $\hat\beta = (X^T X + \lambda I)^{-1} X^T Y$.
– Has a Bayesian interpretation.
– Shrinkage is uniform for all $\beta$’s.

LASSO
– Minimize
$$\sum_{i=1}^{n} (Y_i - X_i^T\beta)^2 + \lambda \sum_{j=1}^{p} |\beta_j|.$$
– Equivalently, minimize $\sum_{i=1}^{n} (Y_i - X_i^T\beta)^2$ subject to $\sum_{j=1}^{p} |\beta_j| \le s$.
– This is a convex optimization.
– Suppose $X$ has orthonormal columns ($X^T X = I$): $\hat\beta_j = \mathrm{sign}(\hat\beta_j^{lse})(|\hat\beta_j^{lse}| - \lambda/2)_+$.
– Nonlinear shrinkage property.

Summary
– Subset selection is $L_0$-penalty shrinkage but computationally intensive.
– Ridge regression is $L_2$-penalty shrinkage and shrinks all coefficients the same way.
– LASSO is $L_1$-penalty shrinkage and it is a nonlinear shrinkage.

One data example
– Data link: hastie/Papers/LARS/
– Compare subset selection, ridge regression and LASSO

Other shrinkage methods
– $L_q$-penalty with $q \in [1, 2]$:
$$\sum_{i=1}^{n} (Y_i - X_i^T\beta)^2 + \lambda \sum_{j=1}^{p} |\beta_j|^q.$$
– Weighted LASSO (aLASSO):
$$\sum_{i=1}^{n} (Y_i - X_i^T\beta)^2 + \lambda \sum_{j=1}^{p} w_j |\beta_j|,$$
where $w_j = |\hat\beta_j^{lse}|^{-q}$.
– SCAD penalty $\sum_{j=1}^{p} J_\lambda(|\beta_j|)$:
$$J'_\lambda(x) = \lambda \left( I(x \le \lambda) + \frac{(a\lambda - x)_+}{(a - 1)\lambda} I(x > \lambda) \right).$$

[Figure: penalized coefficient plotted against beta coefficient in three panels: (a) hard threshold, (b) adaptive LASSO (weighted L_1 with alpha=3), (c) SCAD]

Compare different penalties
– All penalties have shrinkage properties.
– Some penalties give an oracle property as if the true zeros are known (aLASSO, SCAD).
– But aLASSO needs
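
The "Optimal f(x)" slide in Chapter 2 states that squared loss is minimized by the conditional mean and absolute deviation loss by the conditional median. A minimal sketch of that claim, ignoring X and searching over constant predictions on a grid, is given below. The sample values, the grid, the choice delta = 1 and all function names are illustrative assumptions, not part of the notes.

```python
from statistics import mean, median


def squared_loss(y, x):
    return (y - x) ** 2


def absolute_loss(y, x):
    return abs(y - x)


def huber_loss(y, x, delta=1.0):
    # Huber loss as written on the slide: quadratic near zero, linear in the tails.
    r = abs(y - x)
    return r ** 2 if r < delta else 2 * delta * r - delta ** 2


def empirical_risk_minimizer(loss, ys, candidates):
    # Empirical risk minimization over a finite set of constant predictions.
    return min(candidates, key=lambda c: sum(loss(y, c) for y in ys))


# Hypothetical sample with one large outlier.
ys = [1.0, 2.0, 3.0, 4.0, 100.0]
grid = [i * 0.5 for i in range(201)]  # candidate constants 0.0, 0.5, ..., 100.0

best_squared = empirical_risk_minimizer(squared_loss, ys, grid)
best_absolute = empirical_risk_minimizer(absolute_loss, ys, grid)
best_huber = empirical_risk_minimizer(huber_loss, ys, grid)

print("squared loss minimizer :", best_squared, "| sample mean  :", mean(ys))
print("absolute loss minimizer:", best_absolute, "| sample median:", median(ys))
print("Huber loss minimizer   :", best_huber)
```

The outlier pulls the squared-loss minimizer far from the bulk of the data, while the absolute-loss minimizer stays at the median; Huber loss has no closed form in general, which is why the grid search stands in for it here.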
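
The second equality on the "Bayes Error/Risk" slide in Chapter 2 is stated without a derivation. One way to see it, offered here as a sketch, uses the elementary identity $\min(a, b) = \frac{a+b}{2} - \frac{|a-b|}{2}$ with $a = \eta(X)$ and $b = 1 - \eta(X)$:

```latex
\begin{align*}
\min\bigl(\eta(X),\, 1-\eta(X)\bigr)
  &= \frac{\eta(X) + (1-\eta(X))}{2} - \frac{\bigl|\eta(X) - (1-\eta(X))\bigr|}{2} \\
  &= \frac{1}{2} - \frac{1}{2}\,\bigl|2\eta(X) - 1\bigr| .
\end{align*}
% Taking expectations over X on both sides gives
% E[min(eta(X), 1 - eta(X))] = 1/2 - (1/2) E|2 eta(X) - 1|.
```

The risk is therefore largest (equal to 1/2) when $\eta(X) = 1/2$ almost surely, and zero when $\eta(X)$ only takes the values 0 and 1.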
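
The ridge and LASSO slides in Chapter 3 contrast uniform shrinkage with nonlinear shrinkage. When $X$ has orthonormal columns, the residual sum of squares equals a constant plus $\sum_j (\hat\beta_j^{lse} - \beta_j)^2$, so each coefficient can be handled on its own. The sketch below assumes that orthonormal setting; the function names, the value of lambda and the grid search used as a cross-check are illustrative assumptions.

```python
def ridge_shrink(b_lse, lam):
    # Minimizer of (b_lse - beta)^2 + lam * beta^2: proportional shrinkage.
    return b_lse / (1.0 + lam)


def lasso_shrink(b_lse, lam):
    # Minimizer of (b_lse - beta)^2 + lam * |beta|: soft thresholding at lam / 2.
    sign = 1.0 if b_lse > 0 else -1.0 if b_lse < 0 else 0.0
    return sign * max(abs(b_lse) - lam / 2.0, 0.0)


def grid_minimizer(objective, lo=-5.0, hi=5.0, steps=10001):
    # Brute-force check: evaluate the objective on an evenly spaced grid.
    grid = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return min(grid, key=objective)


lam = 1.0  # hypothetical penalty level
for b in (-3.0, -0.4, 0.0, 0.4, 3.0):
    print(f"lse={b:+.1f}  ridge={ridge_shrink(b, lam):+.3f}  "
          f"lasso={lasso_shrink(b, lam):+.3f}")
```

Ridge scales every coefficient by the same factor, whereas the soft-threshold rule sets small coefficients exactly to zero and shifts large ones by a constant; this is the variable-selection behaviour the summary slide attributes to the $L_1$ penalty.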
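
The "Other shrinkage methods" slide defines the adaptive LASSO weights and the SCAD penalty only through formulas. A small sketch of both follows. The default a = 3.7 is a commonly quoted choice for SCAD but is an assumption here, as are the function names and the sample inputs; the notes do not fix any of them.

```python
def adaptive_lasso_weight(beta_lse, q=1.0):
    # w_j = |beta_lse|^(-q): small initial estimates get large penalties.
    # Undefined when beta_lse == 0 (Python raises ZeroDivisionError).
    return abs(beta_lse) ** (-q)


def scad_derivative(x, lam, a=3.7):
    # J'_lambda(x) for x >= 0, written term by term as on the slide.
    if x <= lam:
        return lam
    return lam * (max(a * lam - x, 0.0) / ((a - 1.0) * lam))


lam = 1.0  # hypothetical penalty level
for x in (0.5, 1.0, 2.0, 3.7, 10.0):
    print(f"x={x:>4}: SCAD derivative = {scad_derivative(x, lam):.4f}")

for b in (0.5, 2.0):
    print(f"lse={b}: aLASSO weight = {adaptive_lasso_weight(b)}")
```

For small arguments the SCAD derivative equals lambda, matching the LASSO penalty rate, and it drops to zero once x reaches a times lambda, so large coefficients are left unpenalized at the margin.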
