CHAPTER 1  INTRODUCTION

Statistical Learning
• What is statistical learning?
  – machine learning, data mining
  – supervised vs. unsupervised
• How is it different from traditional inference?
  – different objectives
  – different statistical procedures
  – supervised learning ↔ regression
  – unsupervised learning ↔ density estimation

High dimensional data
• What does "high-dimension" mean?
  – relative to sample sizes
  – curse of dimensionality
  – possibly ultra-high: $p = \exp\{O(n)\}$
• What can we do with "high-dimension" data?
  – two-stage procedure with dimension reduction
  – regularized procedure

Overview of this course
– Learning methods
– Learning theory
– Methods for high-dimensional data

Acknowledgments
– data examples, figures from the HTF book
– learning theories extracted from the DGL book
– the part on high-dimensional data relies on published references
– errors are mine

CHAPTER 2  STATISTICAL DECISION THEORY

Set-up in decision theory
– X: feature variables
– Y: outcome variable (continuous, categorical, ordinal)
– (X, Y) follows some distribution
– goal: determine f : X → Y to minimize some expected loss $E[L(Y, f(X))]$

Loss function L(y, x)
– squared loss: $L(y, x) = (y - x)^2$
– absolute deviation loss: $L(y, x) = |y - x|$
– Huber loss: $L(y, x) = (y - x)^2 I(|y - x| < \delta) + (2\delta|y - x| - \delta^2)\, I(|y - x| \ge \delta)$
– zero-one loss: $L(y, x) = I(y \ne x)$
– preference loss: $L(y_1, y_2, x_1, x_2) = 1 - I(y_1 < y_2,\ x_1 < x_2)$

[Figure: the loss functions above plotted over x ∈ [−2, 2].]

Optimal f(x)
– squared loss: $f(X) = E[Y \mid X]$
– absolute deviation loss: $f(X) = \mathrm{med}(Y \mid X)$
– Huber loss: ???
– zero-one loss: $f(X) = \arg\max_k P(Y = k \mid X)$
– preference loss: ???
– not all loss functions have explicit solutions

Estimate f(x)
– empirical data: $(X_i, Y_i)$, $i = 1, \dots, n$
– direct learning: estimate f directly via parametric, semi-parametric, or nonparametric methods
– indirect learning: estimate f by minimizing the empirical risk $\sum_{i=1}^{n} L(Y_i, f(X_i))$

Candidate set for f(x)
– too small: underfits the data
– too large: overfits the data
– even more important with high-dimensional X

Why is high dimensionality an issue?
– data are sparse
– local approximation is infeasible
– bias and variability increase with dimensionality
– curse of dimensionality

Common considerations for f(x)
– linear functions or local linear functions
– linear combinations of basis functions: polynomials, splines, wavelets
– let the data choose f by penalizing roughness

CHAPTER 3  DIRECT LEARNING: PARAMETRIC APPROACHES

Parametric learning
– One of the direct learning methods.
– Estimate f(x) using parametric models.
– Linear models are often used.

Linear regression model
– Target the squared loss or the zero-one loss.
– Assume $f(X) = E[Y \mid X] = X^T\beta$.
– The least squares estimate is $\hat f(x) = x^T (X^T X)^{-1} X^T Y$.

Shrinkage methods
– Sacrifice a little bias to reduce variance, which may improve prediction accuracy.
– Help to determine the important features (variable selection), if any.
– Include subset selection, ridge regression, LASSO, etc.

Subset selection
– Search for the best subset of size k in terms of RSS.
– Use the leaps-and-bounds procedure.
– Computationally intensive with large dimension.
– The best choice of the size k is based on Mallows' Cp.

Ridge regression
– Minimize $\sum_{i=1}^{n} (Y_i - X_i^T\beta)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$.
– Equivalently, minimize $\sum_{i=1}^{n} (Y_i - X_i^T\beta)^2$ subject to $\sum_{j=1}^{p} \beta_j^2 \le s$.
– The solution is $\hat\beta = (X^T X + \lambda I)^{-1} X^T Y$.
– Has a Bayesian interpretation.

LASSO
– Minimize $\sum_{i=1}^{n} (Y_i - X_i^T\beta)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$.
– Equivalently, minimize $\sum_{i=1}^{n} (Y_i - X_i^T\beta)^2$ subject to $\sum_{j=1}^{p} |\beta_j| \le s$.
– This is a convex optimization problem.
– When the columns of X are orthonormal, $\hat\beta_j = \mathrm{sign}(\hat\beta_j^{\,\mathrm{lse}})\,\big(|\hat\beta_j^{\,\mathrm{lse}}| - \lambda/2\big)_+$.
– Nonlinear shrinkage property.
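The ridge and LASSO formulas above lend themselves to a short numerical check. Below is a minimal Python sketch (not part of the original slides; NumPy, the simulated data, and all variable names are illustrative assumptions). It computes the ridge solution $(X^TX + \lambda I)^{-1}X^TY$ and applies the soft-thresholding formula to a design with orthonormal columns.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative simulated data: n observations, p features (assumption, not from the slides).
n, p = 200, 5
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.0, 0.0, 0.5, 0.0])
Y = X @ beta_true + rng.normal(scale=0.5, size=n)

lam = 1.0  # penalty parameter lambda

# Ordinary least squares: (X'X)^{-1} X'Y
beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)

# Ridge regression closed form: (X'X + lambda I)^{-1} X'Y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

# LASSO when the design has orthonormal columns (Q'Q = I): coordinate-wise
# soft thresholding of the least squares estimate,
#   beta_j = sign(beta_ols_j) * (|beta_ols_j| - lambda/2)_+
Q, _ = np.linalg.qr(X)      # Q has orthonormal columns
beta_ols_orth = Q.T @ Y     # least squares fit for the orthonormal design Q
beta_lasso_orth = np.sign(beta_ols_orth) * np.maximum(np.abs(beta_ols_orth) - lam / 2, 0.0)

print("OLS:               ", np.round(beta_ols, 3))
print("Ridge:             ", np.round(beta_ridge, 3))
print("LASSO (orthonorm.):", np.round(beta_lasso_orth, 3))
```

Ridge pulls every coefficient toward zero but leaves all of them nonzero, while the soft-thresholded LASSO coefficients can land exactly at zero, which is the nonlinear shrinkage property noted above.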
Summary
– Subset selection is L0-penalty shrinkage but computationally intensive.
– Ridge regression is L2-penalty shrinkage and shrinks all coefficients in the same way.
– LASSO is L1-penalty shrinkage and is a nonlinear shrinkage.

Other shrinkage methods
– Lq-penalty with $q \in [1, 2]$: minimize $\sum_{i=1}^{n} (Y_i - X_i^T\beta)^2 + \lambda \sum_{j=1}^{p} |\beta_j|^q$.
– Weighted LASSO (aLASSO): minimize $\sum_{i=1}^{n} (Y_i - X_i^T\beta)^2 + \lambda \sum_{j=1}^{p} w_j |\beta_j|$, where $w_j = |\hat\beta_j^{\,\mathrm{lse}}|^{-q}$.
– SCAD penalty $\sum_{j=1}^{p} J_\lambda(|\beta_j|)$, defined through $J_\lambda'(x) = \lambda\left( I(x \le \lambda) + \frac{(a\lambda - x)_+}{(a-1)\lambda}\, I(x > \lambda)\right)$.

[Figure: thresholding functions, penalized coefficient versus original coefficient: (a) hard threshold, (b) adaptive LASSO (weighted L1 with alpha = 3), (c) SCAD.]

Compare different penalties
– All penalties have shrinkage properties.
– Some penalties give an oracle property, as if the true zeros were known (aLASSO, SCAD).
– But aLASSO needs a consistent initial estimate (not suitable for high dimensions).
– SCAD generally needs a large sample size and may suffer computational difficulty (due to its non-convexity).

Logistic discriminant analysis
– Often used when Y is dichotomous or categorical.
– Assume $P(Y = k \mid X) = \dfrac{\exp\{\beta_{k0} + X^T\beta_k\}}{1 + \sum_{l=1}^{K-1} \exp\{\beta_{l0} + X^T\beta_l\}}$.
– Then $\hat f(x) = \arg\max_k \{\hat\beta_{k0} + x^T\hat\beta_k\}$.

Discriminant analysis
– Assume that X given Y = k follows a normal distribution with mean $\mu_k$ and covariance $\Sigma_k$.
– For K = 2, the decision rule (quadratic discriminant analysis) is based on the sign of
  $\log\dfrac{\pi_2}{\pi_1} - \dfrac{1}{2}\log\dfrac{|\hat\Sigma_2|}{|\hat\Sigma_1|} - \dfrac{1}{2}(x - \hat\mu_2)^T \hat\Sigma_2^{-1}(x - \hat\mu_2) + \dfrac{1}{2}(x - \hat\mu_1)^T \hat\Sigma_1^{-1}(x - \hat\mu_1)$
  (see the sketch at the end of this excerpt).
– If we assume $\Sigma_1 = \Sigma_2$, this results in linear discriminant analysis.

Generalization
– In parametric methods, the features X can be replaced by basis functions of X (a sketch follows below).
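A minimal sketch of this generalization (illustrative code, not from the slides; the polynomial basis, simulated data, and variable names are assumptions): the least squares fit from earlier in this chapter is applied unchanged after replacing a scalar feature x with the basis functions 1, x, x^2, x^3.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative one-dimensional example with a nonlinear regression function (assumption).
n = 300
x = rng.uniform(-2.0, 2.0, size=n)
y = np.sin(2.0 * x) + rng.normal(scale=0.3, size=n)

# Replace the raw feature x by the polynomial basis functions 1, x, x^2, x^3
# and fit by ordinary least squares, exactly as in the linear model.
B = np.vander(x, N=4, increasing=True)          # columns: 1, x, x^2, x^3
beta_hat = np.linalg.solve(B.T @ B, B.T @ y)

# Predictions on a small grid: f_hat(x0) = b(x0)' beta_hat
x0 = np.linspace(-2.0, 2.0, 9)
B0 = np.vander(x0, N=4, increasing=True)
print(np.round(B0 @ beta_hat, 2))
```

The same substitution works for splines or wavelets; only the construction of the basis matrix B changes.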
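Finally, a minimal sketch of the two-class quadratic discriminant rule from the discriminant analysis slide above (illustrative, not from the slides; the simulated classes and plug-in estimates are assumptions). A new point is assigned to class 2 when the score below is positive.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative two-class training data (assumption, not from the slides).
n1, n2 = 100, 120
X1 = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.3], [0.3, 1.0]], size=n1)
X2 = rng.multivariate_normal([1.5, 1.0], [[1.5, -0.2], [-0.2, 0.8]], size=n2)

# Plug-in estimates of the priors, means, and covariances.
pi1, pi2 = n1 / (n1 + n2), n2 / (n1 + n2)
mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
S1, S2 = np.cov(X1, rowvar=False), np.cov(X2, rowvar=False)
S1_inv, S2_inv = np.linalg.inv(S1), np.linalg.inv(S2)

def qda_score(x):
    """Quadratic discriminant score; positive values favor class 2."""
    d1, d2 = x - mu1, x - mu2
    return (np.log(pi2 / pi1)
            - 0.5 * np.log(np.linalg.det(S2) / np.linalg.det(S1))
            - 0.5 * d2 @ S2_inv @ d2
            + 0.5 * d1 @ S1_inv @ d1)

x_new = np.array([1.0, 0.5])
score = qda_score(x_new)
print("QDA score:", round(float(score), 3), "-> class", 2 if score > 0 else 1)
```

Assuming a common covariance (replacing S1 and S2 by a pooled estimate) makes the quadratic and log-determinant terms cancel and yields the linear discriminant rule.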