UNC-Chapel Hill BIOS 740 - Statistical Learning

CHAPTER 1 INTRODUCTION

Statistical Learning
• What is statistical learning?
– machine learning, data mining
– supervised vs. unsupervised
• How does it differ from traditional inference?
– different objectives
– different statistical procedures
– supervised learning ↔ regression
– unsupervised learning ↔ density estimation

High-dimensional data
• What does "high-dimensional" mean?
– relative to the sample size
– curse of dimensionality
– possibly ultra-high: p = exp{O(n)}
• What can we do with high-dimensional data?
– two-stage procedures with dimension reduction
– regularized procedures

Overview of this course
– learning methods
– learning theory
– methods for high-dimensional data

Acknowledgments
– data examples and figures are from the HTF book
– the learning theory is extracted from the DGL book
– the part on high-dimensional data relies on published references
– errors are mine

CHAPTER 2 STATISTICAL DECISION THEORY

Set-up in decision theory
– X: feature variables
– Y: outcome variable (continuous, categorical, or ordinal)
– (X, Y) follows some distribution
– goal: determine f: X → Y to minimize some expected loss E[L(Y, f(X))]

Loss function L(y, x)
– squared loss: L(y, x) = (y − x)^2
– absolute deviation loss: L(y, x) = |y − x|
– Huber loss: L(y, x) = (y − x)^2 I(|y − x| < δ) + (2δ|y − x| − δ^2) I(|y − x| ≥ δ)
– zero-one loss: L(y, x) = I(y ≠ x)
– preference loss: L(y_1, y_2, x_1, x_2) = 1 − I(y_1 < y_2, x_1 < x_2)

[Figure: the loss functions above plotted against x.]

Optimal f(x)
– squared loss: f(X) = E[Y | X]
– absolute deviation loss: f(X) = med(Y | X)
– Huber loss: ???
– zero-one loss: f(X) = argmax_k P(Y = k | X)
– preference loss: ???
– not all loss functions have explicit solutions

Estimating f(x)
– empirical data: (X_i, Y_i), i = 1, ..., n
– direct learning: estimate f directly via parametric, semi-parametric, or nonparametric methods
– indirect learning: estimate f by minimizing the empirical risk $\sum_{i=1}^{n} L(Y_i, f(X_i))$

Candidate set for f(x)
– too small: underfits the data
– too large: overfits the data
– even more important with high-dimensional X

Why is high dimensionality an issue?
– data are sparse
– local approximation is infeasible
– bias and variability increase with dimension
– curse of dimensionality

Common considerations for f(x)
– linear functions or local linear functions
– linear combinations of basis functions: polynomials, splines, wavelets
– let the data choose f by penalizing the roughness of f

CHAPTER 3 DIRECT LEARNING: PARAMETRIC APPROACHES

Parametric learning
– one of the direct learning methods
– estimate f(x) using parametric models
– linear models are often used

Linear regression model
– targets the squared loss or the zero-one loss
– assume f(X) = E[Y | X] = X^T β
– the least-squares fit is $\hat f(x) = x^T (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T Y$, where $\mathbf{X}$ is the design matrix

Shrinkage methods
– gain a reduction in variance by sacrificing unbiasedness
– help determine the important features (variable selection), if any
– include subset selection, ridge regression, the LASSO, etc.
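Before turning to the individual shrinkage methods, a minimal NumPy sketch may help fix ideas; it implements the Chapter 2 loss functions and the least-squares fit above. The function names and the simulated data (including beta_true) are illustrative choices, not part of the course notes.

```python
import numpy as np

# Chapter 2 loss functions, written as functions of the residual r = y - f(x).
def squared_loss(r):
    return r ** 2

def absolute_loss(r):
    return np.abs(r)

def huber_loss(r, delta=1.0):
    # (y - x)^2 when |y - x| < delta; 2*delta*|y - x| - delta^2 otherwise,
    # matching the definition on the loss-function slide.
    r = np.asarray(r, dtype=float)
    return np.where(np.abs(r) < delta, r ** 2, 2 * delta * np.abs(r) - delta ** 2)

# Chapter 3 least-squares fit: f_hat(x) = x^T (X^T X)^{-1} X^T Y.
def least_squares_predict(X, Y, x_new):
    # Solve the normal equations rather than forming an explicit inverse.
    beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
    return x_new @ beta_hat

# Illustrative usage on simulated data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
beta_true = np.array([1.0, -2.0, 0.5])  # made up for this sketch
Y = X @ beta_true + rng.normal(size=100)
residuals = Y - least_squares_predict(X, Y, X)
print(squared_loss(residuals)[:3], huber_loss(residuals, delta=1.0)[:3])
```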
Subset selection
– search for the best subset of size k in terms of RSS
– use the leaps-and-bounds procedure
– computationally intensive when the dimension is large
– the best choice of the size k is based on Mallows' C_p

Ridge regression
– minimize $\sum_{i=1}^{n} (Y_i - X_i^T \beta)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$
– equivalently, minimize $\sum_{i=1}^{n} (Y_i - X_i^T \beta)^2$ subject to $\sum_{j=1}^{p} \beta_j^2 \le s$
– the solution is $\hat\beta = (\mathbf{X}^T \mathbf{X} + \lambda I)^{-1} \mathbf{X}^T Y$
– has a Bayesian interpretation

LASSO
– minimize $\sum_{i=1}^{n} (Y_i - X_i^T \beta)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$
– equivalently, minimize $\sum_{i=1}^{n} (Y_i - X_i^T \beta)^2$ subject to $\sum_{j=1}^{p} |\beta_j| \le s$
– this is a convex optimization problem
– if $\mathbf{X}$ has orthonormal columns, $\hat\beta_j = \mathrm{sign}(\hat\beta_j^{\mathrm{lse}}) (|\hat\beta_j^{\mathrm{lse}}| - \lambda/2)_+$ (soft thresholding)
– nonlinear shrinkage property

Summary
– subset selection is L_0-penalty shrinkage but computationally intensive
– ridge regression is L_2-penalty shrinkage and shrinks all coefficients in the same way
– the LASSO is L_1-penalty shrinkage, and the shrinkage is nonlinear

Other shrinkage methods
– L_q penalty with q ∈ [1, 2]: $\sum_{i=1}^{n} (Y_i - X_i^T \beta)^2 + \lambda \sum_{j=1}^{p} |\beta_j|^q$
– weighted LASSO (adaptive LASSO): $\sum_{i=1}^{n} (Y_i - X_i^T \beta)^2 + \lambda \sum_{j=1}^{p} w_j |\beta_j|$, where $w_j = |\hat\beta_j^{\mathrm{lse}}|^{-q}$
– SCAD penalty $\sum_{j=1}^{p} J_\lambda(|\beta_j|)$, defined through its derivative $J'_\lambda(x) = \lambda \left( I(x \le \lambda) + \frac{(a\lambda - x)_+}{(a - 1)\lambda} I(x > \lambda) \right)$

[Figure: thresholding functions, penalized coefficient vs. unpenalized coefficient, for (a) hard thresholding, (b) adaptive LASSO (weighted L_1 with α = 3), and (c) SCAD.]

Comparing different penalties
– all of these penalties have shrinkage properties
– some penalties give an oracle property, as if the true zeros were known (adaptive LASSO, SCAD)
– but the adaptive LASSO needs a consistent initial estimate (not suitable for high dimensions)
– SCAD generally needs a large sample size and may suffer computational difficulty (due to its non-convexity)

Logistic discriminant analysis
– often used when Y is dichotomous or categorical
– assume $P(Y = k \mid X) = \frac{\exp\{\beta_{k0} + X^T \beta_k\}}{1 + \sum_{l=1}^{K-1} \exp\{\beta_{l0} + X^T \beta_l\}}$
– then $\hat f(x) = \operatorname{argmax}_k \{\hat\beta_{k0} + x^T \hat\beta_k\}$

Discriminant analysis
– assume that X given Y = k follows a normal distribution with mean µ_k and covariance Σ_k
– for K = 2, the decision rule (quadratic discriminant analysis) is based on the sign of
$\log\frac{\hat\pi_2}{\hat\pi_1} - \frac{1}{2}\log\frac{|\hat\Sigma_2|}{|\hat\Sigma_1|} - \frac{1}{2}(x - \hat\mu_2)^T \hat\Sigma_2^{-1} (x - \hat\mu_2) + \frac{1}{2}(x - \hat\mu_1)^T \hat\Sigma_1^{-1} (x - \hat\mu_1)$
– if we assume Σ_1 = Σ_2, this reduces to linear discriminant analysis

Generalization
– in parametric methods, the features X can be replaced by basis functions of X
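To make the ridge and LASSO formulas above concrete, here is a minimal NumPy sketch of the closed-form ridge estimate and the orthonormal-design soft-thresholding solution for the LASSO. The function names, the QR-based design, and beta_true are illustrative choices, not from the course notes.

```python
import numpy as np

def ridge(X, Y, lam):
    # Closed-form ridge estimate: (X^T X + lambda I)^{-1} X^T Y.
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

def lasso_orthonormal(X, Y, lam):
    # LASSO when X has orthonormal columns (X^T X = I): soft-threshold the
    # least-squares estimate, beta_j = sign(b_j) * (|b_j| - lam/2)_+.
    beta_lse = X.T @ Y  # equals (X^T X)^{-1} X^T Y since X^T X = I
    return np.sign(beta_lse) * np.maximum(np.abs(beta_lse) - lam / 2, 0.0)

# Illustrative usage: build an orthonormal design via QR, then compare fits.
rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.normal(size=(100, 5)))  # columns of Q are orthonormal
beta_true = np.array([3.0, 0.0, -1.5, 0.0, 0.0])  # made up for this sketch
Y = Q @ beta_true + 0.1 * rng.normal(size=100)
print("ridge:", np.round(ridge(Q, Y, lam=1.0), 3))
print("lasso:", np.round(lasso_orthonormal(Q, Y, lam=1.0), 3))
```

With λ = 1 the soft threshold λ/2 = 0.5 zeroes out the coefficients whose least-squares estimates are small, while ridge shrinks every coefficient toward zero without setting any exactly to zero, which is the contrast summarized on the comparison slides above.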

