Slide 1: Feature selection, L1 vs. L2 regularization, and rotational invariance
Andrew Ng, ICML 2004
Presented by Paul Hammon, April 14, 2005

Slide 2: Outline
1. Background information
2. L1-regularized logistic regression
3. Rotational invariance and L2-regularized logistic regression
4. Experimental setup and results

Slide 3: Overview
The author discusses regularization as a feature selection approach.
For logistic regression he proves that L1-based regularization is superior to L2 when there are many features.
He proves lower bounds for the sample complexity: the number of training examples needed to learn a classifier.

Slide 4: L1 vs. L2 regularization
The sample complexity of L1-regularized logistic regression is logarithmic in the number of features.
The sample complexity of L2-regularized logistic regression is linear in the number of features.
Simple experiments verify the superiority of L1 over L2 regularization.

Slide 5: Background: Overfitting
Supervised learning algorithms often over-fit training data.
Overfitting occurs when there are so many free parameters that the learning algorithm can fit the training data too closely. This increases the generalization error.
The degree of overfitting depends on several factors:
• Number of training examples: more are better
• Dimensionality of the data: lower dimensionality is better
• Design of the learning algorithm: regularization is better
(Duda, Hart, & Stork 2001)

Slide 6: Overfitting example
Consider the learning algorithm called polynomial regression.
Polynomial regression allows one to find the best-fit polynomial to a set of data points.
Leave-one-out cross-validation estimates how well a learning algorithm generalizes:
CV(n) = (1/n) Σ_{i=1}^{n} (y^(i) - f(x^(i); θ^(-i)))^2
where y^(i) is the class label, x^(i) is the training example, f() is the classifier, and θ^(-i) is the parameter vector trained without the i-th sample.
(Figure from T. Jaakkola lecture notes.)

Slide 7: VC-dimension
The Vapnik-Chervonenkis (VC) dimension measures the geometric complexity of a classifier.
The VC-dimension equals the number of points the classifier can "shatter."
A classifier shatters a set of points by generating all possible labelings.
For example, the VC-dimension of 2-D linear boundaries is 3.
The VC-dimension for most models grows roughly linearly in the number of model parameters (Vapnik, 1982).
(Figure from T. Jaakkola lecture notes.)

Slide 8: Sample complexity
Recall that sample complexity is the number of training examples needed to learn a classifier.
For (unregularized) discriminative models, the sample complexity grows linearly with the VC-dimension.
As the number of model parameters increases, more training examples are necessary to generalize well.
This is a problem for learning with small data sets that have large numbers of dimensions.

Slide 9: Regularization and model complexity
Adding regularization to a learning algorithm avoids overfitting.
Regularization penalizes the complexity of a learning model.
Sparseness is one way to measure complexity: sparse parameter vectors have few non-zero entries.
Regularization based on the zero-norm maximizes sparseness, but zero-norm minimization is an NP-hard problem (Weston et al. 2003).
Regularization based on the L1 norm drives many parameters to zero.
L2 norm regularization does not achieve the same level of sparseness (Hastie et al. 2001).
||θ||_0 = number of non-zero entries of θ
||θ||_1 = Σ_{i=1}^{n} |θ_i|
||θ||_2 = (Σ_{i=1}^{n} θ_i^2)^(1/2)

Slide 10: Logistic regression
Logistic regression (LR) is a binary classifier. The LR model is
p(y = 1 | x; θ) = 1 / (1 + exp(-θ^T x))    (1)
where y ∈ {0, 1} is the class label, x is a training point, θ is the parameter vector, and the data is assumed to be drawn i.i.d. from a distribution D.
The logistic function g(z) = 1 / (1 + exp(-z)) turns linear predictions into [0, 1].
(Figure of the logistic function from A. Ng lecture notes.)
To simplify notation, each point x is formed by adding a constant feature, x = [x0, 1]^T. This removes the need for a separate offset term.
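To make the model in (1) concrete, here is a minimal NumPy sketch of the logistic function and the LR prediction. It is not from the slides; the function names and example values are illustrative.

```python
import numpy as np

def logistic(z):
    # Logistic function g(z) = 1 / (1 + exp(-z)); maps any real score into [0, 1].
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(theta, x):
    # LR model (1): p(y = 1 | x; theta) = g(theta^T x).
    return logistic(theta @ x)

# Append the constant feature so no separate offset term is needed: x = [x0, 1]^T.
x0 = np.array([0.5, -1.2])          # original features (illustrative values)
x = np.append(x0, 1.0)
theta = np.array([2.0, -0.7, 0.3])  # illustrative parameter vector
print(predict_proba(theta, x))      # estimated probability that y = 1
```

Thresholding this probability at 0.5 gives the hard 0/1 prediction used later in the misclassification-error metric.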
Slide 11: Training regularized LR
To learn the LR parameters we use maximum likelihood. Using the model in (1) and an i.i.d. assumption, the log-likelihood is
ℓ(θ) = log Π_i p(y^(i) | x^(i); θ) = Σ_i log p(y^(i) | x^(i); θ)
where i indexes the training points.
We maximize the regularized log-likelihood
θ̂ = argmax_θ [ℓ(θ) - α R(θ)], with R(θ) ∈ {||θ||_1, ||θ||_2^2}    (2)
where α determines the amount of regularization.

Slide 12: Training regularized LR
An equivalent formulation to (2) is
max_θ Σ_i log p(y^(i) | x^(i); θ) subject to R(θ) ≤ B    (3)
For every α in (2), an equivalent B in (3) can be found. This relationship can be seen by forming the Lagrangian.
This formulation has the advantage that the regularization parameter B bounds the regularization term R(θ).
The algorithm and proof for L1-regularized LR use (3).

Slide 13: Metrics of generalization error
One metric for error is the negative log-likelihood (log-loss)
ε_ℓ(θ) = E_{(x,y)~D}[-log p(y | x; θ)]
where the subscript (x,y)~D indicates that the expectation is over test samples drawn from D. This has a corresponding empirical log-loss
ε̂_ℓ(θ) = -(1/m) Σ_{i=1}^{m} log p(y^(i) | x^(i); θ)
The proofs in this paper use this log-loss function.
Another metric is the misclassification error
δ_m(θ) = P_{(x,y)~D}[t(g(θ^T x)) ≠ y]
where t(z) is a threshold function equal to 0 for z < 0.5 and equal to 1 otherwise. This also has a corresponding empirical misclassification error δ̂_m.

Slide 14: Regularized regression algorithm
The L1-regularized logistic regression algorithm is as follows (a code sketch appears after slide 16 below):
1. Split the training data S into a training set S1 containing the first (1 - γ)m examples and a hold-out set S2 containing the remaining γm examples.
2. For B = 0, 1, 2, …, C, fit a logistic regression model to the training set S1 using
   max_θ Σ_i log p(y^(i) | x^(i); θ) subject to ||θ||_1 ≤ B
   and store the resulting parameter vector as θ_B.
3. Pick the θ_B that produces the lowest error score on the hold-out set S2.
The theoretical results in this paper only apply to the log-loss error ε_ℓ.

Slide 15: L1-regularized logistic regression
Theorem 1: Let any ε > 0, δ > 0, and K ≥ 1; let m be the number of training examples, n the number of features, and C = rK (the largest value of B tested).
Suppose there exists a parameter vector θ* such that (a) only r components of θ* are non-zero, and (b) every component of θ* is at most K in magnitude.
We want the parameter output by our learning algorithm to perform nearly as well as θ*:
ε_ℓ(θ̂) ≤ ε_ℓ(θ*) + ε
To guarantee this with probability at least 1 - δ, it suffices that
m = Ω((log n) · poly(r, K, log(1/δ), 1/ε, C))
where Ω denotes a lower bound: f(n) = Ω(g(n)) means that for some constant c > 0 and all sufficiently large n, f(n) ≥ c·g(n).

Slide 16: Background of theorem 1 proof
The proof of this theorem relies on covering number bounds (Anthony & Bartlett, 1999).
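As a companion to the algorithm on slide 14, here is a sketch of the hold-out selection loop in Python. It assumes scikit-learn is available and uses the penalized form (2) (LogisticRegression with an L1 penalty) rather than the constrained form (3); by the Lagrangian equivalence noted on slide 12, sweeping the penalty strength stands in for sweeping the bound B. The synthetic data, candidate grid, and variable names are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

# Synthetic data: many features, but only the first two carry signal (illustrative).
rng = np.random.default_rng(0)
m, n_features = 60, 100
X = rng.normal(size=(m, n_features))
y = (X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.normal(size=m) > 0).astype(int)

# Step 1: split S into a training set S1 and a hold-out set S2.
gamma = 0.3                                   # hold-out fraction
m1 = int((1 - gamma) * m)
X1, y1, X2, y2 = X[:m1], y[:m1], X[m1:], y[m1:]

# Step 2: fit an L1-regularized LR model for each candidate penalty strength.
best = None
for c in [0.01, 0.03, 0.1, 0.3, 1.0, 3.0, 10.0]:
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=c)
    clf.fit(X1, y1)
    # Step 3: score each fitted model by its log-loss on the hold-out set S2.
    err = log_loss(y2, clf.predict_proba(X2), labels=[0, 1])
    if best is None or err < best[0]:
        best = (err, c, clf)

print("chosen penalty strength:", best[1])
print("non-zero coefficients:", np.count_nonzero(best[2].coef_), "of", n_features)
```

With only a couple of informative features, the selected L1 model typically zeroes out most coefficients, which is the sparseness the deck contrasts with L2 regularization.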


