Regularization in Neural Networks
Sargur Srihari, Machine Learning (UB CSE 574)

Topics in Neural Network Regularization
• What is regularization?
• Methods
  1. Determining the optimal number of hidden units
  2. Use of a regularizer in the error function
     • Linear transformations and consistent Gaussian priors
  3. Early stopping
• Invariances
  • Tangent propagation
  • Training with transformed data
  • Convolutional networks
  • Soft weight sharing

What is Regularization?
• In machine learning (and likewise in statistics and inverse problems): introducing additional information to prevent over-fitting (or to solve an ill-posed problem)
• This information is usually a penalty for complexity, e.g.
  • restrictions on smoothness
  • bounds on the vector-space norm
• Theoretical justification: regularization attempts to impose Occam's razor on the solution
• From a Bayesian point of view, regularization corresponds to imposing prior distributions on the model parameters

1. Regularization by Determining the Number of Hidden Units
• The numbers of input and output units are determined by the dimensionality of the data set
• The number of hidden units M is a free parameter, adjusted to give the best predictive performance
• One possible approach is a maximum likelihood estimate of M that balances under-fitting against over-fitting

Effect of Varying the Number of Hidden Units
• Sinusoidal regression problem: a two-layer network trained on 10 data points with M = 1, 3 and 10 hidden units
• Trained by minimizing a sum-of-squares error function using conjugate gradient descent
• Generalization error is not a simple function of M, due to the presence of local minima in the error function

Using a Validation Set to Determine the Number of Hidden Units
• Plot the sum-of-squares test error (for the polynomial data) against the number of hidden units M, using 30 random starts for each M
• The overall best validation-set performance occurred at M = 8

2. Regularization Using Simple Weight Decay
• Because generalization error is not a simple function of M (owing to local minima), the number of hidden units alone is an awkward way to control complexity and avoid over-fitting
• Instead, choose a relatively large M and control complexity by adding a regularization term to the error function
• The simplest regularizer is weight decay (a short code sketch is given after the next slide):

  \tilde{E}(w) = E(w) + \frac{\lambda}{2} w^T w

• Effective model complexity is then determined by the choice of regularization coefficient \lambda
• This regularizer is equivalent to a zero-mean Gaussian prior over the weight vector w
• Simple weight decay has certain shortcomings, described next

Consistent Gaussian Priors
• Simple weight decay is inconsistent with certain scaling properties of network mappings
• To show this, consider a multi-layer perceptron with two layers of weights and linear output units, with input variables {x_i} and output variables {y_k}
• The activations of the hidden units in the first layer have the form

  z_j = h\left( \sum_i w_{ji} x_i + w_{j0} \right)

• and the activations of the output units are

  y_k = \sum_j w_{kj} z_j + w_{k0}
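To make the weight-decay penalty and the two-layer network above concrete, here is a minimal NumPy sketch; it is not part of the original slides, and the function names, the choice of tanh for the hidden activation h, and the value of λ are illustrative assumptions.

import numpy as np

def forward(params, x):
    """Two-layer network: z_j = h(sum_i w_ji x_i + w_j0), y_k = sum_j w_kj z_j + w_k0."""
    W1, b1, W2, b2 = params            # first-layer weights/biases, second-layer weights/biases
    z = np.tanh(x @ W1.T + b1)         # hidden-unit activations (h = tanh assumed here)
    y = z @ W2.T + b2                  # linear output units
    return y

def regularized_error(params, x, t, lam=0.01):
    """Sum-of-squares error plus the simple weight-decay penalty (lam/2) w^T w."""
    W1, b1, W2, b2 = params
    y = forward(params, x)
    data_term = 0.5 * np.sum((y - t) ** 2)                            # E(w)
    penalty = 0.5 * lam * sum(np.sum(p ** 2) for p in (W1, b1, W2, b2))
    return data_term + penalty

# Illustrative usage on a 10-point sinusoidal regression problem like the one above
rng = np.random.default_rng(0)
params = (rng.normal(size=(3, 1)), np.zeros(3), rng.normal(size=(1, 3)), np.zeros(1))
x = np.linspace(-1, 1, 10).reshape(-1, 1)
t = np.sin(2 * np.pi * x)
print(regularized_error(params, x, t))

Note that the penalty above sums over every parameter, biases included, so it treats all weights and biases on an equal footing; the next slides explain why that is undesirable.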
Linear Transformations of Input/Output Variables
• Suppose we perform a linear transformation of the input data:

  x_i \rightarrow \tilde{x}_i = a x_i + b

• We can arrange for the mapping performed by the network to be unchanged if we transform the weights and biases from the inputs to the hidden units as

  w_{ji} \rightarrow \tilde{w}_{ji} = \frac{1}{a} w_{ji}, \qquad w_{j0} \rightarrow \tilde{w}_{j0} = w_{j0} - \frac{b}{a} \sum_i w_{ji}

• Similarly, a linear transformation of the output variables of the network,

  y_k \rightarrow \tilde{y}_k = c y_k + d

• can be achieved by transforming the second-layer weights and biases as

  w_{kj} \rightarrow \tilde{w}_{kj} = c w_{kj}, \qquad w_{k0} \rightarrow \tilde{w}_{k0} = c w_{k0} + d

Desirable Invariance Property of a Regularizer
• Suppose we train one network on the original data, and another network on data whose input and/or target variables have been transformed by one of these linear transformations
• The two trained networks should then differ only by the weight transformations given above
• A regularizer should respect this property; otherwise it arbitrarily favors one of two equivalent solutions over the other
• Simple weight decay does not have this property, because it treats all weights and biases on an equal footing

A Regularizer Invariant under Linear Transformation
• A regularizer that is invariant to re-scaling of the weights and to shifts of the biases is

  \frac{\lambda_1}{2} \sum_{w \in W_1} w^2 + \frac{\lambda_2}{2} \sum_{w \in W_2} w^2

  where W_1 is the set of weights in the first layer, W_2 the set of weights in the second layer, and the biases are excluded from the sums
• This regularizer remains unchanged under the weight transformations above provided the regularization coefficients are re-scaled as \lambda_1 \rightarrow a^2 \lambda_1 and \lambda_2 \rightarrow c^{-2} \lambda_2 (a numerical check is sketched at the end of this section)
• However, it corresponds to a prior of the form

  p(w \mid \alpha_1, \alpha_2) \propto \exp\left( -\frac{\alpha_1}{2} \sum_{w \in W_1} w^2 - \frac{\alpha_2}{2} \sum_{w \in W_2} w^2 \right)

  where \alpha_1 and \alpha_2 are hyper-parameters
• Because the biases are unconstrained, this is an improper prior which cannot be normalized
• Improper priors lead to difficulties in selecting regularization coefficients and in model comparison within the Bayesian framework
• Instead, include separate priors for the biases, with their own hyper-parameters

Example: Effect of the Hyper-parameters
• The priors are then governed by four hyper-parameters: α_1^b, the precision of the Gaussian prior over the first-layer biases; α_1^w, over the first-layer weights; α_2^b, over the second-layer bias; α_2^w, over the second-layer weights
• Network with a single input (x ranging from -1 to +1), a single linear output (y ranging from -60 to +40), and 12 hidden units with tanh activation functions
• Samples are drawn from the prior and the corresponding network functions are plotted; five samples correspond to five colors, one set of functions for each setting of the hyper-parameters

3. Early Stopping
• An alternative to regularization as a way of controlling complexity
• Error measured on an independent validation set shows an initial decrease as training proceeds, followed by an increase
• Training is stopped at the point of smallest error on the validation data (a code sketch follows at the end of this section)
• This effectively limits the network complexity
• [Figure: training-set error and validation-set error plotted against iteration step]

Interpreting the Effect of Early Stopping
• Consider a quadratic error function
• Axes in weight space are parallel to ...
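As an illustration of the early-stopping procedure (a sketch added here for illustration, not material from the slides), the loop below trains until the validation-set error stops improving and returns the weights from the point of smallest validation error; train_epoch, validation_error, and the patience threshold are hypothetical names and values assumed for this example.

import copy

def train_with_early_stopping(model, train_epoch, validation_error,
                              max_iters=1000, patience=20):
    """Stop when the validation error has not improved for `patience` iterations.

    `train_epoch(model)` is assumed to perform one optimization step in place,
    and `validation_error(model)` to return the error on a held-out validation set.
    """
    best_error = float("inf")
    best_model = copy.deepcopy(model)
    steps_since_best = 0
    for step in range(max_iters):
        train_epoch(model)                     # one step of gradient-based training
        err = validation_error(model)          # error on the independent validation set
        if err < best_error:                   # validation error still decreasing
            best_error = err
            best_model = copy.deepcopy(model)  # remember the weights at the minimum
            steps_since_best = 0
        else:                                  # validation error has started to rise
            steps_since_best += 1
            if steps_since_best >= patience:
                break                          # stop near the smallest validation error
    return best_model, best_error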

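Returning to the layer-wise regularizer (λ_1/2) Σ_{w∈W_1} w² + (λ_2/2) Σ_{w∈W_2} w², the small NumPy check below (added for illustration, not part of the slides) applies the weight re-scalings from the linear-transformation slides to a randomly initialized network and confirms that the regularizer's value is unchanged once λ_1 and λ_2 are re-scaled accordingly; the particular values of a, c, λ_1 and λ_2 are arbitrary.

import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(size=(12, 1))     # first-layer weights (12 hidden units, 1 input)
W2 = rng.normal(size=(1, 12))     # second-layer weights (1 linear output)
lam1, lam2 = 0.3, 0.7             # regularization coefficients for the two layers
a, c = 2.5, 0.4                   # input scale (x -> a*x + b) and output scale (y -> c*y + d)

def layerwise_regularizer(W1, W2, lam1, lam2):
    """(lam1/2) * sum of squared first-layer weights + (lam2/2) * sum for the second layer."""
    return 0.5 * lam1 * np.sum(W1 ** 2) + 0.5 * lam2 * np.sum(W2 ** 2)

# Re-scale the weights so the network computes the same mapping on the transformed
# inputs/outputs; the biases absorb b and d and are excluded from the regularizer.
W1_t = W1 / a
W2_t = c * W2

original    = layerwise_regularizer(W1, W2, lam1, lam2)
transformed = layerwise_regularizer(W1_t, W2_t, a ** 2 * lam1, lam2 / c ** 2)
print(np.isclose(original, transformed))   # True: the regularizer is invariant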
