Machine Learning: Neural Network Training
Sargur Srihari

Topics
• Neural network parameters
• Probabilistic problem formulation
• Determining the error function
  • Regression
  • Binary classification
  • Multi-class classification
• Parameter optimization
• Local quadratic approximation
• Use of gradient optimization
• Gradient descent optimization

Neural Network Parameters
• Linear models for regression and classification can be represented as
  y(x, w) = f\left( \sum_{j=1}^{M} w_j \phi_j(x) \right)
  which are linear combinations of basis functions \phi_j(x)
• In a neural network the basis functions \phi_j(x) themselves depend on parameters
• During training these parameters are adjusted along with the coefficients w_j

Network Training: Sum-of-Squared Errors
• A neural network performs a transformation from a vector x of input variables to a vector y of output variables. With D input variables and M hidden units:
  y_k(x, w) = \sigma\left( \sum_{j=1}^{M} w_{kj}^{(2)} \, h\left( \sum_{i=1}^{D} w_{ji}^{(1)} x_i \right) \right)
• To determine w, a simple analogy with polynomial curve fitting is to minimize a sum-of-squares error function
• Given a set of input vectors \{x_n\}, n = 1,..,N, and target vectors \{t_n\}, minimize the error function
  E(w) = \frac{1}{2} \sum_{n=1}^{N} \| y(x_n, w) - t_n \|^2
• Consider a more general probabilistic interpretation

Probabilistic View: the activation function f determines the error function E (defined by the likelihood function)
1. Regression
  • f: identity, y(x, w) = w^T \phi(x) = \sum_{j=1}^{M} w_j \phi_j(x)
  • E: sum-of-squares error / maximum likelihood
    E(w) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, w) - t_n \}^2
2. (Multiple independent) binary classifications
  • f: logistic sigmoid, y(x, w) = \sigma(w^T \phi(x)) = \frac{1}{1 + \exp(-w^T \phi(x))}
  • E: cross-entropy error function
    E(w) = - \sum_{n=1}^{N} \{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \}
3. Multiclass classification
  • f: softmax outputs, y_k(x, w) = \frac{\exp(w_k^T \phi(x))}{\sum_j \exp(w_j^T \phi(x))}
  • E: cross-entropy error function
    E(w) = - \sum_{n=1}^{N} \sum_{k=1}^{K} t_{kn} \ln y_k(x_n, w)

1. Probabilistic View: Regression
• Output is a single target variable t that can take any real value
• Assume t is Gaussian distributed with an x-dependent mean:
  p(t | x, w) = \mathcal{N}(t \mid y(x, w), \beta^{-1})
• Likelihood function:
  p(t | x, w, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid y(x_n, w), \beta^{-1})
• Taking the negative logarithm gives the negative log-likelihood
  \frac{\beta}{2} \sum_{n=1}^{N} \{ y(x_n, w) - t_n \}^2 - \frac{N}{2} \ln \beta + \frac{N}{2} \ln(2\pi)
  which can be used to learn the parameters w and \beta

Regression Error Function
• The likelihood function could be used to learn w and \beta; this is usually done in a Bayesian treatment
• In the neural network literature, minimizing an error function is used instead; the two are equivalent here. The sum-of-squares error is
  E(w) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, w) - t_n \}^2
• Its smallest value occurs when \nabla E(w) = 0
• Since E(w) is non-convex, the solution w_{ML} is found by iterative optimization, w^{(\tau+1)} = w^{(\tau)} + \Delta w^{(\tau)}
  • e.g., gradient descent (discussed later in this lecture)
  • another approach is back-propagation
• Since the regression output activation is the identity, y_k = a_k, so
  \frac{\partial E}{\partial a_k} = y_k - t_k, \quad \text{where } a_k = \sum_{i=1}^{M} w_{ki}^{(2)} x_i + w_{k0}^{(2)}, \; k = 1,..,K
• Having found w_{ML}, the noise precision \beta_{ML} can also be found:
  \frac{1}{\beta_{ML}} = \frac{1}{N} \sum_{n=1}^{N} \{ y(x_n, w_{ML}) - t_n \}^2
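To make the regression case concrete, here is a minimal numpy sketch (not from the slides): it evaluates the two-layer network transformation above, computes the sum-of-squares error E(w), and estimates \beta_{ML} from the residuals as if the current weights were w_{ML}. The tanh hidden activation, the weight shapes, and the random data are all illustrative assumptions.

```python
import numpy as np

def forward(X, W1, b1, W2, b2):
    """Two-layer network: D inputs -> M tanh hidden units -> linear output.

    Identity output activation (y_k = a_k), as appropriate for regression."""
    Z = np.tanh(X @ W1 + b1)   # hidden-unit activations h(.)
    return Z @ W2 + b2

def sum_of_squares(Y, T):
    """E(w) = 1/2 sum_n ||y(x_n, w) - t_n||^2"""
    return 0.5 * np.sum((Y - T) ** 2)

# Illustrative data: N=100 points, D=3 inputs, M=5 hidden units, 1 output
rng = np.random.default_rng(0)
N, D, M = 100, 3, 5
X = rng.normal(size=(N, D))
T = rng.normal(size=(N, 1))
W1, b1 = rng.normal(size=(D, M)), np.zeros(M)
W2, b2 = rng.normal(size=(M, 1)), np.zeros(1)

Y = forward(X, W1, b1, W2, b2)
E = sum_of_squares(Y, T)

# Noise precision at the (here hypothetical) ML solution:
# 1/beta_ML = (1/N) sum_n {y(x_n, w_ML) - t_n}^2
beta_ml = N / np.sum((Y - T) ** 2)
print(E, beta_ml)
```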
2. Binary Classification
• Single target variable t, where t = 1 denotes class C_1 and t = 0 denotes class C_2
• Consider a network with a single output whose activation function is the logistic sigmoid
  y = \sigma(a) = \frac{1}{1 + \exp(-a)}
  so that 0 < y(x, w) < 1
• Interpret y(x, w) as the conditional probability p(C_1 | x)
• The conditional distribution of targets given inputs is then
  p(t | x, w) = y(x, w)^t \{ 1 - y(x, w) \}^{1-t}

Binary Classification Error Function
• The error function is the negative log-likelihood, which in this case is a cross-entropy error function
  E(w) = - \sum_{n=1}^{N} \{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \}
  where y_n denotes y(x_n, w)
• Using the cross-entropy error function instead of sum-of-squares leads to faster training and improved generalization (a numerical sketch of these error functions follows the last slide)

2. K Separate Binary Classifications
• The network has K outputs, each with a logistic sigmoid activation function
• Associated with each output is a binary class label t_k, k = 1,..,K:
  p(t | x, w) = \prod_{k=1}^{K} y_k(x, w)^{t_k} \left[ 1 - y_k(x, w) \right]^{1 - t_k}
• Taking the negative logarithm of the likelihood function:
  E(w) = - \sum_{n=1}^{N} \sum_{k=1}^{K} \{ t_{nk} \ln y_{nk} + (1 - t_{nk}) \ln(1 - y_{nk}) \}
  where y_{nk} denotes y_k(x_n, w)

3. Multiclass Classification
• Each input is assigned to one of K classes
• The binary target variables t_k \in \{0, 1\} use a 1-of-K coding scheme
• Network outputs are interpreted as y_k(x, w) = p(t_k = 1 | x)
• This leads to the error function
  E(w) = - \sum_{n=1}^{N} \sum_{k=1}^{K} t_{kn} \ln y_k(x_n, w)
• The output unit activation function is the softmax
  y_k(x, w) = \frac{\exp(a_k(x, w))}{\sum_j \exp(a_j(x, w))}

Parameter Optimization
• Task: find the weight vector w which minimizes the chosen error function E(w)
• Geometrical picture: the error function has a highly nonlinear dependence on the weights

Parameter Optimization: Geometrical View
• E(w) is a surface sitting over weight space
  • w_A: a local minimum; w_B: the global minimum
  • we need to find such a minimum
• At a point w_C the local gradient is given by the vector \nabla E(w)
  • it points in the direction of greatest rate of increase of E(w)
  • the negative gradient points in the direction of greatest rate of decrease

Finding w where E(w) is Smallest
• A small step from w to w + \delta w leads to a change in the error function
  \delta E \approx \delta w^T \nabla E(w)
• The minimum of E(w) will occur when \nabla E(w) = 0
• Points at which the gradient vanishes are stationary points: minima, maxima, and saddle points
• The error surface is complex: there is no hope of finding an analytical solution to the equation \nabla E(w) = 0

Iterative Numerical Procedure for Minima
• Since there is no analytical solution, choose an initial w^{(0)} and update it iteratively,
  w^{(\tau+1)} = w^{(\tau)} + \Delta w^{(\tau)}
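The classification error functions from the slides above translate directly into numpy. The sketch below is illustrative rather than part of the lecture: the epsilon clipping and the row-max shift in the softmax are standard numerical safeguards that the formulas themselves omit. Note that binary_cross_entropy also covers the K-separate-binary case when y and t have shape (N, K).

```python
import numpy as np

EPS = 1e-12  # numerical safeguard; the slide formulas assume y strictly in (0, 1)

def binary_cross_entropy(y, t):
    """E(w) = -sum_n { t_n ln y_n + (1 - t_n) ln(1 - y_n) }.

    y: sigmoid outputs, t: binary targets; shape (N,) for a single output,
    or (N, K) for K separate binary classifications (summed over n and k)."""
    y = np.clip(y, EPS, 1 - EPS)
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

def softmax(A):
    """y_k = exp(a_k) / sum_j exp(a_j), row-wise over activations A of shape (N, K).

    Subtracting the row maximum leaves the result unchanged but avoids overflow."""
    A = A - A.max(axis=1, keepdims=True)
    expA = np.exp(A)
    return expA / expA.sum(axis=1, keepdims=True)

def multiclass_cross_entropy(Y, T):
    """E(w) = -sum_n sum_k t_kn ln y_k(x_n, w).

    Y: softmax outputs, T: 1-of-K coded targets, both of shape (N, K)."""
    return -np.sum(T * np.log(np.clip(Y, EPS, 1.0)))

# Illustrative usage on random activations for N=4 points, K=3 classes
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))
Y = softmax(A)
T = np.eye(3)[[0, 2, 1, 0]]   # 1-of-K targets
print(multiclass_cross_entropy(Y, T))
```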
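As a preview of the gradient descent optimization listed in the topics, here is a minimal sketch of the iterative update w^{(\tau+1)} = w^{(\tau)} + \Delta w^{(\tau)} with \Delta w = -\eta \nabla E. The central-difference gradient, the learning rate \eta, the step count, and the quadratic demo surface are all illustrative assumptions; in a real network the gradient would be computed by back-propagation.

```python
import numpy as np

def numerical_grad(E, w, h=1e-6):
    """Central-difference estimate of grad E(w).

    Adequate for a tiny demo; back-propagation computes the gradient
    of a network's error function exactly and far more efficiently."""
    g = np.zeros_like(w)
    for i in range(w.size):
        d = np.zeros_like(w)
        d[i] = h
        g[i] = (E(w + d) - E(w - d)) / (2 * h)
    return g

def gradient_descent(E, w0, eta=0.1, n_steps=200):
    """Iterate w^(tau+1) = w^(tau) - eta * grad E(w^(tau))."""
    w = w0.copy()
    for _ in range(n_steps):
        w = w - eta * numerical_grad(E, w)
    return w

# Demo on a simple convex quadratic surface (illustrative, not a network):
# E(w) = 1/2 ||w - w*||^2 has its one stationary point at w*, where grad E = 0.
w_star = np.array([1.0, -2.0])
E = lambda w: 0.5 * np.sum((w - w_star) ** 2)
print(gradient_descent(E, np.zeros(2)))   # converges toward [1, -2]
```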