UB CSE 574 - Neural Network Training

Neural Network Training
Sargur Srihari

Topics
• Neural network parameters
• Probabilistic problem formulation
• Determining the error function
  • Regression
  • Binary classification
  • Multi-class classification
• Parameter optimization
• Local quadratic approximation
• Use of gradient optimization
• Gradient descent optimization

Neural Network Parameters
• Linear models for regression and classification can be represented as
  y(x, w) = f\left( \sum_{j=1}^{M} w_j \phi_j(x) \right)
  which are linear combinations of basis functions \phi_j(x)
• In a neural network the basis functions \phi_j(x) themselves depend on parameters
• During training these parameters are adjusted along with the coefficients w_j

Network Training: Sum-of-Squares Error
• A neural network performs a transformation from a vector x of input variables to a vector y of output variables:
  y_k(x, w) = \sigma\left( \sum_{j=1}^{M} w_{kj}^{(2)} \, h\left( \sum_{i=1}^{D} w_{ji}^{(1)} x_i \right) \right)
  with D input variables, M hidden units and N training vectors
• To determine w, a simple analogy with polynomial curve fitting is to minimize a sum-of-squares error function
• Given input vectors {x_n}, n = 1,..,N, and target vectors {t_n}, minimize the error function
  E(w) = \frac{1}{2} \sum_{n=1}^{N} \| y(x_n, w) - t_n \|^2
• Consider a more general probabilistic interpretation

Probabilistic View: the activation function f determines the error function E (as defined by the likelihood function)
1. Regression
  • f: identity, y(x, w) = w^T \phi(x) = \sum_{j=1}^{M} w_j \phi_j(x)
  • E: sum-of-squares error / maximum likelihood,
    E(w) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, w) - t_n \}^2
2. (Multiple independent) binary classifications
  • f: logistic sigmoid, y(x, w) = \sigma(w^T \phi(x)) = \frac{1}{1 + \exp(-w^T \phi(x))}
  • E: cross-entropy error function,
    E(w) = -\sum_{n=1}^{N} \{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \}
3. Multiclass classification
  • f: softmax outputs, y_k(x, w) = \frac{\exp(w_k^T \phi(x))}{\sum_j \exp(w_j^T \phi(x))}
  • E: cross-entropy error function,
    E(w) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{kn} \ln y_k(x_n, w)

1. Probabilistic View: Regression
• Output is a single target variable t that can take any real value
• Assume t is Gaussian distributed with an x-dependent mean:
  p(t \mid x, w) = \mathcal{N}(t \mid y(x, w), \beta^{-1})
• Likelihood function:
  p(\mathbf{t} \mid \mathbf{x}, w, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid y(x_n, w), \beta^{-1})
• Taking the negative logarithm gives the negative log-likelihood
  \frac{\beta}{2} \sum_{n=1}^{N} \{ y(x_n, w) - t_n \}^2 - \frac{N}{2} \ln \beta + \frac{N}{2} \ln(2\pi)
  which can be used to learn the parameters w and \beta

Regression Error Function
• The likelihood function could be used to learn the parameters w and \beta
  • this is usually done in a Bayesian treatment
• In the neural-network literature, minimizing an error function is used instead; the two are equivalent here. The sum-of-squares error is
  E(w) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, w) - t_n \}^2
• Its smallest value occurs where \nabla E(w) = 0
• Since E(w) is non-convex, the solution w_{ML} is found by iterative optimization,
  w^{(\tau+1)} = w^{(\tau)} + \Delta w^{(\tau)}
  • gradient descent (discussed later in this lecture)
  • the required gradients can be evaluated efficiently by back-propagation
• Since the regression output equals the activation, y_k = a_k, we have
  \frac{\partial E}{\partial a_k} = y_k - t_k, \qquad a_k = \sum_{i=1}^{M} w_{ki}^{(2)} z_i + w_{k0}^{(2)}, \quad k = 1,..,K
  where z_i are the hidden-unit outputs
• Having found w_{ML}, the value of \beta_{ML} follows from
  \frac{1}{\beta_{ML}} = \frac{1}{N} \sum_{n=1}^{N} \{ y(x_n, w_{ML}) - t_n \}^2
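A minimal NumPy sketch of the regression quantities above: the sum-of-squares error E(w) for a one-hidden-layer network and the maximum-likelihood precision \beta_{ML} computed from the residuals. The network shape, tanh hidden units, random data and weights are illustrative assumptions (biases are omitted for brevity); this is not the lecture's own code.

```python
import numpy as np

# Sketch: sum-of-squares error for a one-hidden-layer regression network
# y_k(x, w) = sum_j w2[k, j] * h( sum_i w1[j, i] * x_i ), with h = tanh and an
# identity output activation (regression).  Data and weights are illustrative.

rng = np.random.default_rng(0)
N, D, M, K = 100, 3, 5, 1          # training vectors, inputs, hidden units, outputs

X = rng.normal(size=(N, D))        # input vectors {x_n}
T = rng.normal(size=(N, K))        # target vectors {t_n}
W1 = rng.normal(size=(M, D))       # first-layer weights w_ji^(1)
W2 = rng.normal(size=(K, M))       # second-layer weights w_kj^(2)

def forward(X, W1, W2):
    """Network outputs y(x, w); identity output activation for regression."""
    Z = np.tanh(X @ W1.T)          # hidden-unit outputs z_j = h(a_j)
    return Z @ W2.T                # y_k = a_k

def sum_of_squares_error(Y, T):
    """E(w) = 1/2 * sum_n || y(x_n, w) - t_n ||^2"""
    return 0.5 * np.sum((Y - T) ** 2)

Y = forward(X, W1, W2)
E = sum_of_squares_error(Y, T)

# Maximum-likelihood noise precision from the residuals:
# 1 / beta_ML = (1/N) * sum_n { y(x_n, w) - t_n }^2   (here w is not yet optimized)
beta_ml = 1.0 / np.mean((Y - T) ** 2)

print(f"E(w) = {E:.3f},  beta_ML = {beta_ml:.3f}")
```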
2. Binary Classification
• Single target variable t, where t = 1 denotes class C_1 and t = 0 denotes class C_2
• Consider a network with a single output whose activation function is the logistic sigmoid
  y = \sigma(a) = \frac{1}{1 + \exp(-a)}
  so that 0 < y(x, w) < 1
• Interpret y(x, w) as the conditional probability p(C_1 \mid x)
• Conditional distribution of targets given inputs:
  p(t \mid x, w) = y(x, w)^t \{ 1 - y(x, w) \}^{1 - t}

Binary Classification Error Function
• The error function is the negative log-likelihood, which in this case is a cross-entropy error function
  E(w) = -\sum_{n=1}^{N} \{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \}
  where y_n denotes y(x_n, w)
• Using the cross-entropy error function instead of sum-of-squares leads to faster training and improved generalization

2. K Separate Binary Classifications
• The network has K outputs, each with a logistic-sigmoid activation function
• Associated with each output is a binary class label t_k:
  p(\mathbf{t} \mid x, w) = \prod_{k=1}^{K} y_k(x, w)^{t_k} \, [1 - y_k(x, w)]^{1 - t_k}
• Taking the negative logarithm of the likelihood function gives
  E(w) = -\sum_{n=1}^{N} \sum_{k=1}^{K} \{ t_{nk} \ln y_{nk} + (1 - t_{nk}) \ln(1 - y_{nk}) \}
  where y_{nk} denotes y_k(x_n, w)

3. Multiclass Classification
• Each input is assigned to one of K mutually exclusive classes
• The binary target variables t_k \in \{0, 1\} use a 1-of-K coding scheme
• Network outputs are interpreted as y_k(x, w) = p(t_k = 1 \mid x)
• This leads to the error function
  E(w) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{kn} \ln y_k(x_n, w)
• The output-unit activation function is the softmax
  y_k(x, w) = \frac{\exp(a_k(x, w))}{\sum_j \exp(a_j(x, w))}

Parameter Optimization
• Task: find the weight vector w which minimizes the chosen error function E(w)
• Geometrical picture of the error function: E(w) has a highly nonlinear dependence on the weights

Parameter Optimization: Geometrical View
• E(w) is a surface sitting over weight space
  • w_A: a local minimum; w_B: the global minimum
  • we need to find a minimum
• At a point w_C the local gradient is given by the vector \nabla E(w)
  • it points in the direction of greatest rate of increase of E(w)
  • the negative gradient points in the direction of greatest rate of decrease

Finding w Where E(w) is Smallest
• A small step from w to w + \delta w changes the error function by
  \delta E \approx \delta w^T \nabla E(w)
• A minimum of E(w) occurs where
  \nabla E(w) = 0
• Points at which the gradient vanishes are stationary points: minima, maxima and saddle points
• The error surface is complex, so there is no hope of finding an analytical solution to \nabla E(w) = 0

Iterative Numerical Procedure for Minima
• Since there is no analytical solution, choose an initial w^{(0)} and update it in a succession of steps,
  w^{(\tau+1)} = w^{(\tau)} + \Delta w^{(\tau)}
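A minimal sketch tying the last two ideas together: gradient descent w^{(\tau+1)} = w^{(\tau)} - \eta \nabla E(w^{(\tau)}) applied to the cross-entropy error for a single logistic-sigmoid output y(x, w) = \sigma(w^T x). The synthetic data, learning rate, and step count are illustrative assumptions, not part of the lecture.

```python
import numpy as np

# Sketch: gradient descent on the cross-entropy error for a single
# logistic-sigmoid output y(x, w) = sigma(w^T x).  Data are synthetic.

rng = np.random.default_rng(1)
N, D = 200, 2
X = rng.normal(size=(N, D))
true_w = np.array([2.0, -1.0])                          # used only to generate labels
T = (rng.uniform(size=N) < 1.0 / (1.0 + np.exp(-X @ true_w))).astype(float)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy(w, X, T):
    """E(w) = -sum_n { t_n ln y_n + (1 - t_n) ln(1 - y_n) }"""
    y = sigmoid(X @ w)
    eps = 1e-12                                         # guard against log(0)
    return -np.sum(T * np.log(y + eps) + (1 - T) * np.log(1 - y + eps))

def grad_E(w, X, T):
    """Gradient of the cross-entropy error: sum_n (y_n - t_n) x_n."""
    return X.T @ (sigmoid(X @ w) - T)

w = np.zeros(D)                                         # initial choice w^(0)
eta = 0.01                                              # learning rate (illustrative)
for tau in range(500):
    w = w - eta * grad_E(w, X, T)                       # w^(tau+1) = w^(tau) - eta * grad E

print("final w:", w, " E(w) =", cross_entropy(w, X, T))
```

Note that for this simple convex case gradient descent converges to the global minimum; for the non-convex error surface of a multi-layer network it finds only a local minimum, which is why the choice of w^{(0)} matters.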

