Rutgers University CS 536 - Artificial Neural Networks

Chapter 4: Artificial Neural Networks
CS 536: Machine Learning
Littman (Wu, TA)

Administration
• iCML-03: instructional Conference on Machine Learning
  http://www.cs.rutgers.edu/~mlittman/courses/ml03/iCML03/
• Weka assignment
  http://www.cs.rutgers.edu/~mlittman/courses/ml03/hw1.pdf

Artificial Neural Networks
[Read Ch. 4]
[Review exercises 4.1, 4.2, 4.5, 4.9, 4.11]
• Threshold units
• Gradient descent
• Multilayer networks
• Backpropagation
• Hidden layer representations
• Example: Face Recognition
• Advanced topics

Connectionist Models
Consider humans:
• Neuron switching time ~ .001 second
• Number of neurons ~ 10^10
• Connections per neuron ~ 10^4 to 10^5
• Scene recognition time ~ .1 second
• 100 inference steps doesn't seem like enough
→ much parallel computation

Artificial Networks
Properties of artificial neural nets (ANNs):
• Many neuron-like threshold switching units
• Many weighted interconnections among units
• Highly parallel, distributed process
• Emphasis on tuning weights automatically

When to Consider ANNs
• Input is high-dimensional discrete or real-valued (e.g., raw sensor input)
• Output is discrete or real-valued
• Output is a vector of values
• Possibly noisy data
• Form of target function is unknown
• Human readability of result is unimportant

ANNs: Example Uses
• Speech phoneme recognition [Waibel]
• Image classification [Kanade, Baluja, Rowley]
• Financial prediction
• Backgammon [Tesauro]

ALVINN drives on highways
[Figure omitted from the text preview.]

Perceptron
o(x_1, …, x_n) = 1 if w_0 + w_1 x_1 + … + w_n x_n > 0, and -1 otherwise.
Or, more succinctly: o(x) = sgn(w ⋅ x)

Perceptron Decision Surface
A single unit can represent some useful functions
• What weights represent g(x_1, x_2) = AND(x_1, x_2)?
But some functions are not representable
• e.g., those that are not linearly separable
• Therefore, we'll want networks of these...

Perceptron Training Rule
w_i ← w_i + Δw_i, where Δw_i = η (t - o) x_i
Where:
• t = c(x) is the target value
• o is the perceptron output
• η is a small constant (e.g., .1) called the learning rate (or step size)

Perceptron Training Rule (continued)
Can prove it will converge
• if the training data is linearly separable
• and η is sufficiently small

Gradient Descent
To understand, consider a simpler linear unit, where
o = w_0 + w_1 x_1 + … + w_n x_n
Let's learn the w_i's that minimize the squared error
E[w] ≡ 1/2 Σ_{d ∈ D} (t_d - o_d)^2
where D is the set of training examples.

Error Surface
[Figure omitted from the text preview.]

Gradient Descent
Gradient: ∇E[w] = [∂E/∂w_0, ∂E/∂w_1, …, ∂E/∂w_n]
Training rule: Δw = -η ∇E[w]
In other words: Δw_i = -η ∂E/∂w_i

Gradient of Error
∂E/∂w_i = ∂/∂w_i 1/2 Σ_d (t_d - o_d)^2
= 1/2 Σ_d ∂/∂w_i (t_d - o_d)^2
= 1/2 Σ_d 2 (t_d - o_d) ∂/∂w_i (t_d - o_d)
= Σ_d (t_d - o_d) ∂/∂w_i (t_d - w ⋅ x_d)
= Σ_d (t_d - o_d) (-x_{i,d})

Gradient Descent Code
GRADIENT-DESCENT(training_examples, η)
Each training example is a pair of the form <x, t>, where x is the vector of input values and t is the target output value. η is the learning rate (e.g., .05).
• Initialize each w_i to some small random value
• Until the termination condition is met, Do
  – Initialize each Δw_i to zero
  – For each <x, t> in training_examples, Do
    • Input the instance x to the unit and compute the output o
    • For each linear unit weight w_i, Do
      Δw_i ← Δw_i + η (t - o) x_i
  – For each linear unit weight w_i, Do
    w_i ← w_i + Δw_i

Summary
The perceptron training rule will succeed if
• the training examples are linearly separable
• and the learning rate η is sufficiently small
Linear unit training uses gradient descent
• Guaranteed to converge to the hypothesis with minimum squared error
• Given a sufficiently small learning rate η
• Even when the training data contains noise
• Even when the training data is not separable by H
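The GRADIENT-DESCENT procedure above maps almost line for line onto code. Here is a minimal Python sketch, assuming NumPy, a fixed number of epochs as the termination condition, and a bias weight w_0 handled by prepending a constant x_0 = 1 input; the function name train_linear_unit, the default settings, and the synthetic data in the usage example are illustrative choices, not anything given in the slides. The default learning rate is smaller than the slide's example value of .05 because the update here sums the gradient over all of D.

import numpy as np

def train_linear_unit(X, t, eta=0.01, epochs=500):
    # Batch gradient descent for a single linear unit (the GRADIENT-DESCENT procedure above).
    # X: (m, n) array of input vectors; t: (m,) array of target values; eta: learning rate.
    m, n = X.shape
    X = np.hstack([np.ones((m, 1)), X])        # prepend x_0 = 1 so that w_0 acts as the bias
    w = np.random.uniform(-0.05, 0.05, n + 1)  # initialize each w_i to a small random value
    for _ in range(epochs):                    # fixed epoch count stands in for the termination test
        o = X @ w                              # linear unit output o_d = w . x_d for every example
        delta_w = eta * ((t - o) @ X)          # Delta w_i = eta * sum_d (t_d - o_d) x_{i,d}
        w = w + delta_w                        # apply the accumulated update once per pass (batch mode)
    return w

# Hypothetical usage: noisy data from a known linear target.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
t = 2.0 + X @ np.array([1.0, -3.0, 0.5]) + 0.1 * rng.normal(size=50)
print(train_linear_unit(X, t))  # should end up close to [2.0, 1.0, -3.0, 0.5]

Because the Δw_i are accumulated over the whole training set before the weights change, this is the batch mode; applying the update inside the inner loop instead gives the incremental mode discussed next.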
Stochastic Gradient Descent
Batch mode gradient descent:
Do until satisfied
1. Compute the gradient ∇E_D[w]
2. w ← w - η ∇E_D[w]
Incremental mode gradient descent:
Do until satisfied
• For each training example d in D
  1. Compute the gradient ∇E_d[w]
  2. w ← w - η ∇E_d[w]

More Stochastic Grad. Desc.
E_D[w] ≡ 1/2 Σ_{d ∈ D} (t_d - o_d)^2
E_d[w] ≡ 1/2 (t_d - o_d)^2
Incremental gradient descent can approximate batch gradient descent arbitrarily closely if η is set small enough.

Multilayer Networks
[Figure omitted from the text preview.]

Decision Boundaries
[Figure omitted from the text preview.]

Sigmoid Units
σ(x) is the sigmoid (s-like) function 1/(1 + e^-x); a sigmoid unit computes o = σ(net), where net = w ⋅ x.

Derivatives of Sigmoids
Nice property: dσ(x)/dx = σ(x) (1 - σ(x))
We can derive gradient descent rules to train
• one sigmoid unit
• multilayer networks of sigmoid units → Backpropagation

Error Gradient for Sigmoid
∂E/∂w_i = ∂/∂w_i 1/2 Σ_d (t_d - o_d)^2
= 1/2 Σ_d ∂/∂w_i (t_d - o_d)^2
= 1/2 Σ_d 2 (t_d - o_d) ∂/∂w_i (t_d - o_d)
= Σ_d (t_d - o_d) (-∂o_d/∂w_i)
= -Σ_d (t_d - o_d) (∂o_d/∂net_d) (∂net_d/∂w_i)

Even more…
But we know:
∂o_d/∂net_d = ∂σ(net_d)/∂net_d = o_d (1 - o_d)
∂net_d/∂w_i = ∂(w ⋅ x_d)/∂w_i = x_{i,d}
So:
∂E/∂w_i = -Σ_d (t_d - o_d) o_d (1 - o_d) x_{i,d}

Backpropagation Algorithm
Initialize all weights to small random numbers.
Until satisfied, Do
• For each training example, Do
  1. Input the training example to the network and compute the network outputs
  2. For each output unit k
     δ_k = o_k (1 - o_k) (t_k - o_k)
  3. For each hidden unit h
     δ_h = o_h (1 - o_h) Σ_{k ∈ outputs} w_{h,k} δ_k
  4. Update each network weight w_{i,j}
     w_{i,j} ← w_{i,j} + Δw_{i,j}, where Δw_{i,j} = η δ_j x_{i,j}

More on Backpropagation
• Gradient descent over the entire network weight vector
• Easily generalized to arbitrary directed graphs
• Will find a local, not necessarily global, error minimum
  – In practice, often works well (can run multiple times)

More on Backpropagation (continued)
• Often include weight momentum α:
  Δw_{i,j}(n) = η δ_j x_{i,j} + α Δw_{i,j}(n-1)
• Minimizes error over training examples
  – Will it generalize well to subsequent examples?
• Training can take thousands of iterations → slow!
• Using the network after training is very fast
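The backpropagation algorithm and the momentum rule above can be sketched compactly for a network with a single hidden layer of sigmoid units. The following Python sketch is a minimal illustration, assuming NumPy, squared-error targets in (0, 1), per-example (incremental) updates, and a bias handled only as a constant x_0 = 1 column on the inputs; the function name backprop_epoch and its argument layout are assumptions made for this sketch, not anything defined in the slides.

import numpy as np

def sigmoid(x):
    # sigma(x) = 1 / (1 + e^(-x)); note d sigma/dx = sigma(x) (1 - sigma(x))
    return 1.0 / (1.0 + np.exp(-x))

def backprop_epoch(X, T, W_hid, W_out, eta=0.1, alpha=0.0, prev=None):
    # One incremental-mode pass of backpropagation over the training set.
    # X: (m, n) inputs (first column all 1s for the bias); T: (m, k) targets.
    # W_hid: (n, h) input-to-hidden weights; W_out: (h, k) hidden-to-output weights.
    # alpha: momentum coefficient; prev: the previous weight changes, used for momentum.
    if prev is None:
        prev = (np.zeros_like(W_hid), np.zeros_like(W_out))
    dW_hid_prev, dW_out_prev = prev
    for x, t in zip(X, T):
        # 1. Forward pass: compute hidden and output activations.
        o_hid = sigmoid(x @ W_hid)                      # o_h = sigma(net_h)
        o_out = sigmoid(o_hid @ W_out)                  # o_k = sigma(net_k)
        # 2. Output-unit deltas: delta_k = o_k (1 - o_k) (t_k - o_k)
        delta_out = o_out * (1.0 - o_out) * (t - o_out)
        # 3. Hidden-unit deltas: delta_h = o_h (1 - o_h) sum_k w_{h,k} delta_k
        delta_hid = o_hid * (1.0 - o_hid) * (W_out @ delta_out)
        # 4. Weight updates with momentum:
        #    Delta w_{i,j}(n) = eta * delta_j * x_{i,j} + alpha * Delta w_{i,j}(n-1)
        dW_out = eta * np.outer(o_hid, delta_out) + alpha * dW_out_prev
        dW_hid = eta * np.outer(x, delta_hid) + alpha * dW_hid_prev
        W_out += dW_out
        W_hid += dW_hid
        dW_hid_prev, dW_out_prev = dW_hid, dW_out
    return W_hid, W_out, (dW_hid_prev, dW_out_prev)

Each call makes one pass over the training data; running it for many passes reflects the slide's point that training can take thousands of iterations, while using the trained network afterward needs only the two matrix products of the forward pass.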

