Rutgers University CS 536 - Artificial Neural Networks

Chapter 4: Artificial Neural Networks
CS 536: Machine Learning
Littman (Wu, TA)

Administration
• iCML-03: instructional Conference on Machine Learning
  http://www.cs.rutgers.edu/~mlittman/courses/ml03/iCML03/
• Weka assignment
  http://www.cs.rutgers.edu/~mlittman/courses/ml03/hw1.pdf

Artificial Neural Networks
[Read Ch. 4]
[Review exercises 4.1, 4.2, 4.5, 4.9, 4.11]
• Threshold units
• Gradient descent
• Multilayer networks
• Backpropagation
• Hidden layer representations
• Example: Face Recognition
• Advanced topics

Connectionist Models
Consider humans:
• Neuron switching time ~ .001 second
• Number of neurons ~ 10^10
• Connections per neuron ~ 10^4 to 10^5
• Scene recognition time ~ .1 second
• 100 inference steps doesn't seem like enough
→ much parallel computation

Artificial Networks
Properties of artificial neural nets (ANNs):
• Many neuron-like threshold switching units
• Many weighted interconnections among units
• Highly parallel, distributed process
• Emphasis on tuning weights automatically

When to Consider ANNs
• Input is high-dimensional discrete or real-valued (e.g., raw sensor input)
• Output is discrete or real-valued
• Output is a vector of values
• Possibly noisy data
• Form of target function is unknown
• Human readability of result is unimportant

ANNs: Example Uses
• Speech phoneme recognition [Waibel]
• Image classification [Kanade, Baluja, Rowley]
• Financial prediction
• Backgammon [Tesauro]

ALVINN drives on highways
[Figure omitted from the text preview.]

Perceptron
o(x_1, …, x_n) = 1 if w_0 + w_1 x_1 + … + w_n x_n > 0, and -1 otherwise.
Or, more succinctly: o(x) = sgn(w ⋅ x)

Perceptron Decision Surface
A single unit can represent some useful functions
• What weights represent g(x_1, x_2) = AND(x_1, x_2)?
But some functions are not representable
• e.g., those that are not linearly separable
• Therefore, we'll want networks of these...

Perceptron Training Rule
w_i ← w_i + Δw_i, where Δw_i = η (t - o) x_i
Where:
• t = c(x) is the target value
• o is the perceptron output
• η is a small constant (e.g., .1) called the learning rate (or step size)

Perceptron Training Rule (continued)
Can prove it will converge
• if the training data is linearly separable
• and η is sufficiently small

Gradient Descent
To understand, consider a simpler linear unit, where
o = w_0 + w_1 x_1 + … + w_n x_n
Let's learn the w_i's that minimize the squared error
E[w] ≡ 1/2 Σ_{d ∈ D} (t_d - o_d)^2
where D is the set of training examples.

Error Surface
[Figure omitted from the text preview.]

Gradient Descent
Gradient: ∇E[w] = [∂E/∂w_0, ∂E/∂w_1, …, ∂E/∂w_n]
Training rule: Δw = -η ∇E[w]
In other words: Δw_i = -η ∂E/∂w_i

Gradient of Error
∂E/∂w_i = ∂/∂w_i 1/2 Σ_d (t_d - o_d)^2
= 1/2 Σ_d ∂/∂w_i (t_d - o_d)^2
= 1/2 Σ_d 2 (t_d - o_d) ∂/∂w_i (t_d - o_d)
= Σ_d (t_d - o_d) ∂/∂w_i (t_d - w ⋅ x_d)
= Σ_d (t_d - o_d) (-x_{i,d})

Gradient Descent Code
GRADIENT-DESCENT(training_examples, η)
Each training example is a pair of the form <x, t>, where x is the vector of input values and t is the target output value. η is the learning rate (e.g., .05).
• Initialize each w_i to some small random value
• Until the termination condition is met, Do
  – Initialize each Δw_i to zero
  – For each <x, t> in training_examples, Do
    • Input the instance x to the unit and compute the output o
    • For each linear unit weight w_i, Do
      Δw_i ← Δw_i + η (t - o) x_i
  – For each linear unit weight w_i, Do
    w_i ← w_i + Δw_i

Summary
The perceptron training rule will succeed if
• the training examples are linearly separable
• and the learning rate η is sufficiently small
Linear unit training uses gradient descent
• Guaranteed to converge to the hypothesis with minimum squared error
• Given a sufficiently small learning rate η
• Even when the training data contains noise
• Even when the training data is not separable by H
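The GRADIENT-DESCENT procedure above maps almost line for line onto code. Here is a minimal Python sketch, assuming NumPy, a fixed number of epochs as the termination condition, and a bias weight w_0 handled by prepending a constant x_0 = 1 input; the function name train_linear_unit, the default settings, and the synthetic data in the usage example are illustrative choices, not anything given in the slides. The default learning rate is smaller than the slide's example value of .05 because the update here sums the gradient over all of D.

import numpy as np

def train_linear_unit(X, t, eta=0.01, epochs=500):
    # Batch gradient descent for a single linear unit (the GRADIENT-DESCENT procedure above).
    # X: (m, n) array of input vectors; t: (m,) array of target values; eta: learning rate.
    m, n = X.shape
    X = np.hstack([np.ones((m, 1)), X])        # prepend x_0 = 1 so that w_0 acts as the bias
    w = np.random.uniform(-0.05, 0.05, n + 1)  # initialize each w_i to a small random value
    for _ in range(epochs):                    # fixed epoch count stands in for the termination test
        o = X @ w                              # linear unit output o_d = w . x_d for every example
        delta_w = eta * ((t - o) @ X)          # Delta w_i = eta * sum_d (t_d - o_d) x_{i,d}
        w = w + delta_w                        # apply the accumulated update once per pass (batch mode)
    return w

# Hypothetical usage: noisy data from a known linear target.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
t = 2.0 + X @ np.array([1.0, -3.0, 0.5]) + 0.1 * rng.normal(size=50)
print(train_linear_unit(X, t))  # should end up close to [2.0, 1.0, -3.0, 0.5]

Because the Δw_i are accumulated over the whole training set before the weights change, this is the batch mode; applying the update inside the inner loop instead gives the incremental mode discussed next.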
Stochastic Gradient Descent
Batch mode gradient descent:
Do until satisfied
1. Compute the gradient ∇E_D[w]
2. w ← w - η ∇E_D[w]
Incremental mode gradient descent:
Do until satisfied
• For each training example d in D
  1. Compute the gradient ∇E_d[w]
  2. w ← w - η ∇E_d[w]

More Stochastic Grad. Desc.
E_D[w] ≡ 1/2 Σ_{d ∈ D} (t_d - o_d)^2
E_d[w] ≡ 1/2 (t_d - o_d)^2
Incremental gradient descent can approximate batch gradient descent arbitrarily closely if η is set small enough.

Multilayer Networks
[Figure omitted from the text preview.]

Decision Boundaries
[Figure omitted from the text preview.]

Sigmoid Units
σ(x) is the sigmoid (s-like) function 1/(1 + e^-x); a sigmoid unit computes o = σ(net), where net = w ⋅ x.

Derivatives of Sigmoids
Nice property: dσ(x)/dx = σ(x) (1 - σ(x))
We can derive gradient descent rules to train
• one sigmoid unit
• multilayer networks of sigmoid units → Backpropagation

Error Gradient for Sigmoid
∂E/∂w_i = ∂/∂w_i 1/2 Σ_d (t_d - o_d)^2
= 1/2 Σ_d ∂/∂w_i (t_d - o_d)^2
= 1/2 Σ_d 2 (t_d - o_d) ∂/∂w_i (t_d - o_d)
= Σ_d (t_d - o_d) (-∂o_d/∂w_i)
= -Σ_d (t_d - o_d) (∂o_d/∂net_d) (∂net_d/∂w_i)

Even more…
But we know:
∂o_d/∂net_d = ∂σ(net_d)/∂net_d = o_d (1 - o_d)
∂net_d/∂w_i = ∂(w ⋅ x_d)/∂w_i = x_{i,d}
So:
∂E/∂w_i = -Σ_d (t_d - o_d) o_d (1 - o_d) x_{i,d}

Backpropagation Algorithm
Initialize all weights to small random numbers.
Until satisfied, Do
• For each training example, Do
  1. Input the training example to the network and compute the network outputs
  2. For each output unit k
     δ_k = o_k (1 - o_k) (t_k - o_k)
  3. For each hidden unit h
     δ_h = o_h (1 - o_h) Σ_{k ∈ outputs} w_{h,k} δ_k
  4. Update each network weight w_{i,j}
     w_{i,j} ← w_{i,j} + Δw_{i,j}, where Δw_{i,j} = η δ_j x_{i,j}

More on Backpropagation
• Gradient descent over the entire network weight vector
• Easily generalized to arbitrary directed graphs
• Will find a local, not necessarily global, error minimum
  – In practice, often works well (can run multiple times)

More on Backpropagation (continued)
• Often include weight momentum α:
  Δw_{i,j}(n) = η δ_j x_{i,j} + α Δw_{i,j}(n-1)
• Minimizes error over training examples
  – Will it generalize well to subsequent examples?
• Training can take thousands of iterations → slow!
• Using the network after training is very fast
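The backpropagation algorithm and the momentum rule above can be sketched compactly for a network with a single hidden layer of sigmoid units. The following Python sketch is a minimal illustration, assuming NumPy, squared-error targets in (0, 1), per-example (incremental) updates, and a bias handled only as a constant x_0 = 1 column on the inputs; the function name backprop_epoch and its argument layout are assumptions made for this sketch, not anything defined in the slides.

import numpy as np

def sigmoid(x):
    # sigma(x) = 1 / (1 + e^(-x)); note d sigma/dx = sigma(x) (1 - sigma(x))
    return 1.0 / (1.0 + np.exp(-x))

def backprop_epoch(X, T, W_hid, W_out, eta=0.1, alpha=0.0, prev=None):
    # One incremental-mode pass of backpropagation over the training set.
    # X: (m, n) inputs (first column all 1s for the bias); T: (m, k) targets.
    # W_hid: (n, h) input-to-hidden weights; W_out: (h, k) hidden-to-output weights.
    # alpha: momentum coefficient; prev: the previous weight changes, used for momentum.
    if prev is None:
        prev = (np.zeros_like(W_hid), np.zeros_like(W_out))
    dW_hid_prev, dW_out_prev = prev
    for x, t in zip(X, T):
        # 1. Forward pass: compute hidden and output activations.
        o_hid = sigmoid(x @ W_hid)                      # o_h = sigma(net_h)
        o_out = sigmoid(o_hid @ W_out)                  # o_k = sigma(net_k)
        # 2. Output-unit deltas: delta_k = o_k (1 - o_k) (t_k - o_k)
        delta_out = o_out * (1.0 - o_out) * (t - o_out)
        # 3. Hidden-unit deltas: delta_h = o_h (1 - o_h) sum_k w_{h,k} delta_k
        delta_hid = o_hid * (1.0 - o_hid) * (W_out @ delta_out)
        # 4. Weight updates with momentum:
        #    Delta w_{i,j}(n) = eta * delta_j * x_{i,j} + alpha * Delta w_{i,j}(n-1)
        dW_out = eta * np.outer(o_hid, delta_out) + alpha * dW_out_prev
        dW_hid = eta * np.outer(x, delta_hid) + alpha * dW_hid_prev
        W_out += dW_out
        W_hid += dW_hid
        dW_hid_prev, dW_out_prev = dW_hid, dW_out
    return W_hid, W_out, (dW_hid_prev, dW_out_prev)

Each call makes one pass over the training data; running it for many passes reflects the slide's point that training can take thousands of iterations, while using the trained network afterward needs only the two matrix products of the forward pass.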

