
CS 536: Machine Learning
Artificial Neural Networks
Fall 2005
Ahmed Elgammal
Dept. of Computer Science, Rutgers University

Neural Networks
Biological motivation: the brain
• Networks of processing units (neurons) with connections (synapses) between them
• Large number of neurons: 10^11
• Large connectivity: each neuron is connected to, on average, 10^4 others
• Switching time: about 10^-3 second
• Parallel processing
• Distributed computation/memory
• Processing is done by the neurons; the memory is in the synapses
• Robust to noise and failures

Neural Networks
Characteristics of biological computation:
• Massive parallelism
• Locality of computation
• Adaptive (self-organizing)
• Representation is distributed

Understanding the Brain
• Levels of analysis (Marr, 1982):
  1. Computational theory
  2. Representation and algorithm
  3. Hardware implementation
• Reverse engineering: from hardware to theory
• Parallel processing: SIMD vs. MIMD
  – A neural net is SIMD with modifiable local memory
  – Learning: update by training/experience

• ALVINN system, Pomerleau (1993)
• Many successful examples:
  – Speech phoneme recognition [Waibel]
  – Image classification [Kanade, Baluja, Rowley]
  – Financial prediction
  – Backgammon [Tesauro]

When to Use an ANN
• Input is high-dimensional, discrete or real-valued (e.g., raw sensor input); inputs can be highly correlated or independent
• Output is discrete or real-valued
• Output is a vector of values
• Possibly noisy data; the data may contain errors
• The form of the target function is unknown
• Long training times are acceptable
• Fast evaluation of the target function is required
• Human readability of the learned target function is unimportant
⇒ An ANN is much like a black box

Perceptron (Rosenblatt, 1962)
y = Σ_{j=1..d} w_j x_j + w_0 = w^T x
where w = [w_0, w_1, …, w_d]^T and x = [1, x_1, …, x_d]^T

Perceptron
Or, more succinctly: o(x) = sgn(w · x)
Activation rule: linear threshold (step unit)

What a Perceptron Does
The 1-dimensional case:
• Regression: y = wx + w_0
• Classification: y = 1(wx + w_0 > 0)

Perceptron Decision Surface
The perceptron acts as a hyperplane decision surface in the n-dimensional input space: it outputs 1 for instances lying on one side of the hyperplane and -1 for instances on the other side. Data that can be separated by a hyperplane are called linearly separable.

Perceptron Decision Surface
A single unit can represent some useful functions.
• What weights represent g(x1, x2) = AND(x1, x2)? (One possible choice is sketched below.)
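As a minimal sketch of an answer (assuming 0/1 inputs and the sgn step activation above; the specific weight values and the helper name perceptron are illustrative, not taken from the notes):

import numpy as np

def perceptron(x, w):
    """Threshold unit o(x) = sgn(w . x); x is augmented with a constant 1 so w[0] acts as the bias w_0."""
    x_aug = np.concatenate(([1.0], x))
    return 1 if np.dot(w, x_aug) > 0 else -1

# Illustrative weights for AND(x1, x2): w_0 = -1.5, w_1 = w_2 = 1.
# The weighted sum is positive only when both inputs are 1.
w_and = np.array([-1.5, 1.0, 1.0])
for x1 in (0, 1):
    for x2 in (0, 1):
        print((x1, x2), perceptron(np.array([x1, x2]), w_and))
# Prints -1 for (0,0), (0,1), (1,0) and +1 for (1,1).

Under the same assumptions, changing w_0 to -0.5 gives OR, and setting all input weights to 1 with w_0 = -n/2 gives an n-input majority function.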
Majority and OR are also representable. But some functions are not representable:
• e.g., functions that are not linearly separable
• Therefore, we'll want networks of these units...

Perceptron Training Rule
w_i ← w_i + ∆w_i, where ∆w_i = η (t - o) x_i
Here:
• t = c(x) is the target value
• o is the perceptron output
• η is a small constant (e.g., 0.1) called the learning rate (or step size)

Perceptron Training Rule
One can prove that the rule converges:
• if the training data are linearly separable
• and η is sufficiently small
Perceptron Convergence Theorem (Rosenblatt): if the data are linearly separable, then the perceptron learning algorithm converges in finite time.

Gradient Descent – Delta Rule
Also known as the LMS (least mean squares) rule or the Widrow-Hoff rule.
To understand it, consider a simpler linear unit, where
  o = w_0 + w_1 x_1 + … + w_n x_n
Let's learn the w_i's that minimize the squared error
  E[w] ≡ 1/2 Σ_{d∈D} (t_d - o_d)^2
where D is the set of training examples.

Error Surface

Gradient Descent
Gradient: ∇E[w] = [∂E/∂w_0, ∂E/∂w_1, …, ∂E/∂w_n]
When interpreted as a vector in weight space, the gradient specifies the direction that produces the steepest increase in E.
Training rule: ∆w = -η ∇E[w], in other words ∆w_i = -η ∂E/∂w_i
This results in the following update rule: ∆w_i = η Σ_d (t_d - o_d) x_{i,d}

Gradient of the Error
∂E/∂w_i = ∂/∂w_i 1/2 Σ_d (t_d - o_d)^2
        = 1/2 Σ_d ∂/∂w_i (t_d - o_d)^2
        = 1/2 Σ_d 2 (t_d - o_d) ∂/∂w_i (t_d - o_d)
        = Σ_d (t_d - o_d) ∂/∂w_i (t_d - w · x_d)
        = Σ_d (t_d - o_d) (-x_{i,d})
Learning rule: ∆w_i = -η ∂E/∂w_i ⇒ ∆w_i = η Σ_d (t_d - o_d) x_{i,d}

Stochastic Gradient Descent
Batch-mode gradient descent: do until satisfied:
  1. Compute the gradient ∇E_D[w]
  2. w ← w - η ∇E_D[w]
Incremental-mode gradient descent: do until satisfied:
  For each training example d in D:
  1. Compute the gradient ∇E_d[w]
  2. w ← w - η ∇E_d[w]

More on Stochastic Gradient Descent
E_D[w] ≡ 1/2 Σ_{d∈D} (t_d - o_d)^2
E_d[w] ≡ 1/2 (t_d - o_d)^2
Incremental gradient descent can approximate batch gradient descent arbitrarily closely if η is set small enough.
Incremental learning rule: ∆w_i = η (t_d - o_d) x_{i,d}
Delta rule: ∆w_i = η (t - o) x_i, with δ = (t - o)

Gradient Descent Code
GRADIENT-DESCENT(training_examples, η)
Each training example is a pair of the form <x, t>, where x is the vector of input values and t is the target output value; η is the learning rate (e.g., 0.05).
• Initialize each w_i to some small random value
• Until the termination condition is met, do:
  – Initialize each ∆w_i to zero
  – For each <x, t> in training_examples, do:
    • Input the instance x to the unit and compute the output o
    • For each linear-unit weight w_i, do: ∆w_i ← ∆w_i + η (t - o) x_i
  – For each linear-unit weight w_i, do: w_i ← w_i + ∆w_i
(A runnable sketch of this procedure is given at the end of these notes.)

Summary
The perceptron training rule will succeed if:
• the training examples are linearly separable
• the learning rate η is sufficiently small
Linear-unit training uses gradient descent (the delta rule):
• guaranteed to converge to the hypothesis with minimum squared error
• given a sufficiently small learning rate η
• even when the training data contain noise
• even when the training data are not separable by H

K Outputs
Regression: y_i = Σ_{j=1..d} w_ij x_j + w_i0, i.e., y = Wx, a linear map from R^d to R^K
Classification: o_i = w_i^T x, y_i = exp(o_i) / Σ_k exp(o_k); choose C_i if y_i = max_k y_k
(A sketch of this classification rule also appears at the end of these notes.)

Training
• Online (instances seen one by one) vs. batch (whole sample) learning:
  – No …
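As a companion to the GRADIENT-DESCENT pseudocode above, here is a minimal NumPy sketch of batch delta-rule training for a single linear unit. The toy data, the function name gradient_descent, and the smaller default learning rate are illustrative choices (the slide's example rate of 0.05 can diverge when the per-example updates are summed over a batch of this size):

import numpy as np

def gradient_descent(X, t, eta=0.001, n_iters=5000):
    """Batch delta-rule (LMS) training of a linear unit o = w . x.

    X: (N, d) array of inputs, t: (N,) array of targets.
    A constant-1 column is prepended to X so that w[0] plays the role of w_0.
    """
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])
    w = np.random.uniform(-0.05, 0.05, X_aug.shape[1])  # initialize each w_i to a small random value
    for _ in range(n_iters):                             # "until the termination condition is met"
        o = X_aug @ w                                    # linear-unit outputs for every example in D
        delta_w = eta * X_aug.T @ (t - o)                # accumulate eta * (t - o) * x_i over all d in D
        w += delta_w                                     # w_i <- w_i + delta_w_i
    return w

# Toy usage: recover the (illustrative) linear target t = 2 + 3*x1 - 1*x2.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
t = 2 + 3 * X[:, 0] - 1 * X[:, 1]
print(gradient_descent(X, t))  # approximately [ 2.  3. -1.]

The incremental (stochastic) variant from the slides would instead apply the weight update inside the loop over examples, one <x, t> pair at a time.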


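Finally, a minimal sketch of the K-output classification rule above (softmax over the K linear outputs, then choosing the class C_i with the largest y_i). The weight matrix and input below are placeholder values, not from the notes:

import numpy as np

def k_output_classify(W, x):
    """o_i = w_i . x, y_i = exp(o_i) / sum_k exp(o_k); return the index of the largest y_i and all y."""
    o = W @ x                      # K linear outputs; W is K x (d+1), x is the augmented input [1, x_1, ..., x_d]
    y = np.exp(o - o.max())        # subtracting the max improves numerical stability without changing the argmax
    y /= y.sum()
    return int(np.argmax(y)), y

# Placeholder weights for K = 3 classes and d = 2 inputs (the first column holds each w_i0).
W = np.array([[ 0.1,  1.0, -1.0],
              [-0.2,  0.5,  0.5],
              [ 0.3, -1.0,  1.0]])
x = np.array([1.0, 0.8, -0.3])
print(k_output_classify(W, x))     # chooses class 0 for this input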