Learning with Neural Networks
Reiner [email protected]
. . . and many others whose copyright I shamelessly infringed by including their illustrations in this presentation.

Outline
• Biology: Neurons and the Brain
• Computational Model of the Neuron
• Logical Neuron
• Perceptron
• ADAptive LINear Element (ADALINE)
• Multi-Layer Perceptron
• Error Back-Propagation

Biological Neuron
[Figure: a biological neuron — action potentials (APs) travel along the axon to synapses (excitatory or inhibitory; learning via LTP or LTD); released neurotransmitter produces EPSPs or IPSPs at the dendrites, which the soma integrates into new APs along its own axon.]
• Slow (≈ 10 ms), but MASSIVELY REDUPLICATED (10^11 neurons, 10^14 synapses)

Computational Neuron Model (Node)
[Figure: a node with inputs x_i, weights w_i, a summation Σ, and an activation function F.]
• a = F(Σ_i w_i x_i) = F(w^T x), with the convention x_0 = 1 and w_0 = θ (the threshold)
• F may be a step, linear, piece-wise linear, sigmoid, or Gaussian function

Logical Neuron
McCulloch & Pitts, 1943
• w_{i>0} ∈ {−Φ, 1} (fixed; NO learning), w_0 = θ ∈ {0} ∪ Z−, F = step function
• synchronous updates: a(t+1) = 1 if w^T x(t) ≥ 0, and 0 otherwise
[Figure: example circuits — a FLIP-FLOP with SET and RESET inputs, and a NAND-based circuit computing a_3(t) = a_1(t−2) AND a_2(t−1).]

Model of the Brain?
• N nodes, p(0) ∈ [0, 1] fraction of initially active nodes
• NO external inputs, i.e., x = a
• n = n_e (excitatory) + n_i (inhibitory) incoming connections per node from random (other) nodes
• n_e ≫ n_i, Φ ≫ 0, θ ≈ −1 ⇔ limit cycles . . . a(t), a(t+1), . . . , a(t+k) = a(t) . . .

Perceptron
Rosenblatt, 1958 vs. Minsky & Papert, 1969
• more complex, but subsumed by the following 'modern' perceptron
[Figure: a node with binary inputs x_1, . . . , x_d, real weights, and output a = 1 if w^T x > 0, 0 otherwise.]
• two sets of input patterns X+ = {x+_1, x+_2, . . . }, X− = {x−_1, x−_2, . . . }, X+ ∩ X− = ∅
• GOAL: w (initially random) so that a = 1 if x ∈ X+ and a = 0 if x ∈ X−
• Solution: w_i = w_i + µ x_i (t − a), where t (a) is the target (actual) output for x and µ ∈ ]0, 1] is the learning rate
• guaranteed to find a 'correct' w, provided X+ and X− are linearly separable
• linear separability is very rare for |X| = |X+ ∪ X−| ≫ d + 1: P(|X|, d) = (1 / 2^{|X|−1}) Σ_{i=0}^{d} C(|X|−1, i)
• P(|X| = 4, d = 2) = 14/16 (XOR and EQU are not linearly separable)

ADAptive LINear Element (ADALINE)
Widrow & Hoff, 1960
[Figure: a linear node with real inputs and real weights, a = w^T x; in matrix form, a = W^T x with x ∈ R^m, a ∈ R^n, W ∈ R^{m×n}.]
• paired input and target output patterns (x ∈ R^m, t ∈ R^n)
• GOAL: weight matrix W so that a = W^T x = t for each pair (x, t)
• Solution: Gradient Descent
• Error: E = (1/2) Σ_l (t_l − a_l)^2
• Gradient: ∂E/∂w_{i,j} = (t_i − a_i)(−∂a_i/∂w_{i,j})
• ∂a_i/∂w_{i,j} = ∂(w_i^T x)/∂w_{i,j} = ∂(Σ_l w_{i,l} x_l)/∂w_{i,j} = x_j
• Least Mean Square (LMS) or ∆-rule: w_{i,j} = w_{i,j} − µ ∂E/∂w_{i,j} = · · · = w_{i,j} + µ ∆_i x_j, where ∆_i = t_i − a_i
• finds the (global) minimum of E

Multi-Layer Perceptron
Rosenblatt, 1958 (the original Perceptron had multiple layers)
[Figure: an input–hidden–output network solving a non-linearly-separable problem; each hidden node defines a hyperplane w_{i,1} x_1 + w_{i,2} x_2 + θ_i = 0.]
• input → hidden → output
• computes the function a_O = F_O(W^T F_H(V^T x)), where V (W) contains the incoming weights of the hidden (output) nodes
• used for classification and function approximation
• typically, F(x) = σ(x) = 1 / (1 + e^{−sx}) (sigmoid function)
• if trained via Gradient Descent, F needs to have a derivative F′
• σ′(x) = s σ(x)(1 − σ(x))

Error Back-Propagation
Rumelhart, 1986
• generalization of the LMS or ∆-rule to Multi-Layer Perceptrons
• Error: E = (1/2) Σ_l (t_l − a_l)^2
• Gradient of E with respect to weight w_{i,j} from the jth hidden to the ith output node:
  ∂E/∂w_{i,j} = (t_i − a_i)(−∂a_i/∂w_{i,j}) = −(∂F(w_i^T a_H)/∂w_{i,j})(t_i − a_i) = −F′(w_i^T a_H) a_{H,j} (t_i − a_i),
  so that w_{i,j} = w_{i,j} + µ ∆_i a_{H,j} with ∆_i = F′(w_i^T a_H)(t_i − a_i)
• Gradient of E with respect to weight v_{i,j} from the jth input to the ith hidden node:
  ∂E/∂v_{i,j} = Σ_l (t_l − a_l)(−∂a_l/∂v_{i,j}) = −Σ_l (∂F(w_l^T a_H)/∂v_{i,j})(t_l − a_l)
  = −Σ_l F′(w_l^T a_H)(∂(w_l^T a_H)/∂v_{i,j})(t_l − a_l) = −Σ_l ∆_l ∂(Σ_k w_{l,k} a_{H,k})/∂v_{i,j}
  = −Σ_l ∆_l w_{l,i} ∂a_{H,i}/∂v_{i,j} = −Σ_l ∆_l w_{l,i} ∂F(v_i^T x)/∂v_{i,j} = −Σ_l ∆_l w_{l,i} F′(v_i^T x) ∂(v_i^T x)/∂v_{i,j}
  = −Σ_l ∆_l w_{l,i} F′(v_i^T x) x_j = −∆_i x_j, where ∆_i = F′(v_i^T x) Σ_l w_{l,i} ∆_l,
  so that v_{i,j} = v_{i,j} + µ ∆_i x_j
• NOT guaranteed to find the global minimum of E (E is NOT monotonically decreasing, due to the non-linearities)

Error Back-Propagation
Power
• with enough hidden units, a Multi-Layer Perceptron can approximate any function as precisely as desired
Shortcomings
• premature convergence on local minima
• very, very slow
• representational 'mysticism' (what was learned?)
• many parameters to choose (how many hidden units? µ? F_H? F_O?)
• overfitting ('memorized' training examples; performs poorly on unseen inputs)
'Better' Heuristics
• EBP with Momentum: ∆^(n) w_{i,j} = µ ∆_i^(n) a_{H,j}^(n) + α ∆^(n−1) w_{i,j}
• Resilient EBP (RPROP): ∆w_{i,j} = −∆_{i,j} sign(∂E/∂w_{i,j}), where ∆_{i,j} (w_{i,j}'s individual learning rate) increases (decreases) if sign(∂E/∂w_{i,j}) remains the same (changes) from one time step to the next

References
• McCulloch, W. & Pitts, W. A Logical Calculus of the Ideas Immanent in Nervous Activity. Bulletin of Mathematical Biophysics: 5, pp. 115-137, 1943.
• Rosenblatt, F. The Perceptron: A Probabilistic Model for Information Storage & Organization in the Brain. Psychological Review: 65, pp. 386-408, 1958.
• Minsky, M. & Papert, S. Perceptrons. MIT Press, Cambridge, 1969.
• Widrow, B. & Hoff, M. E. Adaptive Switching Circuits. IRE WESCON Convention Record: 4, pp. 96-104, 1960.
• Rumelhart, D. et al. Learning Representations by Back-Propagating Errors. Nature: 323, pp. 533-536, 1986.
• Riedmiller, M. & Braun, H. A Direct Adaptive Method for Faster Backpropagation Learning: The RPROP Algorithm. Proceedings of the ICNN, San Francisco, pp. 123-134, 1993.
• Sejnowski, T. J. & Rosenberg, C. R. NETtalk: A Parallel Network that Learns to Read Aloud. Technical Report JHU/EECS-86/1, EE and CS Dept., Johns Hopkins University, 1986.
• Pomerleau, D. A. ALVINN: An Autonomous Land Vehicle in a Neural Network. Technical Report CMU-CS-89-107, CS Dept., Carnegie Mellon University, 1989.
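The perceptron learning rule from the slides (w_i ← w_i + µ x_i (t − a), step activation, bias weight w_0 = θ via a constant input) can be exercised directly. A minimal sketch in Python with NumPy — the function name, hyper-parameter values, and the AND example are illustrative choices, not from the slides:

```python
import numpy as np

def train_perceptron(X, t, mu=0.1, epochs=100, seed=0):
    """Perceptron rule: w <- w + mu * x * (t - a), step activation.

    X: (N, d) binary input patterns; t: (N,) targets in {0, 1}.
    A constant 1 is appended to each pattern so that w[-1] plays
    the role of the threshold weight w0 = theta.
    """
    rng = np.random.default_rng(seed)
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # add bias input x0 = 1
    w = rng.normal(size=Xb.shape[1])               # initially random weights
    for _ in range(epochs):
        mistakes = 0
        for x, target in zip(Xb, t):
            a = 1 if w @ x > 0 else 0              # step activation
            if a != target:
                w += mu * x * (target - a)
                mistakes += 1
        if mistakes == 0:                          # all patterns correct
            break
    return w

# AND is linearly separable, so convergence is guaranteed
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
t = np.array([0, 0, 0, 1])
w = train_perceptron(X, t)
preds = [1 if w @ np.append(x, 1.0) > 0 else 0 for x in X]
print(preds)  # [0, 0, 0, 1]
```

Trying the same code with t = [0, 1, 1, 0] (XOR) never converges, illustrating the linear-separability condition.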
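The ADALINE ∆-rule (w_{i,j} ← w_{i,j} + µ ∆_i x_j with ∆_i = t_i − a_i) is a few lines of NumPy. A sketch under the slides' assumptions (linear activation a = W^T x, squared error); the function name, learning rate, and the recoverable target map W_true are illustrative:

```python
import numpy as np

def train_adaline(X, T, mu=0.01, epochs=200, seed=0):
    """LMS / Delta-rule: w_ij <- w_ij + mu * (t_i - a_i) * x_j.

    X: (N, m) real inputs, T: (N, n) real targets; learns W (m x n)
    such that a = W^T x approximates t for every pair (x, t).
    """
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(X.shape[1], T.shape[1]))
    for _ in range(epochs):
        for x, t in zip(X, T):
            a = W.T @ x                    # linear activation
            delta = t - a                  # Delta_i = t_i - a_i
            W += mu * np.outer(x, delta)   # gradient-descent step on E
    return W

# learn a known linear map t = W_true^T x; E is quadratic in W,
# so gradient descent reaches the global minimum
rng = np.random.default_rng(1)
W_true = np.array([[1.0, -2.0], [0.5, 3.0]])
X = rng.uniform(-1, 1, size=(50, 2))
T = X @ W_true
W = train_adaline(X, T)
print(np.round(W, 2))  # close to W_true
```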
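The back-propagation gradients derived on the slides can be checked numerically: the analytic ∂E/∂W and ∂E/∂V from the ∆ terms should agree with central finite differences of E. A self-contained sketch with F_H = F_O = sigmoid (s = 1); all function names and the layer sizes are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(V, W, x):
    """a_O = F_O(W^T F_H(V^T x)), with F_H = F_O = sigmoid."""
    aH = sigmoid(V.T @ x)
    aO = sigmoid(W.T @ aH)
    return aH, aO

def error(V, W, x, t):
    """E = 1/2 * sum_l (t_l - a_l)^2"""
    _, aO = forward(V, W, x)
    return 0.5 * np.sum((t - aO) ** 2)

def backprop(V, W, x, t):
    """Return (dE/dV, dE/dW) using the slides' Delta terms."""
    aH, aO = forward(V, W, x)
    deltaO = aO * (1 - aO) * (t - aO)      # Delta_i = F'(w_i^T aH)(t_i - a_i)
    deltaH = aH * (1 - aH) * (W @ deltaO)  # Delta_i = F'(v_i^T x) sum_l w_{l,i} Delta_l
    # dE/dw_{i,j} = -Delta_i * aH_j ;  dE/dv_{i,j} = -Delta_i * x_j
    return -np.outer(x, deltaH), -np.outer(aH, deltaO)

def numeric_grad(f, M, eps=1e-6):
    """Central finite differences of scalar f() with respect to matrix M."""
    G = np.zeros_like(M)
    for idx in np.ndindex(M.shape):
        old = M[idx]
        M[idx] = old + eps; Ep = f()
        M[idx] = old - eps; Em = f()
        M[idx] = old
        G[idx] = (Ep - Em) / (2 * eps)
    return G

rng = np.random.default_rng(0)
V = rng.normal(size=(3, 4))   # 3 inputs -> 4 hidden nodes
W = rng.normal(size=(4, 2))   # 4 hidden -> 2 output nodes
x = rng.normal(size=3)
t = rng.uniform(size=2)
dV, dW = backprop(V, W, x, t)
assert np.allclose(dV, numeric_grad(lambda: error(V, W, x, t), V), atol=1e-7)
assert np.allclose(dW, numeric_grad(lambda: error(V, W, x, t), W), atol=1e-7)
```

A gradient descent step is then V -= mu * dV; W -= mu * dW, matching the update rules v_{i,j} += µ ∆_i x_j and w_{i,j} += µ ∆_i a_{H,j}.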
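The RPROP rule in the 'Better' Heuristics bullet adapts a separate step size ∆_{i,j} per weight from the sign history of ∂E/∂w_{i,j}. A simplified sketch (no weight back-tracking; the increase/decrease factors 1.2 and 0.5 and the step bounds are the commonly cited defaults, and the quadratic toy objective is an illustrative choice):

```python
import numpy as np

def rprop_step(w, grad, step, prev_sign,
               eta_plus=1.2, eta_minus=0.5, step_min=1e-6, step_max=50.0):
    """One RPROP update: Delta w = -step * sign(dE/dw).

    step grows by eta_plus while sign(dE/dw) is unchanged and
    shrinks by eta_minus when it flips (i.e., a minimum was jumped over).
    """
    sign = np.sign(grad)
    same = sign * prev_sign
    step = np.where(same > 0, np.minimum(step * eta_plus, step_max), step)
    step = np.where(same < 0, np.maximum(step * eta_minus, step_min), step)
    w = w - step * sign
    return w, step, sign

# minimize E(w) = 1/2 ||w||^2, so dE/dw = w
w = np.array([5.0, -3.0])
step = np.full_like(w, 0.1)
prev = np.zeros_like(w)
for _ in range(100):
    w, step, prev = rprop_step(w, w, step, prev)
print(w)  # close to the minimum at 0
```

Because only the sign of the gradient is used, RPROP is insensitive to the magnitude of ∂E/∂w, which addresses the "very, very slow" shortcoming on plateaus where gradients are tiny.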