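Connectionist Models: Backprop
Jerome Feldman
CS182/CogSci110/Ling109
Spring 2008

Hebb’s rule is not sufficient

What happens if the neural circuit fires perfectly, but the result is very bad for the animal, like eating something sickening? A pure invocation of Hebb’s rule would strengthen all participating connections, which can’t be good. On the other hand, it isn’t right to weaken all the active connections involved; much of the activity was just recognizing the situation – we would like to change only those connections that led to the wrong decision. No one knows how to specify a learning rule that will change exactly the offending connections when an error occurs. Computer systems, and presumably nature as well, rely upon statistical learning rules that tend to make the right changes over time. More in later lectures.

Hebb’s rule is insufficient

Should you “punish” all the connections?
[Figure labels: tastebud; tastes rotten; eats food; gets sick; drinks water]

Models of Learning

  Hebbian – coincidence
  Supervised – correction (backprop)
  Recruitment – one-trial
  Reinforcement Learning – delayed reward
  Unsupervised – similarity

Abstract Neuron

[Figure: a single unit with inputs i_1, i_2, ..., i_n, a constant input i_0 = 1, weights w_0, w_1, w_2, ..., w_n, and output y]

$net = \sum_{i=0}^{n} w_i i_i$

Threshold Activation Function:
$y = \begin{cases} 1 & \text{if } net > 0 \\ 0 & \text{otherwise} \end{cases}$

Boolean XOR

  input x1   input x2   output
     0          0          0
     0          1          1
     1          0          1
     1          1          0

[Figure: a network of threshold units computing XOR. Inputs x1 and x2 feed hidden units h1 and h2, which feed the output unit o. Units are labeled AND (threshold 1.5), OR (threshold 0.5), and XOR (threshold 0.5); the weight labels shown are 1.]

Supervised Learning - Backprop

How do we train the weights of the network?
Basic Concepts
  Use a continuous, differentiable activation function (Sigmoid)
  Use the idea of gradient descent on the error surface
  Extend to multiple layers

Backprop

To learn on data which is not linearly separable:
  Build multiple layer networks (hidden layer)
  Use a sigmoid squashing function instead of a step function.

Tasks

  Unconstrained pattern classification
    Credit assessment
    Digit Classification
    Speech Recognition
  Function approximation
    Learning control
    Stock prediction

Sigmoid Squashing Function

[Figure: a single unit with inputs y_1, y_2, ..., y_n, a constant input y_0 = 1, weights w_0, w_1, w_2, ..., w_n, and one output]

$net = \sum_{i=0}^{n} w_i y_i$

$y = \frac{1}{1 + e^{-net}}$

The Sigmoid Function

[Figure: sigmoid curve with x = net on the horizontal axis and y = a on the vertical axis, annotated with Output = 0, Output = 1, and “Sensitivity to input”]

Gradient Descent

Gradient Descent on an error

Learning Rule – Gradient Descent on a Root Mean Square (RMS)

Learn the w_i’s that minimize the squared error

$E[\vec{w}] = \frac{1}{2} \sum_{k \in O} (t_k - o_k)^2$

O = output layer

Gradient Descent

Gradient:
$\nabla E[\vec{w}] = \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots, \frac{\partial E}{\partial w_n} \right]$

Training rule:
$\Delta \vec{w} = -\eta \nabla E[\vec{w}]$

that is,
$\Delta w_i = -\eta \frac{\partial E}{\partial w_i}$

with
$E[\vec{w}] = \frac{1}{2} \sum_{k \in O} (t_k - o_k)^2$

Gradient Descent

[Figure: error surface over axes i1 and i2, annotated “global minimum: this is your goal”]
It should be 4-D (3 weights) but you get the idea.

Backpropagation Algorithm

Generalization to multiple layers and multiple output units

Backprop Details

Here we go…

[Figure: three layers of units k, j, i; weights w_jk connect layer k to layer j and weights w_ij connect layer j to layer i; y_i is the output of unit i and t_i is its target]

$E = \text{Error} = \frac{1}{2} \sum_i (t_i - y_i)^2$

The output layer

$W_{ij} \leftarrow W_{ij} - \eta \frac{\partial E}{\partial W_{ij}}$, so $\Delta W_{ij} = -\eta \frac{\partial E}{\partial W_{ij}}$, where $\eta$ is the learning rate.

$\frac{\partial E}{\partial W_{ij}} = \frac{\partial E}{\partial y_i} \frac{\partial y_i}{\partial x_i} \frac{\partial x_i}{\partial W_{ij}} = -(t_i - y_i) f'(x_i) y_j$

Here x_i is the net input to unit i and y_i = f(x_i).

The derivative of the sigmoid is just $y_i (1 - y_i)$ (a nice property of sigmoids), so

$\Delta W_{ij} = \eta (t_i - y_i) y_i (1 - y_i) y_j$

$\Delta W_{ij} = \eta y_j \delta_i$, with $\delta_i = (t_i - y_i) y_i (1 - y_i)$

The hidden layer

$\Delta W_{jk} = -\eta \frac{\partial E}{\partial W_{jk}}$

$\frac{\partial E}{\partial W_{jk}} = \frac{\partial E}{\partial y_j} \frac{\partial y_j}{\partial x_j} \frac{\partial x_j}{\partial W_{jk}}$

$\frac{\partial E}{\partial y_j} = \sum_i \frac{\partial E}{\partial y_i} \frac{\partial y_i}{\partial x_i} \frac{\partial x_i}{\partial y_j} = -\sum_i (t_i - y_i) f'(x_i) W_{ij}$

$\frac{\partial E}{\partial W_{jk}} = -\left[ \sum_i (t_i - y_i) f'(x_i) W_{ij} \right] f'(x_j) y_k$

$\Delta W_{jk} = \eta \left[ \sum_i (t_i - y_i) y_i (1 - y_i) W_{ij} \right] y_j (1 - y_j) y_k$

$\Delta W_{jk} = \eta y_k \delta_j$, with $\delta_j = \left[ \sum_i (t_i - y_i) y_i (1 - y_i) W_{ij} \right] y_j (1 - y_j) = \left[ \sum_i W_{ij} \delta_i \right] y_j (1 - y_j)$

Let’s just do an example

$E = \text{Error} = \frac{1}{2} \sum_i (t_i - y_i)^2$

[Figure: a single sigmoid unit with inputs i1 and i2, a bias input b = 1, weights w01, w02, and w0b, net input x0, activation function f, and output y0]

With a single output unit, $E = \frac{1}{2} (t_0 - y_0)^2$.

Training data:

  i1   i2   y0
  0    0    0
  0    1    1
  1    0    1
  1    1    1

Initial weights: w01 = 0.8, w02 = 0.6, w0b = 0.5.

For the first pattern (i1 = 0, i2 = 0, target t0 = 0):

$x_0 = 0.8 \cdot 0 + 0.6 \cdot 0 + 0.5 \cdot 1 = 0.5$

$y_0 = \frac{1}{1 + e^{-0.5}} = 0.6224$

$E = \frac{1}{2} (0 - 0.6224)^2 = 0.1937$

Using $\Delta W_{ij} = \eta y_j \delta_i$ with $\delta_i = (t_i - y_i) y_i (1 - y_i)$:

$\delta_0 = (t_0 - y_0) y_0 (1 - y_0) = (0 - 0.6224)(0.6224)(1 - 0.6224) = -0.1463$

$\Delta W_{01} = \eta i_1 \delta_0 = 0$
$\Delta W_{02} = \eta i_2 \delta_0 = 0$
$\Delta W_{0b} = \eta b \delta_0$

Suppose the learning rate is $\eta = 0.5$:

$\Delta W_{0b} = 0.5 \cdot 1 \cdot (-0.1463) = -0.0731$

The bias weight becomes w0b = 0.5 − 0.0731 = 0.4269; w01 and w02 are unchanged.

An informal account of BackProp

For each pattern in the training set:
  Compute the error at the output nodes
  Compute Δw for each weight in the 2nd layer
  Compute delta (generalized error expression) for hidden units
  Compute Δw for each weight in the 1st layer
After amassing Δw for all weights, change each weight a little bit, as determined by the learning rate:

$\Delta w_{ij} = \eta \delta_{pi} o_{pj}$

Backpropagation Algorithm

Initialize all weights to small random numbers
For each training example do
  For each hidden unit h: $y_h = \sigma\left( \sum_i w_{ih} x_i \right)$
  For each output unit k: $y_k = \sigma\left( \sum_h w_{hk} x_h \right)$
  For each output unit k: $\delta_k = y_k (1 - y_k)(t_k - y_k)$
  For each hidden unit h: $\delta_h = y_h (1 - y_h) \sum_k w_{hk} \delta_k$
  Update each network weight w_ij: $w_{ij} \leftarrow w_{ij} + \Delta w_{ij}$ with $\Delta w_{ij} = \eta \delta_j x_{ij}$

On this slide w_ij is the weight on the connection from unit i to unit j, and x_ij is the input that unit j receives from unit i (so x_h above is the hidden activation y_h). This is the reverse of the index order used for W_ij in the derivation slides.

Backpropagation Algorithm

[Figure: network diagram labeled “activations” and “errors”]

Momentum term

The speed of learning is governed by the learning rate.
  If the rate is low, convergence is slow.
  If the rate is too high, the error oscillates without reaching the minimum.
Momentum tends to smooth small weight error fluctuations.

$\Delta w_{ij}(n) = \alpha \Delta w_{ij}(n-1) + \eta \delta_j(n) y_i(n)$, with $0 \le \alpha < 1$

The momentum accelerates the descent in steady downhill directions.
The momentum has a stabilizing effect in directions that oscillate in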
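
The momentum rule above, Δw_ij(n) = α Δw_ij(n−1) + η δ_j(n) y_i(n), can be tried out on a single weight. What follows is a minimal sketch in plain Python, not code from the lecture: the function names `momentum_step` and `run`, the default values of η and α, and the two artificial error sequences (one constant, one flipping sign on every step) are all assumptions chosen only to illustrate the two effects the slide describes.

```python
def momentum_step(w, prev_dw, delta_j, y_i, eta, alpha):
    """Apply one momentum update to a single weight (hypothetical helper).

    dw(n) = alpha * dw(n-1) + eta * delta_j(n) * y_i(n)
    Returns the new weight and the step dw(n) that was taken.
    """
    dw = alpha * prev_dw + eta * delta_j * y_i
    return w + dw, dw


def run(deltas, alpha, eta=0.1, y_i=1.0):
    """Feed a sequence of error terms delta_j(n) through the update rule.

    The weight and the previous step both start at zero. The input
    activation y_i is held constant, which is an assumption made to keep
    the illustration simple.
    """
    w, dw = 0.0, 0.0
    for delta_j in deltas:
        w, dw = momentum_step(w, dw, delta_j, y_i, eta, alpha)
    return w, dw


if __name__ == "__main__":
    steady = [1.0] * 200                              # error term always has the same sign
    flipping = [(-1.0) ** n for n in range(200)]      # error term alternates in sign

    for name, deltas in (("steady", steady), ("flipping", flipping)):
        _, dw_plain = run(deltas, alpha=0.0)          # no momentum
        _, dw_mom = run(deltas, alpha=0.9)            # with momentum
        print(name, "last step without momentum:", dw_plain)
        print(name, "last step with momentum:   ", dw_mom)
```

Under these assumptions the step size settles toward η/(1 − α) when the error term keeps the same sign, which is larger than the plain step η, and toward η/(1 + α) in magnitude when the error term alternates, which is smaller than η.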
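
The single-unit example worked through earlier (weights 0.8, 0.6, and 0.5, inputs i1 = i2 = 0, bias input 1, target 0, learning rate 0.5) can be checked numerically. The following is a minimal sketch in plain Python; the function name `single_unit_update` and its argument layout are hypothetical, and the update it applies is the output-layer rule from the derivation, ΔW = η · input · δ with δ = (t − y) y (1 − y).

```python
import math


def sigmoid(x):
    """Logistic squashing function 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))


def single_unit_update(weights, inputs, target, eta):
    """One gradient-descent step for a single sigmoid output unit.

    weights and inputs are parallel lists; the bias is treated as a
    weight whose input is fixed at 1. Returns the output y, the error
    E = 1/2 (t - y)^2, the error term delta, and the updated weights.
    """
    net = sum(w * x for w, x in zip(weights, inputs))
    y = sigmoid(net)
    error = 0.5 * (target - y) ** 2
    delta = (target - y) * y * (1.0 - y)
    new_weights = [w + eta * delta * x for w, x in zip(weights, inputs)]
    return y, error, delta, new_weights


if __name__ == "__main__":
    # (w01, w02, w0b) and (i1, i2, b) for the first training pattern
    y, error, delta, new_w = single_unit_update(
        weights=[0.8, 0.6, 0.5], inputs=[0.0, 0.0, 1.0], target=0.0, eta=0.5
    )
    print("y0 =", y)
    print("E =", error)
    print("delta0 =", delta)
    print("updated weights =", new_w)
```

Because both inputs are 0 for this pattern, only the bias weight moves; the two input weights would start to change on the patterns where i1 or i2 is 1.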