Berkeley COMPSCI 182 - Lecture Notes

Connectionist Models: Backprop
Jerome Feldman
CS182/CogSci110/Ling109
Spring 2008

Slides: Hebb's rule is not sufficient; Hebb's rule is insufficient; Models of Learning; Abstract Neuron; Boolean XOR; Supervised Learning – Backprop; Backprop; Tasks; Sigmoid Squashing Function; The Sigmoid Function; Gradient Descent; Gradient Descent on an error; Learning Rule – Gradient Descent on Root Mean Square (RMS) Error; Backpropagation Algorithm; Backprop Details; The output layer; The hidden layer; Let's just do an example; An informal account of BackProp; Momentum term; Convergence; Pattern Separation and NN architecture; Overfitting and generalization; Stopping criteria; Overfitting in ANNs; Summary; ALVINN drives 70mph on highways; Use MLP Neural Networks when …; Applications of FFNN; Extensions of Backprop Nets; Elman Nets & Jordan Nets; Recurrent Backprop

Hebb's rule is not sufficient
What happens if the neural circuit fires perfectly, but the result is very bad for the animal, like eating something sickening? A pure invocation of Hebb's rule would strengthen all participating connections, which can't be good. On the other hand, it isn't right to weaken all the active connections involved; much of the activity was just recognizing the situation – we would like to change only those connections that led to the wrong decision. No one knows how to specify a learning rule that will change exactly the offending connections when an error occurs. Computer systems, and presumably nature as well, rely upon statistical learning rules that tend to make the right changes over time. More in later lectures.

Hebb's rule is insufficient
Should you "punish" all the connections?
[Figure: a circuit in which a tastebud tastes something rotten, the animal eats the food and drinks water, and then gets sick.]

Models of Learning
- Hebbian – coincidence
- Supervised – correction (backprop)
- Recruitment – one-trial
- Reinforcement Learning – delayed reward
- Unsupervised – similarity

Abstract Neuron
[Figure: a single unit with inputs i_0 = 1, i_1, ..., i_n, weights w_0, w_1, ..., w_n, and output y.]
net = ∑_{i=0}^{n} w_i i_i
Threshold activation function: y = 1 if net > 0, and 0 otherwise.

Boolean XOR
input x1   input x2   output
0          0          0
0          1          1
1          0          1
1          1          0
[Figure: a two-layer network of threshold units computing XOR. Hidden unit h1 is an OR gate (weights 1, 1; threshold 0.5), hidden unit h2 is an AND gate (weights 1, 1; threshold 1.5), and the output unit o (threshold 0.5) fires when the OR unit is on but the AND unit is off.]

Supervised Learning – Backprop
How do we train the weights of the network?
Basic concepts:
- Use a continuous, differentiable activation function (sigmoid)
- Use the idea of gradient descent on the error surface
- Extend to multiple layers

Backprop
To learn on data which is not linearly separable:
- Build multiple-layer networks (hidden layer)
- Use a sigmoid squashing function instead of a step function

Tasks
- Unconstrained pattern classification
- Credit assessment
- Digit classification
- Speech recognition
- Function approximation
- Learning control
- Stock prediction

Sigmoid Squashing Function
[Figure: a single unit with inputs y_0 = 1, y_1, ..., y_n, weights w_0, w_1, ..., w_n, and output y.]
net = ∑_{i=0}^{n} w_i y_i
y = 1 / (1 + e^(–net))
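To make the abstract neuron, the XOR construction, and the sigmoid unit concrete, here is a small Python sketch. It is my own illustration rather than code from the lecture; the function names are arbitrary, and the output-unit weights (+1 from the OR unit, -1 from the AND unit) are one standard choice, since the slide's figure is only partially recoverable.

```python
# Sketch (not from the slides): a threshold unit, the two-layer XOR network,
# and the sigmoid unit used by backprop. Names and output weights are illustrative.
import math

def threshold_unit(inputs, weights, threshold):
    """Abstract neuron with a step activation: fires iff the weighted sum exceeds the threshold."""
    net = sum(w * x for w, x in zip(weights, inputs))
    return 1 if net > threshold else 0

def xor_net(x1, x2):
    """XOR built from threshold units: OR and AND hidden units, output fires when OR is on and AND is off."""
    h_or  = threshold_unit([x1, x2], [1, 1], 0.5)        # OR gate
    h_and = threshold_unit([x1, x2], [1, 1], 1.5)        # AND gate
    return threshold_unit([h_or, h_and], [1, -1], 0.5)   # assumed +1 / -1 output weights

def sigmoid_unit(inputs, weights):
    """Sigmoid squashing function: y = 1 / (1 + e^-net), with net = sum_i w_i y_i."""
    net = sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-net))

if __name__ == "__main__":
    for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print(a, b, "->", xor_net(a, b))   # prints the XOR truth table
```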
The Sigmoid Function
[Figure (three sigmoid plots, axes x = net and y = a): for large negative net the output saturates near 0, for large positive net it saturates near 1, and the unit is most sensitive to its input in the steep region in between.]

Gradient Descent
Gradient Descent on an error
[Figures.]

Learning Rule – Gradient Descent on Root Mean Square (RMS) Error
Learn the w_i's that minimize the squared error
E[w] = ½ ∑_{k ∈ O} (t_k – o_k)^2,   where O = output layer.

Gradient Descent
Gradient: ∇E[w] = [∂E/∂w_0, ∂E/∂w_1, ..., ∂E/∂w_n]
Training rule: Δw = –η ∇E[w], with E[w] = ½ ∑_{k ∈ O} (t_k – o_k)^2.

Gradient Descent
[Figure: the error surface over i1 and i2, with the global minimum marked: this is your goal. It should really be 4-D (3 weights), but you get the idea.]

Backpropagation Algorithm
Generalization to multiple layers and multiple output units.

Backprop Details
Here we go…
[Figure: three layers k → j → i, with weights w_jk from layer k into layer j and w_ij from layer j into output layer i.]
E = Error = ½ ∑_i (t_i – y_i)^2, where y_i is the output and t_i is the target.

The output layer
ΔW_ij = –η ∂E/∂W_ij,  then  W_ij ← W_ij + ΔW_ij   (η = learning rate)
∂E/∂W_ij = (∂E/∂y_i)(∂y_i/∂x_i)(∂x_i/∂W_ij) = –(t_i – y_i) f'(x_i) y_j
Nice property of sigmoids: the derivative of the sigmoid is just y_i (1 – y_i), so
ΔW_ij = η (t_i – y_i) y_i (1 – y_i) y_j = η δ_i y_j,  where δ_i = (t_i – y_i) y_i (1 – y_i).

The hidden layer
ΔW_jk = –η ∂E/∂W_jk
∂E/∂W_jk = (∂E/∂y_j)(∂y_j/∂x_j)(∂x_j/∂W_jk)
∂E/∂y_j = ∑_i (∂E/∂y_i)(∂y_i/∂x_i)(∂x_i/∂y_j) = –∑_i (t_i – y_i) f'(x_i) W_ij
∂E/∂W_jk = –[∑_i (t_i – y_i) f'(x_i) W_ij] f'(x_j) y_k
ΔW_jk = η [∑_i (t_i – y_i) y_i (1 – y_i) W_ij] y_j (1 – y_j) y_k = η δ_j y_k,
where δ_j = [∑_i δ_i W_ij] y_j (1 – y_j).

Let's just do an example
A single sigmoid unit with inputs i_1 and i_2, a bias input b = 1, weights w_01 = 0.8, w_02 = 0.6, w_0b = 0.5, and output y_0; E = ½ (t_0 – y_0)^2. The target function is:
i1   i2   y0
0    0    0
0    1    1
1    0    1
1    1    1
For the input (i_1, i_2) = (0, 0): net = 0.8·0 + 0.6·0 + 0.5·1 = 0.5, so y_0 = 1/(1 + e^(–0.5)) = 0.6224 and E = ½ (0 – 0.6224)^2 = 0.1937.
Using ΔW_ij = η δ_i y_j with δ_i = (t_i – y_i) y_i (1 – y_i):
δ_0 = (t_0 – y_0) y_0 (1 – y_0) = (0 – 0.6224)(0.6224)(1 – 0.6224) = –0.1463
ΔW_01 = η δ_0 i_1 = 0,  ΔW_02 = η δ_0 i_2 = 0,  ΔW_0b = η δ_0 b.
Suppose η = 0.5. Then ΔW_0b = 0.5 × (–0.1463) × 1 ≈ –0.0731, and the new bias weight is W_0b ≈ 0.4268.

An informal account of BackProp
For each pattern in the training set:
- Compute the error at the output nodes
- Compute Δw for each weight in the 2nd layer
- Compute delta (the generalized error expression) for the hidden units
- Compute Δw for each weight in the 1st layer
After amassing Δw for all weights (and all patterns), change each weight a little bit, as determined by the learning rate: Δw_ij = η δ_jp o_ip, where p indexes the training pattern.

Backpropagation Algorithm
- Initialize all weights to small random numbers.
- For each training example, do:
  - For each hidden unit h: y_h = σ(∑_i w_hi x_i)    (forward pass: "activations")
  - For each output unit k: y_k = σ(∑_h w_kh x_h)
  - For each output unit k: δ_k = y_k (1 – y_k)(t_k – y_k)    (backward pass: "errors")
  - For each hidden unit h: δ_h = y_h (1 – y_h) ∑_k w_kh δ_k
  - Update each network weight w_ij: w_ij ← w_ij + Δw_ij, with Δw_ij = η δ_j x_ij.

Momentum term
The speed of learning is governed by the learning rate η:
- If the rate is low, convergence is slow.
- If the rate is too high, the error oscillates without reaching the minimum.
Momentum tends to smooth out small weight-error fluctuations:
Δw_ij(n) = η δ_j(n) y_i(n) + α Δw_ij(n – 1),   0 ≤ α < 1
- The momentum accelerates the descent in steady downhill directions.
- The momentum has a stabilizing effect in directions that oscillate in sign.
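As a sanity check on the update rules and the worked example above, here is a short Python sketch (my own illustration, not code from the lecture) of one backprop step for a single sigmoid unit, together with the hidden-unit delta from the hidden-layer slide. Names such as backprop_step_single_unit and hidden_delta are assumptions of mine; the printed numbers match the slide's example up to rounding.

```python
# Sketch of the update rules above: delta_i = (t_i - y_i) y_i (1 - y_i) and
# Delta W_ij = eta * delta_i * y_j, applied to the worked example's single unit.
import math

def sigmoid(net):
    """The sigmoid squashing function y = 1 / (1 + e^-net)."""
    return 1.0 / (1.0 + math.exp(-net))

def backprop_step_single_unit(weights, inputs, target, eta):
    """One gradient-descent step for a single sigmoid output unit.

    `weights` and `inputs` include the bias term (bias input fixed at 1).
    Returns the updated weights, the output y, and the error E = 1/2 (t - y)^2.
    """
    net = sum(w * x for w, x in zip(weights, inputs))
    y = sigmoid(net)
    error = 0.5 * (target - y) ** 2
    delta = (target - y) * y * (1.0 - y)          # delta_i = (t_i - y_i) y_i (1 - y_i)
    new_weights = [w + eta * delta * x            # W_ij <- W_ij + eta * delta_i * y_j
                   for w, x in zip(weights, inputs)]
    return new_weights, y, error

def hidden_delta(y_h, downstream_weights, downstream_deltas):
    """delta_h = y_h (1 - y_h) sum_k w_kh delta_k, as in the hidden-layer slide."""
    return y_h * (1.0 - y_h) * sum(w * d for w, d in zip(downstream_weights, downstream_deltas))

if __name__ == "__main__":
    weights = [0.8, 0.6, 0.5]      # w_01, w_02, w_0b
    inputs = [0.0, 0.0, 1.0]       # i_1 = 0, i_2 = 0, bias b = 1
    new_w, y, err = backprop_step_single_unit(weights, inputs, target=0.0, eta=0.5)
    print(f"y0 = {y:.4f}, E = {err:.4f}")                     # y0 = 0.6225 (slide truncates to 0.6224), E = 0.1937
    print("updated weights:", [round(w, 4) for w in new_w])   # bias weight ~0.4269 (slide rounds to 0.4268)
```

Adding momentum, per the momentum slide, would only change the weight update to include an extra term alpha times the previous step's Delta w.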

