An informal account of BackProp

For each pattern in the training set:
• Compute the error at the output nodes
• Compute $\Delta w$ for each weight in the 2nd layer
• Compute $\delta$ (the generalized error expression) for the hidden units
• Compute $\Delta w$ for each weight in the 1st layer
After amassing $\Delta w$ for all weights, change each weight a little bit, as determined by the learning rate:
$$\Delta w_{ij} = \eta\,\delta_{pi}\,o_{pj}, \qquad p \text{ indexes the training pattern.}$$

Backprop Details

Here we go… (also refer to the web notes for the derivation).
Units are indexed $k \to j \to i$ from input to output, with weights $w_{jk}$ into the hidden layer and $w_{ij}$ into the output layer. The error is
$$E = \tfrac{1}{2}\sum_i (t_i - y_i)^2,$$
where $y_i$ is the output of unit $i$ and $t_i$ is its target.

The output layer

Gradient descent with learning rate $\eta$:
$$\Delta W_{ij} = -\eta\,\frac{\partial E}{\partial W_{ij}}$$
$$\frac{\partial E}{\partial W_{ij}} = \frac{\partial E}{\partial y_i}\,\frac{\partial y_i}{\partial x_i}\,\frac{\partial x_i}{\partial W_{ij}} = -(t_i - y_i)\,f'(x_i)\,y_j$$
The derivative of the sigmoid is just $f'(x_i) = y_i(1 - y_i)$, so
$$\Delta W_{ij} = \eta\,(t_i - y_i)\,y_i(1 - y_i)\,y_j = \eta\,\delta_i\,y_j, \qquad \delta_i = (t_i - y_i)\,y_i(1 - y_i).$$

The hidden layer

$$\Delta W_{jk} = -\eta\,\frac{\partial E}{\partial W_{jk}}, \qquad \frac{\partial E}{\partial W_{jk}} = \frac{\partial E}{\partial y_j}\,\frac{\partial y_j}{\partial x_j}\,\frac{\partial x_j}{\partial W_{jk}}$$
$$\frac{\partial E}{\partial y_j} = \sum_i \frac{\partial E}{\partial y_i}\,\frac{\partial y_i}{\partial x_i}\,\frac{\partial x_i}{\partial y_j} = -\sum_i (t_i - y_i)\,f'(x_i)\,W_{ij}$$
$$\frac{\partial E}{\partial W_{jk}} = -\Big[\sum_i (t_i - y_i)\,f'(x_i)\,W_{ij}\Big]\,f'(x_j)\,y_k$$
$$\Delta W_{jk} = \eta\,\Big[\sum_i (t_i - y_i)\,y_i(1 - y_i)\,W_{ij}\Big]\,y_j(1 - y_j)\,y_k = \eta\,\delta_j\,y_k, \qquad \delta_j = \Big[\sum_i \delta_i\,W_{ij}\Big]\,y_j(1 - y_j).$$

Momentum term

• The speed of learning is governed by the learning rate. If the rate is low, convergence is slow; if the rate is too high, the error oscillates without reaching the minimum.
• Momentum tends to smooth out small fluctuations in the weight error:
$$\Delta w_{ij}(n) = \eta\,\delta_i(n)\,y_j(n) + \alpha\,\Delta w_{ij}(n-1), \qquad 0 \le \alpha < 1.$$
• The momentum accelerates the descent in steady downhill directions.
• The momentum has a stabilizing effect in directions that oscillate in time.

Convergence

• May get stuck in local minima.
• Weights may diverge.
• …but works well in practice.
Representation power:
• 2-layer networks: any continuous function
• 3-layer networks: any function

Local Minimum

Use a random component: simulated annealing.

Overfitting and generalization

Too many hidden nodes tend to overfit.

Overfitting in ANNs

Early Stopping (Important!!!)

Stop training when the error starts going up on the validation set.

Stopping criteria

Sensible stopping criteria:
• Total mean squared error change: backprop is considered to have converged when the absolute rate of change in the average squared error per epoch is sufficiently small (in the range [0.01, 0.1]).
• Generalization-based criterion: after each epoch the network is tested for generalization. If the generalization performance is adequate, stop. If this criterion is used, the part of the training set held out for testing generalization must not be used for updating the weights.
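The update rules above map almost directly onto code. The following is a minimal NumPy sketch, not taken from the slides: the noisy-XOR toy dataset, the layer sizes, the patience-style stopping rule, and all variable names are illustrative assumptions, biases are omitted for brevity, and weights are updated once per epoch (batch mode), as in the informal account above. It ties together the output-layer and hidden-layer delta rules, the momentum term, and early stopping on a validation set.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

# Toy dataset (hypothetical): noisy XOR of two binary inputs; biases omitted for brevity.
X = rng.integers(0, 2, size=(200, 2)).astype(float)
T = (X[:, 0] != X[:, 1]).astype(float).reshape(-1, 1)
X = X + 0.05 * rng.standard_normal(X.shape)
X_train, T_train = X[:150], T[:150]
X_val, T_val = X[150:], T[150:]

n_in, n_hid, n_out = 2, 4, 1
eta, alpha = 0.5, 0.9                                # learning rate and momentum
W_jk = rng.normal(scale=0.5, size=(n_in, n_hid))     # input k -> hidden j
W_ij = rng.normal(scale=0.5, size=(n_hid, n_out))    # hidden j -> output i
dW_jk_prev = np.zeros_like(W_jk)
dW_ij_prev = np.zeros_like(W_ij)

def forward(X, W_jk, W_ij):
    y_j = sigmoid(X @ W_jk)                          # hidden activations
    y_i = sigmoid(y_j @ W_ij)                        # output activations
    return y_j, y_i

best_val, best_weights = np.inf, None
patience, bad_epochs = 20, 0                         # assumed rule: stop after 20 bad epochs

for epoch in range(5000):
    # Forward pass on the training set (batch mode: one update per epoch).
    y_j, y_i = forward(X_train, W_jk, W_ij)

    # Output layer: delta_i = (t_i - y_i) * y_i * (1 - y_i)
    delta_i = (T_train - y_i) * y_i * (1.0 - y_i)
    # Hidden layer: delta_j = (sum_i delta_i * W_ij) * y_j * (1 - y_j)
    delta_j = (delta_i @ W_ij.T) * y_j * (1.0 - y_j)

    # Weight changes with momentum: dW(n) = eta * delta * y + alpha * dW(n-1)
    dW_ij = eta * (y_j.T @ delta_i) + alpha * dW_ij_prev
    dW_jk = eta * (X_train.T @ delta_j) + alpha * dW_jk_prev
    W_ij += dW_ij
    W_jk += dW_jk
    dW_ij_prev, dW_jk_prev = dW_ij, dW_jk

    # Early stopping: watch the squared error on the held-out validation set.
    _, y_val = forward(X_val, W_jk, W_ij)
    val_err = 0.5 * np.sum((T_val - y_val) ** 2)
    if val_err < best_val:
        best_val, best_weights, bad_epochs = val_err, (W_jk.copy(), W_ij.copy()), 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:                   # validation error keeps rising -> stop
            break

W_jk, W_ij = best_weights                            # keep the weights with the lowest validation error
print(f"stopped after epoch {epoch}, best validation error {best_val:.4f}")
```

In per-pattern (online) mode the same delta computations would run inside an inner loop over the training patterns, accumulating or applying $\Delta w$ one pattern at a time.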
Architectural Considerations

• What is the right size network for a given job? How many hidden units?
  • Too many: no generalization.
  • Too few: no solution.
• Possible answer: constructive algorithms, e.g. Cascade Correlation (Fahlman & Lebiere, 1990), etc.
• The number of layers and of neurons depends on the specific task. In practice this issue is solved by trial and error.
• Two types of adaptive algorithms can be used:
  • start from a large network and successively remove nodes and links until performance degrades;
  • begin with a small network and introduce new neurons until performance is satisfactory.

Network Topology

Problems and Networks

• Some problems have natural "good" solutions.
• Solving a problem may be possible by providing the right armory of general-purpose tools and recruiting them as needed.
• Networks are general-purpose tools.
• The choice of network type, training, architecture, etc. greatly influences the chances of successfully solving a problem.
• Tension: tailoring tools for a specific job vs. exploiting a general-purpose learning mechanism.

Summary

• Multiple-layer feed-forward networks.
• Replace the step function with a sigmoid (differentiable) function.
• Learn weights by gradient descent on an error function.
• Backpropagation algorithm for learning.
• Avoid overfitting by early stopping.

ALVINN drives 70 mph on highways

Use MLP Neural Networks when …

• (vector-valued) real inputs, (vector-valued) real outputs
• you're not interested in understanding how it works
• long training times are acceptable
• short execution (prediction) times are required
• robustness to noise in the dataset is required

Applications of FFNN

Classification, pattern recognition: FFNNs can be applied to tackle non-linearly separable learning problems, e.g.
• recognizing printed or handwritten characters,
• face recognition,
• classification of loan applications into credit-worthy and non-credit-worthy groups,
• analysis of sonar and radar signals to determine the nature of the signal source.
Regression and forecasting: FFNNs can be applied to learn non-linear functions (regression) and in particular functions whose input is a sequence of measurements over time (time series).

Extensions of Backprop Nets

• Recurrent architectures
• Backprop through time

Elman Nets & Jordan Nets

• Updating the context as we receive input.
• In Jordan nets we model "forgetting" as well.
• The recurrent connections have fixed weights.
• You can train these networks using good ol' backprop.
(Diagrams: a Jordan net, whose context layer receives the output with weight 1 and decays with weight α, and an Elman net, whose context layer receives a copy of the hidden layer with weight 1; in both, the context feeds the hidden layer together with the input.)

Recurrent Backprop

• We'll pretend to step through the network one iteration at a time.
• Backprop as usual, but average equivalent weights (e.g. all three highlighted edges on the right are equivalent).
(Diagram: a three-unit recurrent network a–b–c with weights w1–w4, unrolled for 3 iterations; the copies of a weight across the unrolled iterations are the equivalent edges.)
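As a rough sketch of the two recurrent extensions above (not from the slides: the layer sizes, weight scales, the decay value α = 0.5, and the function names are illustrative assumptions), the Elman-style context update copies the previous hidden activations, while the Jordan-style context decays by α and accumulates the previous output. The copy/decay connections are fixed, so ordinary backprop can still be applied to the remaining weights at each step, or the network can be unrolled for a few iterations and the gradients of the weight copies averaged, as the Recurrent Backprop slide describes.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
n_in, n_hid, n_out = 3, 5, 2
alpha = 0.5                                             # Jordan-net "forgetting" rate (assumed)

W_in = rng.normal(scale=0.3, size=(n_in, n_hid))        # input -> hidden
W_ctx = rng.normal(scale=0.3, size=(n_hid, n_hid))      # context -> hidden (Elman)
W_ctx_j = rng.normal(scale=0.3, size=(n_out, n_hid))    # context -> hidden (Jordan)
W_out = rng.normal(scale=0.3, size=(n_hid, n_out))      # hidden -> output

def elman_step(x, context):
    """One Elman step: the context is last step's hidden activations, copied with fixed weight 1."""
    hidden = sigmoid(x @ W_in + context @ W_ctx)
    output = sigmoid(hidden @ W_out)
    return output, hidden                               # new context := hidden

def jordan_step(x, context):
    """One Jordan step: the context decays by alpha and accumulates last step's output."""
    hidden = sigmoid(x @ W_in + context @ W_ctx_j)
    output = sigmoid(hidden @ W_out)
    new_context = alpha * context + output              # fixed recurrent weights: alpha and 1
    return output, new_context

# Run a short input sequence through the Elman net.
sequence = rng.standard_normal((4, n_in))
context = np.zeros(n_hid)
for x in sequence:
    y, context = elman_step(x, context)
    print(y)
```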
Connectionist Models in Cognitive Science