Toronto CSC 2515 - Backpropagation Lecture Notes


CSC2515 Fall 2007 Introduction to Machine Learning
Lecture 4: Backpropagation

Contents:
Why we need backpropagation
Learning by perturbing weights
The idea behind backpropagation
A difference in notation
Non-linear neurons with smooth derivatives
Sketch of the backpropagation algorithm on a single training case
The derivatives
Some Success Stories
Overview of the applications in this lecture
An example of relational information
Another way to express the same information
A relational learning task
The structure of the neural net
How to show the weights of hidden units
The features it learned for person 1
What the network learns
Another way to see that it works
Why this is interesting
A basic problem in speech recognition
The standard "trigram" method
Why the trigram model is silly
Bengio's neural net for predicting the next word
2-D display of some of the 100-D feature vectors learned by another language model
Applying backpropagation to shape recognition
The invariance problem
Le Net
The replicated feature approach
The architecture of LeNet5
Backpropagation with weight constraints
Combining the outputs of replicated features
Slides 32-41
The 82 errors made by LeNet5
Slide 43
Recurrent networks
An advantage of modeling sequential data
The equivalence between layered, feedforward nets and recurrent nets
Backpropagation through time
Teaching signals for recurrent networks
A good problem for a recurrent network
The algorithm for binary addition
A recurrent net for binary addition
The connectivity of the network
Slide 53
Preventing overfitting by early stopping
Why early stopping works
Full Bayesian Learning
How to deal with the fact that the space of all possible parameter vectors is huge
One method for sampling weight vectors
An amazing fact

All lecture slides will be available as .ppt, .ps, & .htm at www.cs.toronto.edu/~hinton
Many of the figures are provided by Chris Bishop from his textbook "Pattern Recognition and Machine Learning".

Why we need backpropagation
• Networks without hidden units are very limited in the input-output mappings they can model.
  – More layers of linear units do not help: the result is still linear.
  – Fixed output non-linearities are not enough.
• We need multiple layers of adaptive non-linear hidden units. This gives us a universal approximator. But how can we train such nets?
  – We need an efficient way of adapting all the weights, not just the last layer. This is hard. Learning the weights going into hidden units is equivalent to learning features.
  – Nobody is telling us directly what the hidden units should do.

Learning by perturbing weights
• Randomly perturb one weight and see if it improves performance. If so, save the change.
  – Very inefficient: we need to do multiple forward passes on a representative set of training data just to change one weight.
  – Towards the end of learning, large weight perturbations will nearly always make things worse.
• We could randomly perturb all the weights in parallel and correlate the performance gain with the weight changes.
  – No better, because we need lots of trials to "see" the effect of changing one weight through the noise created by all the others.
• Learning the hidden-to-output weights is easy; learning the input-to-hidden weights is hard.
  [Figure: a network with input units, hidden units, and output units]

The idea behind backpropagation
• We don't know what the hidden units ought to do, but we can compute how fast the error changes as we change a hidden activity.
  – Instead of using desired activities to train the hidden units, use error derivatives with respect to hidden activities.
  – Each hidden activity can affect many output units and can therefore have many separate effects on the error. These effects must be combined.
  – We can compute error derivatives for all the hidden units efficiently.
  – Once we have the error derivatives for the hidden activities, it is easy to get the error derivatives for the weights going into a hidden unit.

A difference in notation
• For networks with multiple hidden layers, Bishop uses an explicit extra index to denote the layer.
• The lecture notes use a simpler notation in which the typical index of a unit denotes its layer implicitly: y is used for the output of a unit in any layer, x is the summed input to a unit in any layer, and the index indicates which layer a unit is in.
  [Figure: unit i in one layer sends its output y_i to unit j in the layer above, which has summed input x_j and output y_j]

Non-linear neurons with smooth derivatives
• For backpropagation, we need neurons that have well-behaved derivatives.
  – Typically they use the logistic function.
  – The output is a smooth function of the inputs and the weights.

  x_j = b_j + \sum_i y_i w_{ij}
  y_j = \frac{1}{1 + e^{-x_j}}
  \frac{\partial x_j}{\partial w_{ij}} = y_i, \qquad \frac{\partial x_j}{\partial y_i} = w_{ij}
  \frac{d y_j}{d x_j} = y_j (1 - y_j)

  [Figure: the logistic function y_j plotted against x_j, rising smoothly from 0 through 0.5 toward 1]

Sketch of the backpropagation algorithm on a single training case
• First convert the discrepancy between each output and its target value into an error derivative:

  E = \frac{1}{2} \sum_{j \in \text{output}} (d_j - y_j)^2, \qquad \frac{\partial E}{\partial y_j} = -(d_j - y_j)

• Then compute error derivatives in each hidden layer from error derivatives in the layer above.
• Then use error derivatives with respect to activities to get error derivatives with respect to the weights.
  [Figure: the output derivatives \partial E / \partial y_j are propagated back to give \partial E / \partial y_i in the layer below]

The derivatives

  \frac{\partial E}{\partial x_j} = \frac{d y_j}{d x_j} \frac{\partial E}{\partial y_j} = y_j (1 - y_j) \frac{\partial E}{\partial y_j}
  \frac{\partial E}{\partial w_{ij}} = \frac{\partial x_j}{\partial w_{ij}} \frac{\partial E}{\partial x_j} = y_i \frac{\partial E}{\partial x_j}
  \frac{\partial E}{\partial y_i} = \sum_j \frac{\partial x_j}{\partial y_i} \frac{\partial E}{\partial x_j} = \sum_j w_{ij} \frac{\partial E}{\partial x_j}

  [Figure: the same unit diagram, with y_i feeding unit j through weight w_{ij}]
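The three derivative formulas above are all that is needed to run backpropagation on one training case. Below is a minimal NumPy sketch of a single forward and backward pass through a network with one logistic hidden layer and logistic output units; the layer sizes, random initialization, target values, and learning rate are illustrative assumptions, not values from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def logistic(x):
    # y = 1 / (1 + e^(-x))
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative sizes: 3 inputs, 4 hidden units, 2 outputs (assumptions).
n_in, n_hid, n_out = 3, 4, 2
W1 = rng.normal(scale=0.1, size=(n_in, n_hid));  b1 = np.zeros(n_hid)
W2 = rng.normal(scale=0.1, size=(n_hid, n_out)); b2 = np.zeros(n_out)

y0 = rng.normal(size=n_in)        # input activities for one training case
d  = np.array([0.0, 1.0])         # target values for the output units

# Forward pass: x_j = b_j + sum_i y_i w_ij,  y_j = logistic(x_j)
x1 = b1 + y0 @ W1; y1 = logistic(x1)      # hidden layer
x2 = b2 + y1 @ W2; y2 = logistic(x2)      # output layer

E = 0.5 * np.sum((d - y2) ** 2)           # E = 1/2 sum_j (d_j - y_j)^2

# Backward pass, following the slide's derivatives:
dE_dy2 = y2 - d                           # dE/dy_j = -(d_j - y_j)
dE_dx2 = y2 * (1 - y2) * dE_dy2           # dE/dx_j = y_j(1 - y_j) dE/dy_j
dE_dW2 = np.outer(y1, dE_dx2)             # dE/dw_ij = y_i dE/dx_j
dE_db2 = dE_dx2                           # dx_j/db_j = 1

dE_dy1 = W2 @ dE_dx2                      # dE/dy_i = sum_j w_ij dE/dx_j
dE_dx1 = y1 * (1 - y1) * dE_dy1
dE_dW1 = np.outer(y0, dE_dx1)
dE_db1 = dE_dx1

# One step of gradient descent (learning rate is an assumption).
lr = 0.5
W2 -= lr * dE_dW2; b2 -= lr * dE_db2
W1 -= lr * dE_dW1; b1 -= lr * dE_db1
```

Repeating the two passes over many training cases, either applying the weight derivatives immediately or accumulating them over a batch, gives the usual gradient-descent training loop.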
Some Success Stories
• Back-propagation has been used for a large number of practical applications:
  – Recognizing hand-written characters
  – Predicting the future price of stocks
  – Detecting credit card fraud
  – Recognizing speech (wreck a nice beach)
  – Predicting the next word in a sentence from the previous words
    • This is essential for good speech recognition.

Overview of the applications in this lecture
• Modeling relational data
  – This toy application shows that the hidden units can learn to represent sensible features that are not at all obvious.
  – It also bridges the gap between relational graphs and feature vectors.
• Learning to predict the next word in a sentence
  – The toy model above can be turned into a useful model for predicting words to help a speech recognizer.
• Reading documents
  – An impressive application that is used to read checks.

An example of relational information
  [Figure: two family trees; "=" joins married couples]
  Christopher = Penelope        Andrew = Christine
  Margaret = Arthur    Victoria = James    Jennifer = Charles
  Colin        Charlotte

  Roberto = Maria        Pierro = Francesca
  Gina = Emilio    Lucia = Marco    Angela = Tomaso
  Alfonso        Sophia

Another way to express the same information
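The preview breaks off at this slide title. Since the overview above says the toy application "bridges the gap between relational graphs and feature vectors", here is a minimal sketch of one common way such a bridge is built: each fact in the family tree becomes a (person1, relationship, person2) triple, and the symbols are coded as one-of-N vectors so a network can take (person1, relationship) as input and be trained to output person2. The particular relationship names, the parent/child links read off the figure, and the one-of-N coding are illustrative assumptions, not content taken from the slides.

```python
import numpy as np

# A few facts from the first family tree (relationship names are assumed;
# "=" in the figure is read as marriage, and the parent links are inferred
# from the usual layout of this figure).
triples = [
    ("christopher", "has-wife",    "penelope"),
    ("andrew",      "has-wife",    "christine"),
    ("victoria",    "has-husband", "james"),
    ("colin",       "has-father",  "james"),
    ("colin",       "has-mother",  "victoria"),
]

people    = sorted({p for (p, _, _) in triples} | {q for (_, _, q) in triples})
relations = sorted({r for (_, r, _) in triples})

def one_hot(item, vocab):
    # One unit per symbol: a purely local code with no built-in similarity.
    v = np.zeros(len(vocab))
    v[vocab.index(item)] = 1.0
    return v

# Each training case: input = (person1, relationship), target = person2.
X = [np.concatenate([one_hot(p, people), one_hot(r, relations)])
     for (p, r, _) in triples]
T = [one_hot(q, people) for (_, _, q) in triples]

print(len(people), "people,", len(relations), "relations")
print("input length:", X[0].shape[0], " target length:", T[0].shape[0])
```

A network trained on such vectors has no access to the graph structure itself, which is why any sensible features its hidden units learn (family branch, generation, and so on) must be discovered from the triples alone.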

