Neural Networks
10-701/15-781 Recitation
February 12, 2008

Parts of these slides are from previous years' 10-701 recitation and lecture notes, and from Prof. Andrew Moore's data mining tutorials.

Recall Linear Regression
- Prediction of continuous variables: learn the mapping f: X -> Y
- The model is linear in the parameters w (+ some noise); assume Gaussian noise:
  f(x) = w·x  (or  f(x) = Σ_i w_i x_i)
- Learn the MLE weights:  w = (X^T X)^{-1} X^T Y

Neural Networks
- Neural nets are also models with parameters w in them; they are now called weights.
- As before, we want to compute the weights that minimize the sum of squared residuals, which, under a "Gaussian i.i.d. noise" assumption, is again maximum likelihood.
- Instead of explicitly solving for the maximum-likelihood weights, we use gradient descent.

Perceptrons
- Input x = (x1, ..., xn) and target value t; the output is
  o(x) = f(w0 + Σ_{i=1}^n w_i x_i)
- where f can be the sigmoid:  o(x) = σ(net) = 1 / (1 + exp(-net)),  net = w0 + Σ_{i=1}^n w_i x_i
- or e.g. the sign function:  o(x) = sign(net) = 1 if net > 0, -1 otherwise
- Given training data {(x^(l), t^(l))}, find w which minimizes
  E(w) = 1/2 Σ_{l=1}^L (t^(l) - o(x^(l)))^2

Gradient Descent
- A general framework for finding a minimum of a continuous (differentiable) function f(w).
- Start with some initial value w^(1) and compute the gradient vector ∇f(w^(1)).
- The next value w^(2) is obtained by moving some distance from w^(1) in the direction of steepest descent, i.e., along the negative of the gradient:
  w^(k+1) = w^(k) - η_k ∇f(w^(k))

Gradient Descent on a Perceptron
- The sigmoid perceptron update rule:
  w_j <- w_j + η Σ_{l=1}^L δ^(l) x_j^(l),
  where δ^(l) = (t^(l) - o^(l)) o^(l) (1 - o^(l))  and  o^(l) = σ(Σ_{j=0}^n w_j x_j^(l))

Boolean Functions
- e.g., using the step activation function with threshold 0, can we learn the function
  X1 AND X2?  X1 OR X2?  X1 AND NOT X2?  X1 XOR X2?

Multilayer Networks
- The class of functions representable by a perceptron is limited.
- Think of nonlinear functions:  o(x) = Σ_j W_j h_j,  where h_j = f(Σ_i w_ji x_i)
- A 1-hidden-layer net:  N_input = 2, N_hidden = 3, N_output = 1

Backpropagation (HW2 – Problem 2)
- Output of the k-th output unit given input x:
  o_k(x) = f(Σ_j W_kj f(Σ_i w_ji x_i))
- With bias: add a constant term for every non-input unit.
- Learn w to minimize
  E(w) = 1/2 Σ_{k=1}^K (t_k - o_k(x))^2

Backpropagation (algorithm)
- Initialize all weights.
- Do until convergence:
  1. Input a training example to the network and compute the output o_k.
  2. Update each hidden-to-output weight W_kj by
     W_kj <- W_kj + η δ_k y_j,
     where δ_k = (t_k - o_k) f'(net_k) and y_j is the output from hidden unit j.
  3. Update each input-to-hidden weight w_ji by
     w_ji <- w_ji + η δ_j x_i,
     where δ_j = f'(net_j) Σ_k δ_k W_kj.
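The closed-form MLE from the linear-regression recap, w = (X^T X)^{-1} X^T Y, can be checked with a short NumPy sketch. The data here is illustrative, not from the slides:

```python
import numpy as np

# Illustrative data: targets generated by y = 2*x1 + 3*x2 (no noise),
# with a leading column of ones so w[0] plays the role of a bias term.
X = np.array([[1.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 1.0]])
Y = X @ np.array([0.0, 2.0, 3.0])

# Normal equations: solve (X^T X) w = X^T Y rather than forming the inverse.
w = np.linalg.solve(X.T @ X, X.T @ Y)
print(w)  # recovers [0. 2. 3.] up to floating-point error
```

Solving the linear system with `np.linalg.solve` is numerically preferable to computing (X^T X)^{-1} explicitly.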
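The sigmoid perceptron update rule can be sketched as batch gradient descent. Here it learns X1 AND X2, one of the linearly separable Boolean functions from the slides; the learning rate, iteration count, and initialization are arbitrary choices of this sketch:

```python
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

# X1 AND X2; the first column is the constant input x0 = 1 for the bias w0.
X = np.array([[1, 0, 0],
              [1, 0, 1],
              [1, 1, 0],
              [1, 1, 1]], dtype=float)
t = np.array([0.0, 0.0, 0.0, 1.0])

rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=3)
eta = 0.5

# w_j <- w_j + eta * sum_l delta^(l) * x_j^(l),
# with delta^(l) = (t^(l) - o^(l)) o^(l) (1 - o^(l))
for _ in range(5000):
    o = sigmoid(X @ w)
    delta = (t - o) * o * (1 - o)
    w += eta * (X.T @ delta)

print((sigmoid(X @ w) > 0.5).astype(int))  # [0 0 0 1]
```

A single perceptron cannot represent X1 XOR X2, which is what motivates the multilayer networks above.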
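The full backpropagation loop can be sketched for the 1-hidden-layer net from the slides (N_input = 2, N_hidden = 3, N_output = 1) with sigmoid units, trained on XOR. This is a minimal illustration, not the HW2 solution; the learning rate, seed, and epoch count are arbitrary, and plain gradient descent on XOR can occasionally stall in a poor region:

```python
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

# XOR training data and targets.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([0.0, 1.0, 1.0, 0.0])

rng = np.random.default_rng(1)
W1 = rng.normal(scale=1.0, size=(3, 2))  # input-to-hidden weights w_ji
b1 = np.zeros(3)                         # hidden biases
W2 = rng.normal(scale=1.0, size=3)       # hidden-to-output weights W_kj
b2 = 0.0                                 # output bias
eta = 0.5

for _ in range(10000):
    for x, tk in zip(X, t):
        # 1. Forward pass: y_j = f(net_j), o = f(net_out).
        y = sigmoid(W1 @ x + b1)
        o = sigmoid(W2 @ y + b2)
        # 2. Output delta: delta_out = (t - o) f'(net_out),
        #    with f'(net) = o (1 - o) for the sigmoid.
        delta_out = (tk - o) * o * (1 - o)
        # 3. Hidden deltas: delta_j = f'(net_j) * W_kj * delta_out.
        delta_h = y * (1 - y) * W2 * delta_out
        # Weight updates: w <- w + eta * delta * input-to-that-weight.
        W2 += eta * delta_out * y
        b2 += eta * delta_out
        W1 += eta * np.outer(delta_h, x)
        b1 += eta * delta_h

outputs = sigmoid(sigmoid(X @ W1.T + b1) @ W2 + b2)
print(np.round(outputs, 2))
```

With enough hidden units the network can represent XOR, unlike the single perceptron; a successful run drives the four outputs toward 0, 1, 1, 0.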