ICS 273A, UC Irvine. Instructor: Max Welling. Neural Networks.

Neurons
• Neurons communicate by receiving signals on their dendrites, adding these signals, and firing off a new signal along the axon if the total input exceeds a threshold.
• The axon connects to new dendrites through synapses, which can learn how much signal is transmitted.
• McCulloch and Pitts ('43) built the first abstract model of a neuron:
$y_i = g\big(\sum_j W_{ij} x_j + b_i\big)$
with input $x$, output $y$, weights $W$, bias $b$, and activation function $g$.

Neurons
• We have about $10^{11}$ neurons, each one connected to $10^4$ other neurons on average.
• Each neuron needs at least $10^{-3}$ seconds to transmit a signal.
• So we have many, slow neurons. Yet we recognize our grandmother in about $10^{-1}$ sec.
• Computers have much faster switching times: about $10^{-10}$ sec.
• Conclusion: brains compute in parallel!
• In fact, neurons are unreliable/noisy as well. But since things are encoded redundantly by many of them, their population can do computation reliably and fast.

Classification & Regression
• Neural nets are a parameterized function $Y = f(X; W)$ from inputs ($X$) to outputs ($Y$).
• If $Y$ is continuous: regression; if $Y$ is discrete: classification.
• We adapt the weights so as to minimize the error between the data and the model predictions:
$\mathrm{error} = \sum_{n=1}^{N} \sum_{i=1}^{d_{out}} \Big( y_{in} - \sum_{j=1}^{d_{in}} W_{ij} x_{jn} - b_i \Big)^2$
• This is just a perceptron with a quadratic cost function.
[Figure: single-layer network with inputs $x_j$, weights $W_{ij}$, and outputs $f(x)_i$.]

Optimization
• We use stochastic gradient descent: pick a single data item, compute the contribution of that data point to the overall gradient, and update the weights. (A minimal code sketch of this rule appears after the back-propagation derivation below.)
Repeat:
1) Pick a random data item $(x_n, y_n)$.
2) Define $\delta_{in} = y_{in} - \sum_k W_{ik} x_{kn} - b_i$.
3) Update $W_{ij} \leftarrow W_{ij} + \eta\, \delta_{in} x_{jn}$ and $b_i \leftarrow b_i + \eta\, \delta_{in}$.

Stochastic Gradient Descent
[Figure: stochastic updates vs. full updates (averaged over all data items).]
• Stochastic gradient descent does not converge to the minimum, but "dances" around it.
• To get to the minimum, one needs to decrease the step size as one gets closer to the minimum.
• Alternatively, one can obtain a few samples and average predictions over them (similar to bagging).

Multi-Layer Nets
• Single layers can only do linear things. If we want to learn non-linear decision surfaces, or non-linear regression curves, we need more than one layer.
• In fact, a NN with 1 hidden layer can approximate any boolean and continuous function:
$\hat{y}_i = g\big(\textstyle\sum_j W^3_{ij} h^2_j + b^3_i\big), \quad h^2_i = g\big(\textstyle\sum_j W^2_{ij} h^1_j + b^2_i\big), \quad h^1_i = g\big(\textstyle\sum_j W^1_{ij} x_j + b^1_i\big)$
[Figure: network $x \to h^1 \to h^2 \to y$ with parameters $(W^1,b^1)$, $(W^2,b^2)$, $(W^3,b^3)$.]

Back-Propagation
• How do we learn the weights of a multi-layer network? Answer: stochastic gradient descent. But now the gradients are harder! Take the cross-entropy error
$\mathrm{error} = \sum_{in} y_{in} \log h^3_{in} + (1 - y_{in}) \log(1 - h^3_{in}), \qquad \frac{d\,\mathrm{error}_n}{d h^3_{in}} = \frac{y_{in}}{h^3_{in}} - \frac{1 - y_{in}}{1 - h^3_{in}}.$
Using a sigmoid activation $g = \sigma$ (so that $g' = g(1-g)$), the chain rule gives
$\frac{d\,\mathrm{error}}{dW^2_{jk}} = \sum_{in} \frac{d\,\mathrm{error}_n}{d h^3_{in}} \frac{d h^3_{in}}{dW^2_{jk}} = \sum_{in} \frac{d\,\mathrm{error}_n}{d h^3_{in}}\, h^3_{in}(1 - h^3_{in})\, \frac{d}{dW^2_{jk}}\Big(\sum_s W^3_{is} h^2_{sn} + b^3_i\Big)$
$= \sum_{in} \frac{d\,\mathrm{error}_n}{d h^3_{in}}\, h^3_{in}(1 - h^3_{in})\, W^3_{ij}\, \frac{d h^2_{jn}}{dW^2_{jk}} = \sum_{in} \frac{d\,\mathrm{error}_n}{d h^3_{in}}\, h^3_{in}(1 - h^3_{in})\, W^3_{ij}\, h^2_{jn}(1 - h^2_{jn})\, \frac{d}{dW^2_{jk}}\Big(\sum_s W^2_{js} h^1_{sn} + b^2_j\Big)$
$= \sum_{in} \frac{d\,\mathrm{error}_n}{d h^3_{in}}\, h^3_{in}(1 - h^3_{in})\, W^3_{ij}\, h^2_{jn}(1 - h^2_{jn})\, h^1_{kn}, \qquad \text{where } h^1_{kn} = \sigma\big(\textstyle\sum_l W^1_{kl} x_{ln} + b^1_k\big).$
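To make the single-layer rule from the Optimization slide concrete, here is a minimal NumPy sketch of it; the toy data, `eta`, and the epoch count are illustrative assumptions, not from the slides:

```python
import numpy as np

# Toy data: N examples, d_in inputs, d_out linear outputs (illustrative).
rng = np.random.default_rng(0)
N, d_in, d_out = 100, 3, 2
X = rng.normal(size=(N, d_in))
W_true = rng.normal(size=(d_out, d_in))
Y = X @ W_true.T + 0.1 * rng.normal(size=(N, d_out))

W = np.zeros((d_out, d_in))   # weights W_ij
b = np.zeros(d_out)           # biases b_i
eta = 0.01                    # step size (illustrative value)

for _ in range(50):
    for n in rng.permutation(N):          # 1) pick a random data item
        delta = Y[n] - (W @ X[n] + b)     # 2) delta_in = y_in - sum_k W_ik x_kn - b_i
        W += eta * np.outer(delta, X[n])  # 3) W_ij <- W_ij + eta * delta_in * x_jn
        b += eta * delta                  #    b_i  <- b_i  + eta * delta_in
```

With a linear output this is exactly the delta rule for the quadratic cost above, and after a few epochs `W` should approach `W_true`.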
Back Propagation
[Figure: network $x \to h^1 \to h^2 \to y$ with parameters $(W^1,b^1)$, $(W^2,b^2)$, $(W^3,b^3)$.]
• Upward pass:
$h^1_{in} = \sigma\big(\textstyle\sum_j W^1_{ij} x_{jn} + b^1_i\big), \quad h^2_{in} = \sigma\big(\textstyle\sum_j W^2_{ij} h^1_{jn} + b^2_i\big), \quad h^3_{in} = \sigma\big(\textstyle\sum_j W^3_{ij} h^2_{jn} + b^3_i\big)$
• Downward pass:
$\delta^3_{in} = h^3_{in}(1 - h^3_{in})\, \frac{d\,\mathrm{error}_n}{d h^3_{in}}$
$\delta^2_{jn} = h^2_{jn}(1 - h^2_{jn}) \sum_{i\,\mathrm{upstream}} W^3_{ij}\, \delta^3_{in}$
$\delta^1_{kn} = h^1_{kn}(1 - h^1_{kn}) \sum_{j\,\mathrm{upstream}} W^2_{jk}\, \delta^2_{jn}$
• With these deltas the gradient from the previous slide takes a simple form,
$\frac{d\,\mathrm{error}}{dW^2_{jk}} = \sum_{in} \frac{d\,\mathrm{error}_n}{d h^3_{in}}\, h^3_{in}(1 - h^3_{in})\, W^3_{ij}\, h^2_{jn}(1 - h^2_{jn})\, h^1_{kn} = \sum_n \delta^2_{jn} h^1_{kn},$
and the updates become
$W^2_{jk} \leftarrow W^2_{jk} - \eta\, \delta^2_{jn} h^1_{kn}, \qquad b^2_j \leftarrow b^2_j - \eta\, \delta^2_{jn}.$

ALVINN
• Learning to drive a car: this hidden unit detects a mildly left-sloping road and advises to steer left. What would another hidden unit look like?

Weight Decay
• NNs can also overfit (of course).
• We can try to avoid this by initializing all weight/bias terms to very small random values and growing them during learning.
• One can then check performance on a validation set and stop early.
• Or one can change the update rule to discourage large weights (see the sketch after the conclusion):
$W^2_{jk} \leftarrow W^2_{jk} - \eta\, \delta^2_{jn} h^1_{kn} - \lambda W^2_{jk}, \qquad b^2_j \leftarrow b^2_j - \eta\, \delta^2_{jn} - \lambda b^2_j$
• Now we need to set $\lambda$ using cross-validation.
• This is called "weight decay" in NN jargon.

Momentum
• In the beginning of learning it is likely that the weights are changed in a consistent manner.
• Like a ball rolling down a hill, we should gain speed if we make consistent changes. It's like an adaptive step size.
• This idea is easily implemented by changing the update as follows (and similarly for the biases; see the sketch after the conclusion):
$\Delta W^2_{jk}(\mathrm{new}) = \eta\, \delta^2_{jn} h^1_{kn} + \gamma\, \Delta W^2_{jk}(\mathrm{old}), \qquad W^2_{jk} \leftarrow W^2_{jk} - \Delta W^2_{jk}(\mathrm{new})$

Conclusion
• NNs are a flexible way to model input/output functions.
• They are robust against noisy data.
• The results are hard to interpret (unlike decision trees).
• Learning is fast on large datasets when using stochastic gradient descent plus momentum.
• Local minima in the optimization are a problem.
• Overfitting can be avoided using weight decay or early stopping.
• There are also NNs which feed information back (recurrent NNs).
• Many more interesting NNs: Boltzmann machines, self-organizing maps, ...
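A minimal NumPy sketch of one stochastic back-propagation step for the three-layer net above. The layer sizes, `eta`, and the toy data are illustrative assumptions; the sketch minimizes the negative of the slides' log-likelihood error, so for a sigmoid output $\delta^3$ simplifies to $h^3 - y$:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
d_x, d_1, d_2, d_y = 4, 8, 8, 1                        # layer sizes (illustrative)
W1 = 0.1 * rng.normal(size=(d_1, d_x)); b1 = np.zeros(d_1)
W2 = 0.1 * rng.normal(size=(d_2, d_1)); b2 = np.zeros(d_2)
W3 = 0.1 * rng.normal(size=(d_y, d_2)); b3 = np.zeros(d_y)
eta = 0.1                                               # step size (illustrative)

def sgd_step(x, y):
    """One stochastic gradient step on a single data item (x, y)."""
    global W1, b1, W2, b2, W3, b3
    # Upward pass: h1, h2, h3 as on the slides.
    h1 = sigmoid(W1 @ x + b1)
    h2 = sigmoid(W2 @ h1 + b2)
    h3 = sigmoid(W3 @ h2 + b3)
    # Downward pass: h3*(1-h3) times the derivative of the negative
    # log-likelihood error w.r.t. h3 simplifies to (h3 - y).
    d3 = h3 - y
    d2 = h2 * (1 - h2) * (W3.T @ d3)   # delta2_j = h2_j(1-h2_j) sum_i W3_ij delta3_i
    d1 = h1 * (1 - h1) * (W2.T @ d2)   # delta1_k = h1_k(1-h1_k) sum_j W2_jk delta2_j
    # Updates: W <- W - eta * outer(delta, activation below), b <- b - eta * delta.
    W3 -= eta * np.outer(d3, h2); b3 -= eta * d3
    W2 -= eta * np.outer(d2, h1); b2 -= eta * d2
    W1 -= eta * np.outer(d1, x);  b1 -= eta * d1

# Toy usage: a binary target that depends non-linearly on the inputs.
X = rng.normal(size=(200, d_x))
Y = (X[:, 0] * X[:, 1] > 0).astype(float).reshape(-1, 1)
for _ in range(100):
    for n in rng.permutation(len(X)):
        sgd_step(X[n], Y[n])
```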
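The Weight Decay and Momentum slides change only the update step. Below is a hedged sketch of a combined update; the function name and the default values of `lam` and `gamma` are illustrative, and folding the decay term $\lambda W$ into the running change $\Delta W$ is one reasonable way to merge the two rules, not the slides' only option:

```python
import numpy as np

def update_with_decay_and_momentum(W, grad, delta_W, eta=0.1, lam=1e-4, gamma=0.9):
    """One weight update combining weight decay and momentum.

    W       : weight matrix, updated in place
    grad    : this item's gradient, e.g. np.outer(d2, h1) for W2
    delta_W : Delta W from the previous step, updated in place
    """
    # Delta W(new) = eta * grad + lambda * W + gamma * Delta W(old)
    delta_W *= gamma
    delta_W += eta * grad + lam * W
    W -= delta_W                       # W <- W - Delta W(new)

# Usage inside sgd_step above, replacing the plain W2 update
# (dW2 = np.zeros_like(W2) is initialized once, outside the training loop):
# update_with_decay_and_momentum(W2, np.outer(d2, h1), dW2)
```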