ICS 273A, UC Irvine. Instructor: Max Welling. Neural Networks.

Neurons
• Neurons communicate by receiving signals on their dendrites, adding these signals, and firing off a new signal along the axon if the total input exceeds a threshold.
• The axon connects to new dendrites through synapses, which can learn how much signal is transmitted.
• McCulloch and Pitts ('43) built the first abstract model of a neuron:
$y_i = g\big(\sum_j W_{ij} x_j + b_i\big)$
with input $x$, output $y$, weights $W$, bias $b$, and activation function $g$.

Neurons
• We have about $10^{11}$ neurons, each one connected to $10^4$ other neurons on average.
• Each neuron needs at least $10^{-3}$ seconds to transmit a signal.
• So we have many, slow neurons. Yet we recognize our grandmother in about $10^{-1}$ sec.
• Computers have much faster switching times: about $10^{-10}$ sec.
• Conclusion: brains compute in parallel!
• In fact, neurons are unreliable/noisy as well. But since things are encoded redundantly by many of them, their population can do computation reliably and fast.

Classification & Regression
• Neural nets are a parameterized function $Y = f(X; W)$ from inputs ($X$) to outputs ($Y$).
• If $Y$ is continuous: regression; if $Y$ is discrete: classification.
• We adapt the weights so as to minimize the error between the data and the model predictions:
$\mathrm{error} = \sum_{n=1}^{N} \sum_{i=1}^{d_{out}} \Big( y_{in} - \sum_{j=1}^{d_{in}} W_{ij} x_{jn} - b_i \Big)^2$
• This is just a perceptron with a quadratic cost function.
[Figure: single-layer network with inputs $x_j$, weights $W_{ij}$, and outputs $f(x)_i$.]

Optimization
• We use stochastic gradient descent: pick a single data item, compute the contribution of that data point to the overall gradient, and update the weights. (A minimal code sketch of this rule appears after the back-propagation derivation below.)
Repeat:
1) Pick a random data item $(x_n, y_n)$.
2) Define $\delta_{in} = y_{in} - \sum_k W_{ik} x_{kn} - b_i$.
3) Update $W_{ij} \leftarrow W_{ij} + \eta\, \delta_{in} x_{jn}$ and $b_i \leftarrow b_i + \eta\, \delta_{in}$.

Stochastic Gradient Descent
[Figure: stochastic updates vs. full updates (averaged over all data items).]
• Stochastic gradient descent does not converge to the minimum, but "dances" around it.
• To get to the minimum, one needs to decrease the step size as one gets closer to the minimum.
• Alternatively, one can obtain a few samples and average predictions over them (similar to bagging).

Multi-Layer Nets
• Single layers can only do linear things. If we want to learn non-linear decision surfaces, or non-linear regression curves, we need more than one layer.
• In fact, a NN with 1 hidden layer can approximate any boolean and continuous function:
$\hat{y}_i = g\big(\textstyle\sum_j W^3_{ij} h^2_j + b^3_i\big), \quad h^2_i = g\big(\textstyle\sum_j W^2_{ij} h^1_j + b^2_i\big), \quad h^1_i = g\big(\textstyle\sum_j W^1_{ij} x_j + b^1_i\big)$
[Figure: network $x \to h^1 \to h^2 \to y$ with parameters $(W^1,b^1)$, $(W^2,b^2)$, $(W^3,b^3)$.]

Back-Propagation
• How do we learn the weights of a multi-layer network? Answer: stochastic gradient descent. But now the gradients are harder! Take the cross-entropy error
$\mathrm{error} = \sum_{in} y_{in} \log h^3_{in} + (1 - y_{in}) \log(1 - h^3_{in}), \qquad \frac{d\,\mathrm{error}_n}{d h^3_{in}} = \frac{y_{in}}{h^3_{in}} - \frac{1 - y_{in}}{1 - h^3_{in}}.$
Using a sigmoid activation $g = \sigma$ (so that $g' = g(1-g)$), the chain rule gives
$\frac{d\,\mathrm{error}}{dW^2_{jk}} = \sum_{in} \frac{d\,\mathrm{error}_n}{d h^3_{in}} \frac{d h^3_{in}}{dW^2_{jk}} = \sum_{in} \frac{d\,\mathrm{error}_n}{d h^3_{in}}\, h^3_{in}(1 - h^3_{in})\, \frac{d}{dW^2_{jk}}\Big(\sum_s W^3_{is} h^2_{sn} + b^3_i\Big)$
$= \sum_{in} \frac{d\,\mathrm{error}_n}{d h^3_{in}}\, h^3_{in}(1 - h^3_{in})\, W^3_{ij}\, \frac{d h^2_{jn}}{dW^2_{jk}} = \sum_{in} \frac{d\,\mathrm{error}_n}{d h^3_{in}}\, h^3_{in}(1 - h^3_{in})\, W^3_{ij}\, h^2_{jn}(1 - h^2_{jn})\, \frac{d}{dW^2_{jk}}\Big(\sum_s W^2_{js} h^1_{sn} + b^2_j\Big)$
$= \sum_{in} \frac{d\,\mathrm{error}_n}{d h^3_{in}}\, h^3_{in}(1 - h^3_{in})\, W^3_{ij}\, h^2_{jn}(1 - h^2_{jn})\, h^1_{kn}, \qquad \text{where } h^1_{kn} = \sigma\big(\textstyle\sum_l W^1_{kl} x_{ln} + b^1_k\big).$
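To make the single-layer rule from the Optimization slide concrete, here is a minimal NumPy sketch of it; the toy data, `eta`, and the epoch count are illustrative assumptions, not from the slides:

```python
import numpy as np

# Toy data: N examples, d_in inputs, d_out linear outputs (illustrative).
rng = np.random.default_rng(0)
N, d_in, d_out = 100, 3, 2
X = rng.normal(size=(N, d_in))
W_true = rng.normal(size=(d_out, d_in))
Y = X @ W_true.T + 0.1 * rng.normal(size=(N, d_out))

W = np.zeros((d_out, d_in))   # weights W_ij
b = np.zeros(d_out)           # biases b_i
eta = 0.01                    # step size (illustrative value)

for _ in range(50):
    for n in rng.permutation(N):          # 1) pick a random data item
        delta = Y[n] - (W @ X[n] + b)     # 2) delta_in = y_in - sum_k W_ik x_kn - b_i
        W += eta * np.outer(delta, X[n])  # 3) W_ij <- W_ij + eta * delta_in * x_jn
        b += eta * delta                  #    b_i  <- b_i  + eta * delta_in
```

With a linear output this is exactly the delta rule for the quadratic cost above, and after a few epochs `W` should approach `W_true`.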
Back Propagation
[Figure: network $x \to h^1 \to h^2 \to y$ with parameters $(W^1,b^1)$, $(W^2,b^2)$, $(W^3,b^3)$.]
• Upward pass:
$h^1_{in} = \sigma\big(\textstyle\sum_j W^1_{ij} x_{jn} + b^1_i\big), \quad h^2_{in} = \sigma\big(\textstyle\sum_j W^2_{ij} h^1_{jn} + b^2_i\big), \quad h^3_{in} = \sigma\big(\textstyle\sum_j W^3_{ij} h^2_{jn} + b^3_i\big)$
• Downward pass:
$\delta^3_{in} = h^3_{in}(1 - h^3_{in})\, \frac{d\,\mathrm{error}_n}{d h^3_{in}}$
$\delta^2_{jn} = h^2_{jn}(1 - h^2_{jn}) \sum_{i\,\mathrm{upstream}} W^3_{ij}\, \delta^3_{in}$
$\delta^1_{kn} = h^1_{kn}(1 - h^1_{kn}) \sum_{j\,\mathrm{upstream}} W^2_{jk}\, \delta^2_{jn}$
• With these deltas the gradient from the previous slide takes a simple form,
$\frac{d\,\mathrm{error}}{dW^2_{jk}} = \sum_{in} \frac{d\,\mathrm{error}_n}{d h^3_{in}}\, h^3_{in}(1 - h^3_{in})\, W^3_{ij}\, h^2_{jn}(1 - h^2_{jn})\, h^1_{kn} = \sum_n \delta^2_{jn} h^1_{kn},$
and the updates become
$W^2_{jk} \leftarrow W^2_{jk} - \eta\, \delta^2_{jn} h^1_{kn}, \qquad b^2_j \leftarrow b^2_j - \eta\, \delta^2_{jn}.$

ALVINN
• Learning to drive a car: this hidden unit detects a mildly left-sloping road and advises to steer left. What would another hidden unit look like?

Weight Decay
• NNs can also overfit (of course).
• We can try to avoid this by initializing all weight/bias terms to very small random values and growing them during learning.
• One can then check performance on a validation set and stop early.
• Or one can change the update rule to discourage large weights (see the sketch after the conclusion):
$W^2_{jk} \leftarrow W^2_{jk} - \eta\, \delta^2_{jn} h^1_{kn} - \lambda W^2_{jk}, \qquad b^2_j \leftarrow b^2_j - \eta\, \delta^2_{jn} - \lambda b^2_j$
• Now we need to set $\lambda$ using cross-validation.
• This is called "weight decay" in NN jargon.

Momentum
• In the beginning of learning it is likely that the weights are changed in a consistent manner.
• Like a ball rolling down a hill, we should gain speed if we make consistent changes. It's like an adaptive step size.
• This idea is easily implemented by changing the update as follows (and similarly for the biases; see the sketch after the conclusion):
$\Delta W^2_{jk}(\mathrm{new}) = \eta\, \delta^2_{jn} h^1_{kn} + \gamma\, \Delta W^2_{jk}(\mathrm{old}), \qquad W^2_{jk} \leftarrow W^2_{jk} - \Delta W^2_{jk}(\mathrm{new})$

Conclusion
• NNs are a flexible way to model input/output functions.
• They are robust against noisy data.
• The results are hard to interpret (unlike decision trees).
• Learning is fast on large datasets when using stochastic gradient descent plus momentum.
• Local minima in the optimization are a problem.
• Overfitting can be avoided using weight decay or early stopping.
• There are also NNs which feed information back (recurrent NNs).
• Many more interesting NNs: Boltzmann machines, self-organizing maps, ...
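A minimal NumPy sketch of one stochastic back-propagation step for the three-layer net above. The layer sizes, `eta`, and the toy data are illustrative assumptions; the sketch minimizes the negative of the slides' log-likelihood error, so for a sigmoid output $\delta^3$ simplifies to $h^3 - y$:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
d_x, d_1, d_2, d_y = 4, 8, 8, 1                        # layer sizes (illustrative)
W1 = 0.1 * rng.normal(size=(d_1, d_x)); b1 = np.zeros(d_1)
W2 = 0.1 * rng.normal(size=(d_2, d_1)); b2 = np.zeros(d_2)
W3 = 0.1 * rng.normal(size=(d_y, d_2)); b3 = np.zeros(d_y)
eta = 0.1                                               # step size (illustrative)

def sgd_step(x, y):
    """One stochastic gradient step on a single data item (x, y)."""
    global W1, b1, W2, b2, W3, b3
    # Upward pass: h1, h2, h3 as on the slides.
    h1 = sigmoid(W1 @ x + b1)
    h2 = sigmoid(W2 @ h1 + b2)
    h3 = sigmoid(W3 @ h2 + b3)
    # Downward pass: h3*(1-h3) times the derivative of the negative
    # log-likelihood error w.r.t. h3 simplifies to (h3 - y).
    d3 = h3 - y
    d2 = h2 * (1 - h2) * (W3.T @ d3)   # delta2_j = h2_j(1-h2_j) sum_i W3_ij delta3_i
    d1 = h1 * (1 - h1) * (W2.T @ d2)   # delta1_k = h1_k(1-h1_k) sum_j W2_jk delta2_j
    # Updates: W <- W - eta * outer(delta, activation below), b <- b - eta * delta.
    W3 -= eta * np.outer(d3, h2); b3 -= eta * d3
    W2 -= eta * np.outer(d2, h1); b2 -= eta * d2
    W1 -= eta * np.outer(d1, x);  b1 -= eta * d1

# Toy usage: a binary target that depends non-linearly on the inputs.
X = rng.normal(size=(200, d_x))
Y = (X[:, 0] * X[:, 1] > 0).astype(float).reshape(-1, 1)
for _ in range(100):
    for n in rng.permutation(len(X)):
        sgd_step(X[n], Y[n])
```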
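The Weight Decay and Momentum slides change only the update step. Below is a hedged sketch of a combined update; the function name and the default values of `lam` and `gamma` are illustrative, and folding the decay term $\lambda W$ into the running change $\Delta W$ is one reasonable way to merge the two rules, not the slides' only option:

```python
import numpy as np

def update_with_decay_and_momentum(W, grad, delta_W, eta=0.1, lam=1e-4, gamma=0.9):
    """One weight update combining weight decay and momentum.

    W       : weight matrix, updated in place
    grad    : this item's gradient, e.g. np.outer(d2, h1) for W2
    delta_W : Delta W from the previous step, updated in place
    """
    # Delta W(new) = eta * grad + lambda * W + gamma * Delta W(old)
    delta_W *= gamma
    delta_W += eta * grad + lam * W
    W -= delta_W                       # W <- W - Delta W(new)

# Usage inside sgd_step above, replacing the plain W2 update
# (dW2 = np.zeros_like(W2) is initialized once, outside the training loop):
# update_with_decay_and_momentum(W2, np.outer(d2, h1), dW2)
```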