Machine Learning 10-601, Fall 2013
Artificial Neural Networks and Deep Learning
Eric Xing
Lecture 7, September 25, 2013
Reading: Chap. 5 CB
Eric Xing, CMU, 2013

Recall: Logistic Regression (sigmoid classifier, MaxEnt classifier)
- The prediction rule:
  $$p(y = 1 \mid x) = \frac{1}{1 + \exp\left(-\sum_{i=1}^{M} \theta_i x_i - \theta_0\right)} = \frac{1}{1 + e^{-\theta^T x}}$$
- In this case, learning $p(y \mid x)$ amounts to learning $\theta$.
- Algorithm: gradient ascent.
- What is the limitation?

Learning highly non-linear functions
$f: X \to Y$
- $f$ might be a non-linear function.
- $X$: (vector of) continuous and/or discrete variables.
- $Y$: (vector of) continuous and/or discrete variables.
- Examples: the XOR gate, speech recognition.

Our brain is very good at this
[Figure not preserved.]

How a neuron works
[Figure: a biological neuron (dendrites, synapses, axon) and its abstraction (nodes, with synapses as weights).]
- Activation function:
  $$X = \sum_{i=1}^{M} x_i w_i, \qquad Y = \begin{cases} +1 & \text{if } X \ge \theta \\ -1 & \text{if } X < \theta \end{cases}$$
- A mathematical expression:
  $$p(y = 1 \mid x) = \frac{1}{1 + \exp\left(-\sum_{i=1}^{M} w_i x_i - w_0\right)} = \frac{1}{1 + e^{-w^T x}}$$
[Figure: inputs $x_1$, $x_2$ with weights $w_1$, $w_2$ feed a linear combiner, then a hard limiter with threshold $w_0$, giving output $Y$.]

Perceptron and Neural Nets
- From biological neuron to artificial neuron (perceptron).
  [Figure: inputs $x_1$, $x_2$, weights $w_1$, $w_2$, linear combiner, hard limiter, threshold, output $Y$.]
- From biological neuron network to artificial neuron networks.
  [Figure: soma, dendrites, axon and synapses on one side; input signals, input layer, middle layer, output layer and output signals on the other.]

Jargon Pseudo-Correspondence
- Independent variable = input variable
- Dependent variable = output variable
- Coefficients = weights
- Estimates = targets

Logistic Regression Model (the sigmoid unit)
[Figure: inputs Age (34), Gender and Stage are the independent variables $x_1$, $x_2$, $x_3$; the coefficients are $a$, $b$, $c$; the output 0.6, the probability of being alive, is the dependent variable $p$ (the prediction).]

A perceptron learning algorithm
- Recall the nice property of the sigmoid function:
  $$\frac{d\sigma(a)}{da} = \sigma(a)\,(1 - \sigma(a))$$
- Consider the regression problem $f: X \to Y$, for scalar $Y$.
- We used to maximize the conditional data likelihood.
[The equations on this slide are not preserved in the source.]

Gradient Descent
Notation: $x_d$ = input, $t_d$ = target output, $o_d$ = observed unit output, $w_i$ = weight $i$.
[The equations on this slide are not preserved in the source.]
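The sigmoid unit and the derivative property it relies on can be checked numerically. The sketch below is a minimal illustration, not the lecture's own code: it assumes the squared error $E_d = \frac{1}{2}(t_d - o_d)^2$ for a single example (the objective named in the lecture's closing summary), and the function names, weights and inputs are hypothetical, chosen only for the example.

```python
import math

def sigmoid(a):
    """Logistic function 1 / (1 + e^{-a})."""
    return 1.0 / (1.0 + math.exp(-a))

def predict_proba(w, w0, x):
    """p(y = 1 | x) for a sigmoid unit with weights w and bias term w0."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + w0)

def sigmoid_derivative(a):
    """Closed form d sigma / d a = sigma(a) * (1 - sigma(a))."""
    s = sigmoid(a)
    return s * (1.0 - s)

def numerical_derivative(f, a, eps=1e-6):
    """Central finite difference, used only to check the closed form."""
    return (f(a + eps) - f(a - eps)) / (2.0 * eps)

def squared_error(w, w0, x, t):
    """Assumed objective for one example: E_d = 1/2 * (t - o)^2."""
    o = predict_proba(w, w0, x)
    return 0.5 * (t - o) ** 2

def squared_error_gradient(w, w0, x, t):
    """Gradient of E_d with respect to (w, w0), via the chain rule."""
    o = predict_proba(w, w0, x)
    common = -(t - o) * o * (1.0 - o)  # dE/do * do/da
    return [common * xi for xi in x], common

if __name__ == "__main__":
    # Hypothetical weights, input and target, for illustration only.
    w, w0, x, t = [0.5, -0.25], 0.1, [1.0, 2.0], 1.0
    print("p(y=1|x) =", predict_proba(w, w0, x))
    for a in (-2.0, 0.0, 1.5):
        print(a, sigmoid_derivative(a), numerical_derivative(sigmoid, a))
    print("gradient:", squared_error_gradient(w, w0, x, t))
```

A gradient-descent step would subtract a small multiple of this gradient from the weights. The finite-difference comparison is only a sanity check on the closed form.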
The perceptron learning rules
Notation: $x_d$ = input, $t_d$ = target output, $o_d$ = observed unit output, $w_i$ = weight $i$.

Incremental mode: do until convergence:
  For each training example $d$ in $D$:
    1. Compute the gradient $\nabla E_d[w]$.
    2. $w \leftarrow w - \eta \nabla E_d[w]$
  where $E_d[w] = \frac{1}{2}(t_d - o_d)^2$.

Batch mode: do until convergence:
    1. Compute the gradient $\nabla E_D[w]$.
    2. $w \leftarrow w - \eta \nabla E_D[w]$
  where $E_D[w] = \frac{1}{2}\sum_{d \in D}(t_d - o_d)^2$.

MLE vs MAP
- Maximum conditional likelihood estimate
- Maximum a posteriori estimate
[The equations on this slide are not preserved in the source.]

What decision surface does a perceptron define? (NAND)

  x  y  Z (color)
  0  0  1
  0  1  1
  1  0  1
  1  1  0

$y = f(x_1 w_1 + x_2 w_2)$, with threshold $\theta = 0.5$:
  $f(0 w_1 + 0 w_2) = 1$
  $f(0 w_1 + 1 w_2) = 1$
  $f(1 w_1 + 0 w_2) = 1$
  $f(1 w_1 + 1 w_2) = 0$
$f(a)$ is a hard threshold: it outputs 1 or 0 depending on which side of $\theta$ the value $a$ falls.
Some possible values for $w_1$ and $w_2$:

  w1    w2
  0.20  0.35
  0.20  0.40
  0.25  0.30
  0.40  0.20

The inequality directions in $f$ and any minus signs on the weights are not preserved in the source.

What decision surface does a perceptron define? (XOR)

  x  y  Z (color)
  0  0  0
  0  1  1
  1  0  1
  1  1  0

$y = f(x_1 w_1 + x_2 w_2)$, with threshold $\theta = 0.5$:
  $f(0 w_1 + 0 w_2) = 0$
  $f(0 w_1 + 1 w_2) = 1$
  $f(1 w_1 + 0 w_2) = 1$
  $f(1 w_1 + 1 w_2) = 0$
Some possible values for $w_1$ and $w_2$: none are listed. XOR is not linearly separable, so no single threshold unit can produce this table.

What decision surface does a perceptron define? (XOR, with a hidden layer)
The same XOR table, now computed by a network of threshold units with weights $w_1, \dots, w_6$ and $\theta = 0.5$ for all units.
A possible set of values for $(w_1, w_2, w_3, w_4, w_5, w_6)$, as printed: (0.6, 0.6, 0.7, 0.8, 1, 1).

Non-Linear Separation

  Code  Cough  Headache  Diagnosis
  00    no     no        No disease
  01    no     yes       Meningitis
  10    yes    no        Pneumonia
  11    yes    yes       Flu

[Figure: the four cases divided into "No treatment" and "Treatment" regions; a second panel labels the vertices 000, 010, 011, 100, 101, 110, 111.]

Neural Network Model
[Figure: inputs Age (34), Gender (2) and Stage (4), the independent variables, connect through weights to a hidden layer, which connects through weights to the output 0.6, the probability of being alive (the dependent variable, or prediction).]

Combined logistic models
[Figures: the same network redrawn over three slides.]
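The single-perceptron decision-surface tables above can be checked by brute force. The sketch below is a hedged illustration with hypothetical helper names (`unit`, `realizes`, `search`). The source does not preserve which side of the threshold fires, so the code takes that as a flag. It also assumes the listed NAND weights pair up column-wise as $(w_1, w_2)$ and searches an arbitrary grid on $[-2, 2]$.

```python
THETA = 0.5  # threshold stated on the slides

NAND = {(0, 0): 1, (0, 1): 1, (1, 0): 1, (1, 1): 0}
XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

# Weight pairs as listed in the text, read column-wise as (w1, w2): an assumption.
LISTED = [(0.20, 0.35), (0.20, 0.40), (0.25, 0.30), (0.40, 0.20)]

def unit(w1, w2, x1, x2, fire_below):
    """Single threshold unit y = f(x1*w1 + x2*w2).

    fire_below=True  -> f(a) = 1 if a < THETA else 0   (assumed convention)
    fire_below=False -> f(a) = 1 if a > THETA else 0   (assumed convention)
    """
    a = x1 * w1 + x2 * w2
    return int(a < THETA) if fire_below else int(a > THETA)

def realizes(table, w1, w2, fire_below):
    """True if the unit reproduces every row of the truth table."""
    return all(unit(w1, w2, x1, x2, fire_below) == z
               for (x1, x2), z in table.items())

def search(table, fire_below, steps=40, lo=-2.0, hi=2.0):
    """Brute-force grid search for weight pairs that realize the table."""
    grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    return [(w1, w2) for w1 in grid for w2 in grid
            if realizes(table, w1, w2, fire_below)]

if __name__ == "__main__":
    for w1, w2 in LISTED:
        print((w1, w2), "realizes NAND:", realizes(NAND, w1, w2, True))
    print("XOR solutions (fire below):", len(search(XOR, True)))
    print("XOR solutions (fire above):", len(search(XOR, False)))
```

With the fire-below convention, every listed pair reproduces the NAND table; with the opposite convention the same pairs compute AND instead. Neither convention admits any weight pair for XOR, which is why the third table moves to a network with a hidden layer.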
Combined logistic models? Not really: there is no target for hidden units.
[Figure: the network with inputs Age (34), Gender (2) and Stage (4), a hidden layer, and the output 0.6, the probability of being alive.]

Recall perceptrons
- Input units: Cough, Headache.
- Output units: No disease, Pneumonia, Flu, Meningitis.
- Error = what we got - what we wanted.
- Delta rule: change the weights to decrease the error.

Hidden Units and Backpropagation
[Figure: input units, hidden units and output units; the error (what we got - what we wanted) at the output units drives the delta rule for the output weights, and is passed back to drive the delta rule for the hidden units.]

Backpropagation Algorithm
Notation: $x_d$ = input, $t_d$ = target output, $o_d$ = observed unit output, $w_i$ = weight $i$.

Initialize all weights to small random numbers.
Until convergence, do:
  1. Input the training example to the network and compute the network outputs.
  2. For each output unit $k$:
     $$\delta_k \leftarrow o_k (1 - o_k)(t_k - o_k)$$
  3. For each hidden unit $h$:
     $$\delta_h \leftarrow o_h (1 - o_h) \sum_{k \in \text{outputs}} w_{h,k}\, \delta_k$$
  4. Update each network weight $w_{i,j}$:
     $$w_{i,j} \leftarrow w_{i,j} + \Delta w_{i,j}, \quad \text{where } \Delta w_{i,j} = \eta\, \delta_j\, x_{i,j}$$

More on Backpropagation
- It is doing gradient descent over the entire network weight vector.
- Easily generalized to arbitrary directed graphs.
- Will find a local, not necessarily global, error minimum. In practice, it often works well (can run multiple times).
- Often includes a weight momentum term $\alpha$.
- Minimizes error over training examples. Will it generalize well to subsequent testing examples?
- Training can take thousands of iterations: very slow!
- Using the network after training is very fast.

Learning Hidden Layer Representation
- A network.
- A target function.
- Can this be learned?
[Figure not preserved.]

Learning Hidden Layer Representation
- A network.
- Learned hidden layer representation.
[Figure not preserved.]

Training
[Figures not preserved (two slides).]

The "Driver" Network
[Figure not preserved.]

Artificial neural networks: what you should know
- Highly expressive non-linear functions.
- Highly parallel network of logistic function units.
- Minimizing the sum of squared training errors gives MLE estimates of the network weights if we assume zero-mean Gaussian noise on the output values.
- Minimizing the sum of squared errors plus the squared weights (regularization) gives MAP estimates, assuming weight priors are zero …
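The backpropagation steps described earlier in the lecture can be sketched for a network with one hidden layer of sigmoid units. This is a minimal illustration under stated assumptions, not the lecture's implementation. It uses the standard sigmoid-unit deltas with squared error $E_d = \frac{1}{2}\sum_k (t_k - o_k)^2$, models bias terms as a constant input of 1, and draws initial weights from $[-0.05, 0.05]$. The function names, the learning rate and the XOR training set in the usage section are hypothetical choices.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(w_hid, w_out, x):
    """Return (inputs plus bias, hidden outputs plus bias, network outputs)."""
    xb = list(x) + [1.0]  # constant input acts as the bias (assumption)
    h = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in w_hid]
    hb = h + [1.0]
    o = [sigmoid(sum(w * v for w, v in zip(row, hb))) for row in w_out]
    return xb, hb, o

def error(w_hid, w_out, x, t):
    """E_d = 1/2 * sum_k (t_k - o_k)^2 for one training example."""
    _, _, o = forward(w_hid, w_out, x)
    return 0.5 * sum((tk - ok) ** 2 for tk, ok in zip(t, o))

def backprop_step(w_hid, w_out, x, t, eta):
    """One incremental update for a single example; modifies weights in place."""
    xb, hb, o = forward(w_hid, w_out, x)
    # Output units: delta_k = o_k (1 - o_k) (t_k - o_k)
    d_out = [ok * (1 - ok) * (tk - ok) for ok, tk in zip(o, t)]
    # Hidden units: delta_h = o_h (1 - o_h) * sum_k w_{h,k} delta_k
    d_hid = [hb[h] * (1 - hb[h]) *
             sum(w_out[k][h] * d_out[k] for k in range(len(o)))
             for h in range(len(w_hid))]
    # Weight update: w_{i,j} <- w_{i,j} + eta * delta_j * x_{i,j}
    for k, row in enumerate(w_out):
        for i in range(len(row)):
            row[i] += eta * d_out[k] * hb[i]
    for h, row in enumerate(w_hid):
        for i in range(len(row)):
            row[i] += eta * d_hid[h] * xb[i]

def init_weights(n_in, n_hid, n_out, rng):
    """Small random weights; the extra column holds the bias weight."""
    w_hid = [[rng.uniform(-0.05, 0.05) for _ in range(n_in + 1)]
             for _ in range(n_hid)]
    w_out = [[rng.uniform(-0.05, 0.05) for _ in range(n_hid + 1)]
             for _ in range(n_out)]
    return w_hid, w_out

if __name__ == "__main__":
    rng = random.Random(0)
    data = [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [0])]  # XOR
    w_hid, w_out = init_weights(2, 2, 1, rng)
    print("error before:", sum(error(w_hid, w_out, x, t) for x, t in data))
    for _ in range(2000):  # thousands of iterations, as the lecture warns
        for x, t in data:
            backprop_step(w_hid, w_out, x, t, eta=0.5)
    print("error after:", sum(error(w_hid, w_out, x, t) for x, t in data))
```

Because the deltas are computed before any weight changes, each update equals $-\eta$ times the gradient of $E_d$, which is the sense in which backpropagation performs gradient descent over the whole weight vector. Whether the XOR run reaches a good minimum depends on the seed and the learning rate, which is consistent with the note that only a local minimum is guaranteed.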