Neural Networks
Aarti Singh
Machine Learning 10-601, Nov 3, 2011
Slides courtesy: Tom Mitchell

Logistic Regression
• Assumes the following functional form for P(Y|X):
    P(Y=1|X) = 1 / (1 + exp(-(w0 + Σi wi Xi)))
• This is the logistic (sigmoid) function σ(z) = 1/(1 + e^(-z)) applied to a linear function z of the data.
• Features can be discrete or continuous!

Logistic Regression is a Linear Classifier!
• Decision boundary: predict Y=1 iff P(Y=1|X) ≥ P(Y=0|X), which holds iff w0 + Σi wi Xi ≥ 0, i.e., a linear decision boundary.

Training Logistic Regression
• How to learn the parameters w0, w1, …, wd? From training data, compute Maximum (Conditional) Likelihood Estimates:
    w = argmax_w Πj P(yj | xj, w)
• Discriminative philosophy: don't waste effort learning P(X); focus on P(Y|X), since that is all that matters for classification.

Optimizing a Convex Function
• Maximizing the conditional log-likelihood = minimizing the negative conditional log-likelihood.
• The negative conditional log-likelihood is a convex function, so gradient descent converges to the global optimum.
• Gradient descent update rule, with learning rate η > 0:
    w ← w − η ∇w E(w)

Logistic Function as a Graph
• A sigmoid unit takes inputs x1, …, xd and outputs o = σ(w0 + Σi wi xi).

Neural Networks to Learn f: X → Y
• f can be a non-linear function
• X: (vector of) continuous and/or discrete variables
• Y: (vector of) continuous and/or discrete variables
• Neural networks represent f by a network of logistic/sigmoid units, arranged in an input layer X, a hidden layer H, and an output layer Y.

Example: a neural network trained to distinguish vowel sounds using 2 formants (features) learns a highly non-linear decision surface with two layers of logistic units (input layer, hidden layer, output layer).

Example: a neural network trained to drive a car!
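As a concrete illustration of the update rule above, here is a minimal sketch (not from the lecture) of training logistic regression by gradient descent on the negative conditional log-likelihood; the toy data and hyperparameters are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, eta=0.1, n_iters=1000):
    """Gradient descent on the negative conditional log-likelihood."""
    n, d = X.shape
    Xb = np.hstack([np.ones((n, 1)), X])   # prepend a bias column for w0
    w = np.zeros(d + 1)
    for _ in range(n_iters):
        p = sigmoid(Xb @ w)                # P(Y=1 | x, w) for every example
        grad = Xb.T @ (p - y)              # gradient of the negative log-likelihood
        w -= eta * grad / n                # update rule: w <- w - eta * gradient
    return w

# Toy 1-D data, linearly separable at x = 0 (an illustrative assumption)
X = np.array([[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]])
y = np.array([0, 0, 0, 1, 1, 1])
w = train_logistic(X, y)
preds = (sigmoid(np.hstack([np.ones((6, 1)), X]) @ w) >= 0.5).astype(int)
```

Because the objective is convex, this simple loop reaches a weight vector that classifies the separable toy data correctly regardless of the (zero) initialization.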
• One can visualize the weight on each pixel for a single hidden unit, and the weights from that hidden unit to the output units.

Forward Propagation for Prediction
• Prediction: given a neural network (hidden units and weights), use it to predict the label of a test point.
• Forward propagation: start from the input layer; for each subsequent layer, compute the output of each sigmoid unit from the outputs of the previous layer.
• For a 1-hidden-layer, 1-output network:
    o = σ(w0 + Σh wh oh),  where each hidden output is oh = σ(w0h + Σi wih xi)
• The whole network is differentiable.

M(C)LE Training for Neural Networks
• Consider a regression problem f: X → Y for scalar Y: y = f(x) + ε, with f deterministic and iid noise ε ~ N(0, σε²).
• Let's maximize the conditional data likelihood Πj P(yj | xj, W). Under the Gaussian noise model, this is equivalent to training the weights of all units to minimize the sum of squared errors of the predicted network outputs:
    W_MLE = argmin_W Σj (yj − ŷ(xj; W))²

MAP Training for Neural Networks
• With a Gaussian prior P(W) = N(0, σI), ln P(W) ↔ c Σi wi², so MAP training trains the weights of all units to minimize the sum of squared errors of the predicted network outputs plus the (scaled) weight magnitudes.
• E = mean squared error. For neural networks, E[w] is no longer convex in w.

Error Gradient for a Sigmoid Unit
• A sigmoid unit computes o = σ(net) with net = Σi wi xi, and σ'(net) = o(1 − o); by the chain rule, for squared error on one training example,
    ∂E/∂wi = −(y − o) o(1 − o) xi
• Notation (values obtained using forward propagation): yk = target output (label); ok, oh = unit outputs obtained by forward propagation; wij = weight from unit i to unit j; if i is an input variable, oi = xi.
• Using all training data D, sum the per-example gradients.
• The objective/error is no longer convex in the weights.

Dealing with Overfitting
• Our learning algorithm involves a parameter n = the number of gradient descent iterations. How do we choose n to optimize future error, e.g., the n that minimizes the error rate of the neural net over future data? (Note: a similar issue arises for logistic regression, decision trees, …)
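The squared-error training and sigmoid gradient above can be sketched as a small batch-gradient-descent loop for a 1-hidden-layer, 1-output sigmoid network. This is an illustrative implementation, not the lecture's code; the XOR data, layer sizes, learning rate, and iteration count are all assumptions chosen for the demo:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # XOR: not linearly separable
y = np.array([0.0, 1.0, 1.0, 0.0])

n_hidden, eta = 4, 0.5
W1 = rng.normal(0, 1, (2, n_hidden)); b1 = np.zeros(n_hidden)   # input -> hidden
W2 = rng.normal(0, 1, n_hidden);      b2 = 0.0                  # hidden -> output

# squared error at the random initialization, for comparison
o0 = sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2)
mse0 = np.mean((o0 - y) ** 2)

for _ in range(10000):
    # forward propagation, layer by layer
    h = sigmoid(X @ W1 + b1)            # hidden-unit outputs o_h
    o = sigmoid(h @ W2 + b2)            # network output
    # backward pass: delta = dE/dnet, using sigma'(net) = o(1 - o)
    delta_o = (o - y) * o * (1 - o)
    delta_h = np.outer(delta_o, W2) * h * (1 - h)
    W2 -= eta * h.T @ delta_o;  b2 -= eta * delta_o.sum()
    W1 -= eta * X.T @ delta_h;  b1 -= eta * delta_h.sum(axis=0)

mse = np.mean((o - y) ** 2)             # error after training
```

Since the error surface is non-convex, different random initializations can land in different local minima; the only safe claim is that gradient descent drives the squared error below its starting value.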
• Separate the available data into a training set and a validation set.
• Use the training set to perform gradient descent.
• n ← the number of iterations that optimizes validation set error.

K-fold Cross-validation
Idea: train multiple times, leaving out a disjoint subset of the data each time for testing. Average the test set accuracies.
________________________________________________
Partition data into K disjoint subsets
For k = 1 to K
    testData = kth subset
    h ← classifier trained* on all data except testData
    accuracy(k) = accuracy of h on testData
end
FinalAccuracy = mean of the K recorded test set accuracies
________________________________________________
* might withhold some of this data to choose the number of gradient descent steps

Leave-one-out Cross-validation
This is just K-fold cross-validation in which each subset contains one example, so each iteration leaves out a single example; the procedure is otherwise identical.

Dealing with Overfitting: Summary
• Cross-validation
• Regularization: small weights imply the NN is nearly linear (low VC dimension)
• Control the number of hidden units: low complexity

Semantic Memory Model Based on ANNs [McClelland & Rogers, Nature 2003]
[Figure: a logistic output unit, σ(w0 + Σi wi xi), with outputs labeled "left", "straight", "right", "up".]
• No hierarchy is given. Train with assertions, e.g., Can(Canary, Fly).
• Humans act as though they have a hierarchical memory organization (Thing → Living / NonLiving; Living → Animal / Plant; Animal → Bird / Fish; Bird → Canary):
  1. Victims of semantic dementia progressively lose knowledge of objects, but they lose specific details first and general properties later, suggesting a hierarchical memory organization.
  2. Children appear to learn general categories and properties first, following the same hierarchy, top down*.
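The K-fold pseudocode above can be sketched directly in code. This is an illustrative implementation; the `train_fn` interface and the majority-vote stand-in classifier are hypothetical, chosen only to make the procedure runnable:

```python
import numpy as np

def kfold_accuracy(X, y, train_fn, K=5, seed=0):
    """Partition the data into K disjoint subsets; for each k, train on all
    data except the kth subset, test on the kth, and average the accuracies."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, K)
    accs = []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        h = train_fn(X[train], y[train])          # h <- classifier trained on the rest
        accs.append(np.mean(h(X[test]) == y[test]))
    return np.mean(accs)                          # FinalAccuracy

# Hypothetical stand-in classifier: predict the majority training label
def majority_trainer(Xtr, ytr):
    label = int(round(ytr.mean()))
    return lambda Xte: np.full(len(Xte), label)

X = np.arange(20).reshape(-1, 1).astype(float)
y = np.array([0] * 5 + [1] * 15)                  # 75% of labels are 1
acc = kfold_accuracy(X, y, majority_trainer, K=5)
```

With equal-sized folds, the majority classifier always predicts 1 here, so the cross-validated accuracy equals the overall fraction of 1-labels, 0.75.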
* Some debate remains on this point.
• Question: what learning mechanism could produce this emergent hierarchy?
• Memory deterioration follows the semantic hierarchy [McClelland & Rogers, Nature 2003].

Training Networks on Time Series
• Suppose we want to predict the next state of the world, and it depends on a history of unknown length; e.g., a robot with forward-facing sensors trying to predict its next sensor reading as it moves and turns.
• Idea: use a hidden layer in the network to capture the state history (a recurrent network).
• How can we train a recurrent net? One standard answer: unroll the network through time and apply backpropagation.

Artificial Neural Networks: Summary
• Actively used to model distributed computation in the brain.
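The time-series idea above, a hidden layer that carries state history forward, can be sketched as a simple recurrent forward pass. This is an illustrative assumption-laden sketch (layer sizes, weight scales, and the random sensor sequence are all made up for the demo), not the lecture's network:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d_in, d_hid = 3, 5
W_xh = rng.normal(0, 0.1, (d_in, d_hid))   # input -> hidden weights
W_hh = rng.normal(0, 0.1, (d_hid, d_hid))  # hidden -> hidden (recurrent) weights
W_hy = rng.normal(0, 0.1, d_hid)           # hidden -> output weights

def forward(xs):
    """Run the recurrent net over a sequence; return one prediction per step."""
    h = np.zeros(d_hid)                    # initial state: empty history
    preds = []
    for x in xs:
        h = sigmoid(x @ W_xh + h @ W_hh)   # new state folds in the old state
        preds.append(sigmoid(h @ W_hy))    # predicted next sensor reading
    return np.array(preds)

seq = rng.normal(0, 1, (7, d_in))          # a length-7 fake sensor sequence
out = forward(seq)
```

Because h at time t depends on h at time t-1, the hidden layer can, in principle, summarize a history of unknown length; training would unroll this loop through time and backpropagate through the copies.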