Machine Learning 10-701
Tom M. Mitchell
Machine Learning Department, Carnegie Mellon University
March 24, 2011

Today:
- Non-linear regression
- Artificial neural networks
- Backpropagation
- Cognitive modeling
- Deep belief networks

Reading: Mitchell, Chapter 4; Bishop, Chapter 5

Artificial Neural Networks

Used to learn f: X -> Y, where
- f might be a non-linear function
- X is a vector of continuous and/or discrete variables
- Y is a vector of continuous and/or discrete variables

Represent f by a network of logistic units:
- each unit computes a logistic function, o = \sigma(w_0 + \sum_i w_i x_i), where \sigma(z) = 1/(1 + e^{-z})
- MLE: train the weights of all units to minimize the sum of squared errors of the predicted network outputs
- MAP: train to minimize the sum of squared errors plus the weight magnitudes

ALVINN (Pomerleau 1993)
[Figure: the ALVINN network, which learned to steer an autonomous vehicle from camera images.]

M(C)LE Training for Neural Networks

Consider the regression problem of learning f: X -> Y for scalar Y,
  y = f(x) + \epsilon,
where f is deterministic and \epsilon is iid noise, \epsilon \sim N(0, \sigma^2). Let's maximize the conditional data likelihood:
  W \leftarrow \arg\max_W \ln \prod_d P(y_d \mid x_d, W)
which, for Gaussian noise, is equivalent to minimizing the sum of squared errors:
  W \leftarrow \arg\min_W \sum_d \big(y_d - \hat{f}(x_d; W)\big)^2
[Figure: a learned neural network fit to the training data.]

MAP Training for Neural Networks

Same regression problem (y = f(x) + \epsilon, noise \epsilon \sim N(0, \sigma^2), f deterministic), but now assume a Gaussian prior over the weights, P(W) = N(0, \sigma_W I). Then
  W \leftarrow \arg\max_W \ln \Big[ P(W) \prod_d P(y_d \mid x_d, W) \Big]
Since \ln P(W) = -c \sum_i w_i^2 + \text{const}, this is equivalent to
  W \leftarrow \arg\min_W \Big[ c \sum_i w_i^2 + \sum_d \big(y_d - \hat{f}(x_d; W)\big)^2 \Big]
i.e., squared error plus a penalty on weight magnitudes ("weight decay").

Gradient Descent for a Single Sigmoid Unit (MLE)

Notation: x_d = input, t_d = target output, o_d = observed unit output, w_i = weight i.
With o_d = \sigma(\mathbf{w} \cdot \mathbf{x}_d), minimize
  E_D[\mathbf{w}] = \tfrac{1}{2} \sum_d (t_d - o_d)^2, \qquad
  \frac{\partial E_D}{\partial w_i} = -\sum_d (t_d - o_d)\, o_d (1 - o_d)\, x_{i,d}
(see the first code sketch below).

Backpropagation (MLE)

Notation: x_d = input, t_d = target output, o_d = observed unit output, w_{ij} = weight from unit i to unit j.
Initialize all weights to small random values. Until convergence, for each training example:
1. Propagate the input forward and compute the output o_u of every unit u.
2. For each output unit k: \delta_k \leftarrow o_k (1 - o_k)(t_k - o_k)
3. For each hidden unit h: \delta_h \leftarrow o_h (1 - o_h) \sum_k w_{hk} \delta_k
4. Update each weight: w_{ij} \leftarrow w_{ij} + \eta\, \delta_j\, x_{ij}
(see the backpropagation sketch below).

Dealing with Overfitting

Our learning algorithm involves a parameter n = the number of gradient descent iterations. How do we choose n to optimize future error, i.e., the n that minimizes the error rate of the neural net over future data? (Note the similar issue for logistic regression, decision trees, etc.)

Answer: separate the available data into a training set and a validation set.
- Use the training set to perform gradient descent.
- Set n = the number of iterations that optimizes validation-set error.
- This gives an unbiased estimate of the optimal n (but a biased estimate of the true error). (See the early-stopping sketch below.)

K-Fold Cross Validation

Idea: train multiple times, leaving out a disjoint subset of the data each time for testing; average the test-set accuracies (see the cross-validation sketch below).

  Partition the data into K disjoint subsets.
  For k = 1 to K:
    testData = the k-th subset
    h = classifier trained* on all data except testData
    accuracy(k) = accuracy of h on testData
  end
  FinalAccuracy = mean of the K recorded test-set accuracies

  * might withhold some of this data to choose the number of gradient descent steps

Leave-One-Out Cross Validation

This is just K-fold cross validation where each subset contains a single example (K = number of examples); the procedure is otherwise identical to the one above.

[Figure residue: learned network weights (w_0, ...) for outputs "left", "straight", "right", "up" — the face-pose recognition example of Mitchell, Chapter 4.]
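Code sketches. Below is a minimal NumPy sketch of batch gradient descent for a single sigmoid unit under the M(C)LE and MAP objectives above. The function name train_sigmoid_unit, the learning rate eta, and the penalty coefficient c are illustrative choices, not from the slides; c = 0 recovers the pure MLE objective.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_sigmoid_unit(X, t, eta=0.1, n_iters=1000, c=0.0):
    """Batch gradient descent for one sigmoid unit on
    E(w) = c * sum_i w_i^2 + sum_d (t_d - o_d)^2,
    where o_d = sigmoid(w . x_d). c = 0 gives the MLE objective;
    c > 0 gives the MAP objective with a Gaussian prior on the weights.
    X: (D, n_features) inputs, t: (D,) targets in (0, 1)."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])  # x_0 = 1 carries the bias w_0
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iters):
        o = sigmoid(Xb @ w)                        # observed unit outputs o_d
        # dE/dw_i = -2 sum_d (t_d - o_d) o_d (1 - o_d) x_{i,d} + 2 c w_i
        grad = -2 * Xb.T @ ((t - o) * o * (1 - o)) + 2 * c * w
        w -= eta * grad
    return w
```

Calling train_sigmoid_unit(X, t, c=0.01) then corresponds to the weight-decay objective on the MAP slide.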
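Next, a sketch of the backpropagation procedure above for a network with one hidden layer of sigmoid units, sigmoid outputs, squared-error loss, and per-example (stochastic) weight updates. It assumes NumPy; the names (backprop_train, n_hidden, eta) and the layer sizes are illustrative, not from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_train(X, T, n_hidden=3, eta=0.05, n_iters=5000, seed=0):
    """Backpropagation for one hidden layer of sigmoid units.
    X: (D, n_in) inputs, T: (D, n_out) targets in (0, 1)."""
    rng = np.random.default_rng(seed)
    D, n_in = X.shape
    n_out = T.shape[1]
    # initialize weights to small random values; row 0 holds the bias w_0
    W1 = rng.uniform(-0.05, 0.05, (n_in + 1, n_hidden))
    W2 = rng.uniform(-0.05, 0.05, (n_hidden + 1, n_out))
    for _ in range(n_iters):
        d = rng.integers(D)                         # pick one training example
        x = np.append(1.0, X[d])                    # x_0 = 1 for the bias
        h = sigmoid(x @ W1)                         # hidden unit outputs
        hb = np.append(1.0, h)
        o = sigmoid(hb @ W2)                        # network outputs o_k
        delta_o = o * (1 - o) * (T[d] - o)          # output-unit errors delta_k
        delta_h = h * (1 - h) * (W2[1:] @ delta_o)  # hidden-unit errors delta_h
        W2 += eta * np.outer(hb, delta_o)           # w_ij <- w_ij + eta delta_j x_ij
        W1 += eta * np.outer(x, delta_h)
    return W1, W2
```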
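A sketch of choosing n by validation-set error (early stopping), as described under Dealing with Overfitting, applied to the single sigmoid unit. It assumes NumPy; the split fraction val_frac and the function name are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def choose_n_by_validation(X, t, eta=0.1, max_iters=2000, val_frac=0.3, seed=0):
    """Hold out a validation set, run gradient descent on the training
    split, and return the iteration count n with lowest validation SSE."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_val = int(val_frac * len(X))
    val, tr = idx[:n_val], idx[n_val:]
    Xb = np.hstack([np.ones((len(X), 1)), X])      # x_0 = 1 for the bias w_0
    w = np.zeros(Xb.shape[1])
    best_n, best_err = 0, np.inf
    for n in range(1, max_iters + 1):
        o = sigmoid(Xb[tr] @ w)
        # one descent step on E = 1/2 sum_d (t_d - o_d)^2 over the training split
        w += eta * Xb[tr].T @ ((t[tr] - o) * o * (1 - o))
        val_err = np.sum((t[val] - sigmoid(Xb[val] @ w)) ** 2)
        if val_err < best_err:
            best_n, best_err = n, val_err
    return best_n
```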
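Finally, a sketch of the K-fold cross validation procedure above. Here train_fn is a hypothetical callback that returns a classifier exposing a predict method; that interface is an assumption for illustration. Setting K = len(X) gives leave-one-out cross validation.

```python
import numpy as np

def k_fold_accuracy(X, y, train_fn, K=10, seed=0):
    """Partition the data into K disjoint subsets; for each k, train on
    all data except the k-th subset, test on the k-th subset, and
    return the mean of the K test-set accuracies."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), K)
    accs = []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        h = train_fn(X[train], y[train])           # hypothetical training callback
        accs.append(np.mean(h.predict(X[test]) == y[test]))
    return np.mean(accs)
```

As the slide notes, part of each training split might itself be withheld as a validation set to choose the number of gradient descent steps.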