Overfitting, PAC Learning, VC Dimension, VC Bounds, Mistake Bounds, Semi-Supervised Learning
Yi Zhang
10-701 Machine Learning, Spring 2011, March 22nd, 2011

Outline
- Overfitting: true, training, and testing errors; overfitting
- PAC learning, finite hypothesis space: the consistent-learner case and the agnostic case
- PAC learning, infinite hypothesis space: VC dimension, VC bounds, structural risk minimization
- Mistake bounds: Find-S, the halving algorithm, the weighted majority algorithm
- Semi-supervised learning: the general idea, EM, co-training, NELL

Training error and true error
- True error of h: the probability that h misclassifies an example drawn from the true distribution, error_true(h) = P_{x ~ P(X)}[h(x) ≠ c(x)].
- Training error of h: the fraction of the m training examples that h misclassifies, error_train(h) = (1/m) Σ_i 1[h(x_i) ≠ c(x_i)].
- Is the training error an unbiased estimate of the true error? No. Training error is only an approximation to the true error.
- Key: h is selected using the training examples, so on the selected h the training error is likely to be an underestimate.

Overfitting
- A hypothesis h ∈ H overfits the training data if there is an alternative h' ∈ H with higher training error but lower true error.

Testing error and true error
- Testing error is an unbiased approximation to the true error, since the testing set consists of i.i.d. samples drawn from the true distribution, independently of h.

An example of overfitting
- What if the training set were infinite? With unlimited i.i.d. training data the training error converges to the true error, so overfitting is a finite-sample phenomenon.

PAC learning: finite hypothesis space
- Training error underestimates the true error. In PAC learning we seek theory to relate:
- the number of training samples m;
- the gap ε between training and true errors;
- the complexity |H| of the hypothesis space H;
- the confidence of this relation: it holds with probability at least 1 − δ.

A special case: training error is 0
- The same four quantities, with the gap now between a training error of 0 and the true error.
- What is the probability that there exists a consistent hypothesis (training error 0) with true error ≥ ε? I.e., represent this probability using the other quantities.

Derivation
- A single hypothesis with true error ≥ ε is consistent with m i.i.d. training examples with probability at most (1 − ε)^m.
- Union bound over H: P(∃ consistent h with error_true(h) ≥ ε) ≤ |H|(1 − ε)^m ≤ |H| e^{−εm}.

Bounds for finite hypothesis space
- Setting |H| e^{−εm} ≤ δ and solving for m: a consistent learner achieves true error ≤ ε with probability at least 1 − δ once m ≥ (1/ε)(ln|H| + ln(1/δ)).

Agnostic learning
- Training error is not 0; we relate the same four quantities, with ε the gap between training and true errors.
- The bound on the gap is derived from the Hoeffding bound: for a single h, P(error_true(h) > error_train(h) + ε) ≤ e^{−2mε²}.
- Union bound over H: requiring |H| e^{−2mε²} ≤ δ gives m ≥ (1/(2ε²))(ln|H| + ln(1/δ)).
- Also, solving for ε: with probability at least 1 − δ, every h ∈ H satisfies error_true(h) ≤ error_train(h) + √((ln|H| + ln(1/δ)) / (2m)).

PAC learnable
- A concept class C is PAC-learnable by learner L using H if, for any target concept c ∈ C, any data distribution, and any 0 < ε, δ < 1/2, L outputs with probability at least 1 − δ a hypothesis h with error_true(h) ≤ ε, in time polynomial in 1/ε, 1/δ, and the sizes of the instances and the target concept.

PAC learning: infinite hypothesis space
- The bounds for a finite hypothesis space depend on ln|H|, which is useless when H is infinite (e.g., all linear separators).
- We need a different measure of the complexity of H: the VC dimension.

VC dimension
- VC(H) = the size of the largest sample set that can be shattered by H.
- Shatter: H shatters a set if, for every possible labeling of the set, some h ∈ H classifies it correctly.

VC dimension: an example
- Linear separators in the plane shatter any 3 points in general position, but no set of 4 points, so their VC dimension is 3.

VC(H) vs. |H|
- Any relation between VC(H) and |H|? For finite H, shattering d points requires 2^d distinct hypotheses, so VC(H) ≤ log2|H|.

VC bounds
- Bound on m using the other quantities: m ≥ (1/ε)(4 log2(2/δ) + 8 VC(H) log2(13/ε)) examples suffice for a consistent learner to guarantee true error ≤ ε with probability at least 1 − δ.
- Bound on the error using the other quantities: with probability at least 1 − δ, error_true(h) ≤ error_train(h) + √((VC(H)(ln(2m/VC(H)) + 1) + ln(4/δ)) / m).

Structural risk minimization
- The VC bound is a ceiling on the true error: training error plus a complexity penalty that grows with VC(H) and shrinks with m.
- Structural risk minimization chooses, over a nested family of hypothesis spaces, the hypothesis that minimizes this bound, trading training error against complexity; numeric sketches of these bounds follow below.
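As a rough numeric illustration of the finite-hypothesis-space bounds above, here is a minimal Python sketch. The function names and the example hypothesis space (boolean conjunctions over 10 features, |H| = 3^10) are illustrative assumptions, not from the lecture; the formulas are the consistent-learner and agnostic bounds just derived.

```python
import math

def sample_size_consistent(H_size, epsilon, delta):
    """Consistent-learner bound: m >= (1/eps)(ln|H| + ln(1/delta)) samples
    guarantee that, with probability >= 1 - delta, every hypothesis with
    zero training error has true error <= epsilon."""
    return math.ceil((math.log(H_size) + math.log(1.0 / delta)) / epsilon)

def sample_size_agnostic(H_size, epsilon, delta):
    """Agnostic bound via Hoeffding + union bound:
    m >= (1/(2 eps^2))(ln|H| + ln(1/delta)) samples guarantee that, with
    probability >= 1 - delta, no h exceeds its training error by more
    than epsilon."""
    return math.ceil((math.log(H_size) + math.log(1.0 / delta)) / (2 * epsilon ** 2))

def epsilon_gap(H_size, m, delta):
    """The agnostic bound solved for epsilon: the guaranteed ceiling on
    true error minus training error after m samples."""
    return math.sqrt((math.log(H_size) + math.log(1.0 / delta)) / (2 * m))

if __name__ == "__main__":
    # Conjunctions over n = 10 boolean features: each feature appears
    # positive, negated, or not at all, so |H| = 3^10.
    H = 3 ** 10
    print(sample_size_consistent(H, epsilon=0.1, delta=0.05))  # 140
    print(sample_size_agnostic(H, epsilon=0.1, delta=0.05))    # 700
    print(epsilon_gap(H, m=1000, delta=0.05))                  # ~0.084
```

Note how the agnostic guarantee needs roughly 1/ε times as many samples as the consistent case for the same gap, reflecting the weaker e^{−2mε²} tail.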
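And a small sketch of the VC error bound and structural risk minimization. The nested hypothesis classes and their training errors below are invented for illustration; the bound is the standard VC form quoted above.

```python
import math

def vc_error_bound(train_error, vc_dim, m, delta):
    """VC bound: true error <= training error +
    sqrt((VC(H)(ln(2m/VC(H)) + 1) + ln(4/delta)) / m)."""
    penalty = math.sqrt(
        (vc_dim * (math.log(2 * m / vc_dim) + 1) + math.log(4 / delta)) / m
    )
    return train_error + penalty

if __name__ == "__main__":
    m, delta = 1000, 0.05
    # Hypothetical nested classes: richer classes (larger VC dimension)
    # achieve lower training error.
    candidates = [(2, 0.20), (10, 0.10), (50, 0.05), (200, 0.02)]
    for d, err in candidates:
        print(d, err, round(vc_error_bound(err, d, m, delta), 3))
    # SRM: minimize the bound on true error, not the training error alone.
    best = min(candidates, key=lambda c: vc_error_bound(c[1], c[0], m, delta))
    print("SRM picks the class with VC dimension", best[0])
```

With these invented numbers the complexity penalty dominates and SRM prefers the simplest class; as m grows the penalty shrinks and richer, lower-training-error classes win the comparison.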
Mistake bounds
- Consider the following setting: instances are drawn randomly from X according to the data distribution P(X), and the learner must classify each instance x before seeing its label.
- How many mistakes does the learner make before it converges to the correct concept?
- Analogy: given a pool of experts, how many mistakes before we find the true expert?
- Difference from the PAC learning bounds: we do not care how many samples we see, only how many mistakes we make.

Find-S
- For conjunctions of up to n boolean literals, Find-S makes at most n + 1 mistakes before converging to the target concept.

Halving algorithm
- Start from a hypothesis space H. Given each new instance x:
- classify x by majority vote over all h in H;
- obtain the true label of x;
- remove from H every h that misclassifies x.
- Bound on the number of mistakes K: every mistake means the majority voted wrong, so at least half of the surviving hypotheses are eliminated; hence K ≤ log2|H| (a runnable sketch appears at the end of this section).

Weighted majority algorithm
- What if there is no perfect hypothesis h in the hypothesis space H? Can we design an algorithm using H whose number of mistakes is close to that of the best h in H?
- Yes: the weighted majority algorithm.
- Assign initial weight 1 to each h in H.
- Make predictions by weighted majority voting.
- Update the weight of each h in H: multiply the weight of every h that erred by some β < 1.
- Mistake bound: with β = 1/2, the algorithm makes at most 2.4 (k + log2|H|) mistakes, where k is the number of mistakes made by the best h in H.

Semi-supervised learning
- Learn from both labeled and unlabeled data.
- Why do we care? Unlabeled data is much easier to obtain. How can we use unlabeled data to help?

EM
- Learn the initial model using a few labeled examples.
- Iterate: use the model to guess the unknown labels, then re-learn the model using the labeled + unlabeled data.
- Any problem? The initial model can be inaccurate, so the guesses on the unknown labels may be inaccurate, and the model is then re-learned from inaccurate information (a self-training sketch appears at the end of this section).

Co-training and multi-view learning
- The features in X can be split into multiple views. Ideally, each view is sufficient to predict Y, and the views are conditionally independent given Y.
- Example: predicting whether a web page belongs to a professor, using a hyperlink view (words in links pointing to the page) and a page view (words on the page itself).
- Difference to EM: directly assign labels instead of estimating an expectation, and use two or more models trained on different views.
- Potential problem: self-labeling noise.
- Idea for dealing with self-labeling noise: in the last step, add to the labeled set L only the consistent self-labeled examples, i.e., those on which the views agree.

Semi-supervised learning in NELL
- NELL (Never-Ending Language Learning) uses coupled semi-supervised learning: many category and relation classifiers are trained together, and the coupling constraints between them help filter self-labeling noise.
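A minimal sketch of the halving algorithm over an explicit finite hypothesis space. The toy space of integer threshold functions and the example stream are illustrative assumptions, not from the lecture.

```python
def halving_learner(hypotheses, stream):
    """Predict each label by majority vote of the surviving hypotheses,
    then drop every hypothesis that misclassified the revealed label."""
    version_space = list(hypotheses)
    mistakes = 0
    for x, y in stream:
        votes_for_1 = sum(h(x) for h in version_space)
        prediction = 1 if 2 * votes_for_1 >= len(version_space) else 0
        if prediction != y:
            mistakes += 1
        version_space = [h for h in version_space if h(x) == y]
    return mistakes, version_space

if __name__ == "__main__":
    # H = threshold functions h_t(x) = 1 iff x >= t, for t in 0..15 (|H| = 16).
    H = [lambda x, t=t: int(x >= t) for t in range(16)]
    # Target concept: the threshold t = 11.
    stream = [(x, int(x >= 11)) for x in [7, 13, 2, 10, 11, 14, 5, 12]]
    mistakes, survivors = halving_learner(H, stream)
    # Guaranteed: at most log2|H| = 4 mistakes before convergence.
    print(mistakes, len(survivors))  # survivors shrink to the single target
```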
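A sketch of the weighted majority algorithm with the usual β = 1/2 penalty, reusing the same toy threshold experts; the flipped label simulates a setting where no expert is perfect.

```python
def weighted_majority(experts, stream, beta=0.5):
    """Weighted vote over experts; multiply an expert's weight by beta
    each time it errs. With beta = 1/2 the mistake bound is
    2.4 * (k + log2|H|), where k is the best expert's mistake count."""
    weights = [1.0] * len(experts)
    mistakes = 0
    for x, y in stream:
        vote_1 = sum(w for w, h in zip(weights, experts) if h(x) == 1)
        vote_0 = sum(w for w, h in zip(weights, experts) if h(x) == 0)
        prediction = 1 if vote_1 >= vote_0 else 0
        if prediction != y:
            mistakes += 1
        weights = [w * beta if h(x) != y else w
                   for w, h in zip(weights, experts)]
    return mistakes

if __name__ == "__main__":
    experts = [lambda x, t=t: int(x >= t) for t in range(16)]
    stream = [(x, int(x >= 11)) for x in range(16)]
    stream[3] = (3, 1)  # one noisy label: now no expert is perfect
    print(weighted_majority(experts, stream))
```

Unlike the halving algorithm, no expert is ever discarded: imperfect experts merely lose weight, which is what lets the bound track the best expert rather than require a perfect one.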
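Finally, a minimal self-training loop in the spirit of the EM and co-training slides: fit on a few labeled points, guess labels for the unlabeled pool, and re-learn. The nearest-centroid model and the 1-D data are illustrative simplifications, not the lecture's exact procedure (a full EM would keep soft posteriors, and co-training would alternate two views).

```python
def fit_centroids(points, labels):
    """One centroid per class; the 'model' predicts by nearest centroid."""
    return {c: sum(x for x, y in zip(points, labels) if y == c) /
               sum(1 for y in labels if y == c)
            for c in set(labels)}

def predict(centroids, x):
    return min(centroids, key=lambda c: abs(x - centroids[c]))

def self_train(labeled, unlabeled, rounds=5):
    points = [x for x, _ in labeled]
    labels = [y for _, y in labeled]
    for _ in range(rounds):
        model = fit_centroids(points, labels)
        # Guess labels for the unlabeled pool, then re-learn on everything.
        points = [x for x, _ in labeled] + unlabeled
        labels = [y for _, y in labeled] + [predict(model, x) for x in unlabeled]
    return fit_centroids(points, labels)

if __name__ == "__main__":
    labeled = [(0.0, "a"), (10.0, "b")]          # a few labeled examples
    unlabeled = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]   # cheap unlabeled pool
    print(self_train(labeled, unlabeled))        # centroids near 1.5 and 8.5
```

If the initial labeled points were unrepresentative, the guessed labels would be wrong and the re-learned centroids would drift, which is exactly the failure mode noted on the EM slide.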