Machine Learning 10-701
Tom M. Mitchell, Machine Learning Department, Carnegie Mellon University
March 1, 2011

Today:
- Computational Learning Theory
- VC dimension
- PAC results as a quantitative model of overfitting

Recommended reading: Mitchell, Ch. 7; suggested exercises 7.1, 7.2, 7.7.

What it means [Haussler, 1988]: the probability that the version space is not ε-exhausted after m training examples is at most

  $|H| e^{-\epsilon m}$

Suppose we want this probability to be at most δ.

1. How many training examples suffice?

  $m \ge \frac{1}{\epsilon} \left( \ln|H| + \ln(1/\delta) \right)$

2. If m satisfies this bound, then with probability at least (1 - δ), every hypothesis in H that is consistent with the training data has true error at most ε.

Sufficient condition for efficient PAC learnability: holds if learner L requires only a polynomial number of training examples, and processing per example is polynomial.

Question: If H = {h | h : X → Y} is infinite, what measure of complexity should we use in place of |H|?

Answer: the size of the largest subset of X for which H can guarantee zero training error, regardless of the target function c. The VC dimension of H is the size of this subset.

Informal intuition: the more distinct labelings of a point set H can realize, the more expressive H is.

Definitions:
- A dichotomy of a set S is a labeling of each member of S as positive or negative.
- H shatters S if every dichotomy of S can be represented by some hypothesis in H.
- VC(H) is the size of the largest subset of X shattered by H.

[Figure: a three-point example in which H realizes all dichotomies, so VC(H) = 3.]

Sample Complexity based on VC dimension:
How many randomly drawn examples suffice to ε-exhaust VS_{H,D} with probability at least (1 - δ), i.e., to guarantee that any hypothesis that perfectly fits the training data is probably (1 - δ) approximately (ε) correct?

  $m \ge \frac{1}{\epsilon} \left( 4 \log_2(2/\delta) + 8\, VC(H) \log_2(13/\epsilon) \right)$

Compare to our earlier result based on |H|:

  $m \ge \frac{1}{\epsilon} \left( \ln|H| + \ln(1/\delta) \right)$

VC dimension examples:
Consider X = ℝ; we want to learn c : X → {0,1}. What is the VC dimension of:
- Open intervals:
  - H1: h(x) = 1 iff x > a. VC(H1) = 1
  - H2: h(x) = 1 iff x > a, or h(x) = 1 iff x < a. VC(H2) = 2
- Closed intervals:
  - H3: h(x) = 1 iff a ≤ x ≤ b. VC(H3) = 2
  - H4: h(x) = 1 iff a ≤ x ≤ b, or h(x) = 0 iff a ≤ x ≤ b. VC(H4) = 3

VC dimension examples:
What is the VC dimension of lines in a plane?

  H2 = { h | h(x) = 1 iff w0 + w1 x1 + w2 x2 ≥ 0 }

VC(H2) = 3. More generally, for Hn = linear separating hyperplanes in n dimensions, VC(Hn) = n + 1.

For any finite hypothesis space H, can you give an upper bound on VC(H) in terms of |H|? (Hint: yes.) Shattering d points requires 2^d distinct hypotheses, so VC(H) ≤ log2 |H|.

More VC Dimension Examples to Think About:
- Logistic regression over n continuous features? Over n boolean features?
- Linear SVM over n continuous features?
- Decision trees defined over n boolean features, f : ⟨X1, ..., Xn⟩ → Y?
- Decision trees of depth 2 defined over n features?
- How about 1-nearest neighbor?

Tightness of Bounds on Sample Complexity:
How many examples m suffice to assure that any hypothesis that fits the training data perfectly is probably (1 - δ) approximately (ε) correct?

  $m \ge \frac{1}{\epsilon} \left( 4 \log_2(2/\delta) + 8\, VC(H) \log_2(13/\epsilon) \right)$

How tight is this bound?

Lower bound on sample complexity [Ehrenfeucht et al., 1989]: Consider any class C of concepts such that VC(C) > 1, any learner L, any 0 < ε < 1/8, and any 0 < δ < 0.01. Then there exists a distribution D and a target concept in C such that if L observes fewer examples than

  $\max\left[ \frac{1}{\epsilon} \log(1/\delta),\; \frac{VC(C) - 1}{32\epsilon} \right]$

then with probability at least δ, L outputs a hypothesis h with error_D(h) > ε.

Agnostic Learning: VC Bounds [Schölkopf and Smola, 2002]:
With probability at least (1 - δ), every h ∈ H satisfies

  $error_{true}(h) \le error_{train}(h) + \sqrt{ \frac{ VC(H)\left( \ln\frac{2m}{VC(H)} + 1 \right) + \ln\frac{4}{\delta} }{ m } }$
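The interval examples above can be verified mechanically. Here is a minimal Python sketch (an illustration added to these notes, not part of the lecture; the function names and the candidate-threshold trick are my own) that brute-forces every dichotomy of a small point set to check whether the one-sided thresholds H1 and the closed intervals H3 shatter it, reproducing VC(H1) = 1 and VC(H3) = 2.

```python
from itertools import product

# Illustrative sketch (not from the lecture): brute-force check of whether a
# hypothesis class realizes every dichotomy of a point set S on the real line.

def h1_labels(points, a):
    # H1: h(x) = 1 iff x > a   (one-sided threshold)
    return tuple(1 if x > a else 0 for x in points)

def h3_labels(points, a, b):
    # H3: h(x) = 1 iff a <= x <= b   (closed interval)
    return tuple(1 if a <= x <= b else 0 for x in points)

def candidate_thresholds(points):
    # On the real line only the ordering matters, so it suffices to try one
    # threshold outside the points on each side and one between each pair.
    xs = sorted(points)
    return [xs[0] - 1.0, xs[-1] + 1.0] + [(x + y) / 2 for x, y in zip(xs, xs[1:])]

def shatters(points, labelings):
    # S is shattered iff all 2^|S| dichotomies appear among the labelings.
    return set(labelings) >= set(product([0, 1], repeat=len(points)))

def h1_all(points):
    return [h1_labels(points, a) for a in candidate_thresholds(points)]

def h3_all(points):
    cs = candidate_thresholds(points)
    return [h3_labels(points, a, b) for a in cs for b in cs if a <= b]

if __name__ == "__main__":
    # VC(H1) = 1: a single point is shattered, but no two points are.
    print(shatters((0.0,), h1_all((0.0,))))          # True
    print(shatters((0.0, 1.0), h1_all((0.0, 1.0))))  # False: (1, 0) unrealizable
    # VC(H3) = 2: two points are shattered, but no three points are.
    print(shatters((0.0, 1.0), h3_all((0.0, 1.0))))  # True
    S = (0.0, 1.0, 2.0)
    print(shatters(S, h3_all(S)))                    # False: (1, 0, 1) unrealizable
```

The brute force is sound because for thresholds and intervals on the real line only the ordering of the points matters, so finitely many candidate boundaries cover all realizable dichotomies.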
Structural Risk Minimization [Vapnik]:
Which hypothesis space should we choose? Bias / variance tradeoff.

[Figure: nested hypothesis spaces H1 ⊂ H2 ⊂ H3 ⊂ H4.]

SRM: choose H to minimize the bound on expected true error! (Unfortunately, a somewhat loose bound...)

PAC Learning: What You Should Know
- PAC learning: Probably (1 - δ) Approximately (ε error) Correct
- Problem setting
- Finite H, perfectly consistent learner result
- If the target function is not in H: agnostic learning
- If |H| = ∞: use VC dimension to characterize H
- Most important: sample complexity grows with the complexity of H
- Quantitative characterization of overfitting
- Much more: see Prof. Blum's course on Computational Learning Theory
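To make the "sample complexity grows with the complexity of H" takeaway concrete, here is a small Python sketch (again my addition, not from the slides) that evaluates the two sufficient-sample-size bounds quoted above; the choice of boolean conjunctions (|H| = 3^n) and halfspaces (VC = n + 1) as example classes is mine.

```python
import math

# Illustrative sketch (not from the slides): evaluate the two sufficient
# sample-size bounds for a consistent learner quoted in these notes.

def m_finite_h(h_size, eps, delta):
    # m >= (1/eps) (ln|H| + ln(1/delta))                      [finite H]
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / eps)

def m_vc(vc_dim, eps, delta):
    # m >= (1/eps) (4 log2(2/delta) + 8 VC(H) log2(13/eps))   [VC bound]
    return math.ceil((4 * math.log2(2 / delta)
                      + 8 * vc_dim * math.log2(13 / eps)) / eps)

if __name__ == "__main__":
    eps, delta = 0.1, 0.05
    # Example finite class: conjunctions of boolean literals over n variables,
    # |H| = 3^n (each variable appears positive, negated, or not at all).
    for n in (10, 20, 40):
        print(f"conjunctions, n={n:2d}: m >= {m_finite_h(3 ** n, eps, delta)}")
    # Example infinite class: halfspaces in R^n, VC = n + 1 (as on the slide).
    for n in (2, 10, 40):
        print(f"halfspaces,   n={n:2d}: m >= {m_vc(n + 1, eps, delta)}")
```

Both bounds scale roughly linearly in the complexity measure (ln|H| or VC(H)) and in 1/ε, which is exactly the quantitative picture of overfitting the summary refers to.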