1 Machine Learning 10-701 Tom M. Mitchell Machine Learning Department Carnegie Mellon University March 1, 2011 Today: • Computational Learning Theory • VC dimension • PAC results as quantitative model of overfitting Recommended reading: • Mitchell: Ch. 7 • suggested exercises: 7.1, 7.2, 7.7 What it means [Haussler, 1988]: probability that the version space is not ε-exhausted after m training examples is at most 1. How many training examples suffice?"Suppose we want this probability to be at most δ"2. If then with probability at least (1-δ):"2 Sufficient condition: Holds if learner L requires only a polynomial number of training examples, and processing per example is polynomial3 Question: If H = {h | h: X Y} is infinite, what measure of complexity should we use in place of |H| ? Question: If H = {h | h: X Y} is infinite, what measure of complexity should we use in place of |H| ? Answer: The largest subset of X for which H can guarantee zero training error (regardless of the target function c)4 Question: If H = {h | h: X Y} is infinite, what measure of complexity should we use in place of |H| ? Answer: The largest subset of X for which H can guarantee zero training error (regardless of the target function c) VC dimension of H is the size of this subset Question: If H = {h | h: X Y} is infinite, what measure of complexity should we use in place of |H| ? Answer: The largest subset of X for which H can guarantee zero training error (regardless of the target function c) Informal intuition:5 a labeling of each member of S as positive or negative6 VC(H)=3!Compare to our earlier results based on |H|: How many randomly drawn examples suffice to ε-exhaust VSH,D with probability at least (1-δ)? ie., to guarantee that any hypothesis that perfectly fits the training data is probably (1-δ) approximately (ε) correct Sample Complexity based on VC dimension7 VC dimension: examples Consider X = <, want to learn c:X{0,1} What is VC dimension of • Open intervals: • Closed intervals: xVC dimension: examples Consider X = <, want to learn c:X{0,1} What is VC dimension of • Open intervals: • Closed intervals: xVC(H1)=1 VC(H2)=2 VC(H3)=2 VC(H4)=38 VC dimension: examples What is VC dimension of lines in a plane? • H2 = { ((w0 + w1x1 + w2x2)>0 y=1) } VC dimension: examples What is VC dimension of • H2 = { ((w0 + w1x1 + w2x2)>0 y=1) } – VC(H2)=3 • For Hn = linear separating hyperplanes in n dimensions, VC(Hn)=n+19 For any finite hypothesis space H, can you give an upper bound on VC(H) in terms of |H| ? (hint: yes) Can you give an upper bound on VC(H) in terms of |H|, for any hypothesis space H? (hint: yes)10 More VC Dimension Examples to Think About • Logistic regression over n continuous features – Over n boolean features? • Linear SVM over n continuous features • Decision trees defined over n boolean features F: <X1, ... Xn> Y!• Decision trees of depth 2 defined over n features • How about 1-nearest neighbor? How tight is this bound?!How many examples m suffice to assure that any hypothesis that fits the training data perfectly is probably (1-δ) approximately (ε) correct?!Tightness of Bounds on Sample Complexity11 How tight is this bound?!How many examples m suffice to assure that any hypothesis that fits the training data perfectly is probably (1-δ) approximately (ε) correct?!Tightness of Bounds on Sample Complexity Lower bound on sample complexity (Ehrenfeucht et al., 1989):!Consider any class C of concepts such that VC(C) > 1, any learner L, any 0 < ε < 1/8, and any 0 < δ < 0.01. Then there exists a distribution and a target concept in C, such that if L observes fewer examples than !Then with probability at least δ, L outputs a hypothesis with !Agnostic Learning: VC Bounds With probability at least (1-δ) every h ∈ H satisfies [Schölkopf and Smola, 2002]12 Structural Risk Minimization Which hypothesis space should we choose? • Bias / variance tradeoff H1 H2 H3 H4 [Vapnik] SRM: choose H to minimize bound on expected true error! * unfortunately a somewhat loose bound... PAC Learning: What You Should Know • PAC learning: Probably (1-δ) Approximately (error ε) Correct • Problem setting • Finite H, perfectly consistent learner result • If target function is not in H, agnostic learning • If |H| = ∞ , use VC dimension to characterize H • Most important: – Sample complexity grows with complexity of H – Quantitative characterization of overfitting • Much more: see Prof. Blum’s course on Computational Learning
View Full Document