Machine Learning 10-601
Tom M. Mitchell
Machine Learning Department, Carnegie Mellon University
October 25, 2011

Today:
• PAC learning
• VC dimension

Recommended reading:
• Mitchell: Ch. 7
• Suggested exercises: 7.1, 7.2, 7.7

PAC Learning Problem Setting
Problem setting:
• Set of instances X
• Set of hypotheses H
• Set of possible target functions C = {c | c : X → Y}
• Sequence of training instances drawn at random from the instance distribution; the teacher provides the noise-free label c(x)
Learner outputs a hypothesis h ∈ H such that h approximates the target function c.

Overfitting
Consider a hypothesis h and its
• Error rate over the training data: error_train(h)
• True error rate over all data: error_true(h)
We say h overfits the training data if error_true(h) > error_train(h)
Amount of overfitting = error_true(h) - error_train(h)

What it means [Haussler, 1988]: the probability that the version space is not ε-exhausted after m training examples is at most |H| e^(-εm).
1. How many training examples suffice? Suppose we want this probability to be at most δ:
   |H| e^(-εm) ≤ δ, which holds if m ≥ (1/ε)(ln|H| + ln(1/δ))
2. If m examples are observed, then with probability at least (1-δ) every hypothesis consistent with the training data satisfies
   error_true(h) ≤ (1/m)(ln|H| + ln(1/δ))
Here ε is the difference between the training error and the true error of the output hypothesis (the one with lowest training error).

Function Approximation: The Big Picture
(figure slide)

Sufficient condition for PAC learnability: holds if learner L requires only a polynomial number of training examples, and processing per example is polynomial.

Question: If H = {h | h : X → Y} is infinite, what measure of complexity should we use in place of |H|?
Answer: The largest subset of X for which H can guarantee zero training error (regardless of the target function c).
The VC dimension of H is the size of this subset.

Informal intuition: a dichotomy of a set of instances S is a labeling of each member of S as positive or negative. S is shattered by H if, for every dichotomy of S, some hypothesis in H is consistent with it; VC(H) is the size of the largest subset of X shattered by H.

(Figure slide: example with VC(H) = 3.)

Sample Complexity based on VC dimension
How many randomly drawn examples suffice to ε-exhaust VS_{H,D} with probability at least (1-δ), i.e., to guarantee that any hypothesis that perfectly fits the training data is probably (1-δ) approximately (ε) correct?
   m ≥ (1/ε)(4 log2(2/δ) + 8 VC(H) log2(13/ε))
Compare to our earlier result based on |H|:
   m ≥ (1/ε)(ln|H| + ln(1/δ))
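To make the two sample-complexity results above concrete, here is a minimal sketch (not from the original slides) that plugs example numbers into both bounds; the function names and the example values for |H|, VC(H), ε, and δ are illustrative assumptions.

```python
import math

def m_finite_H(size_H, epsilon, delta):
    """Finite-H bound: m >= (1/eps) * (ln|H| + ln(1/delta)).
    Examples sufficient to eps-exhaust the version space with prob. >= 1 - delta."""
    return math.ceil((1.0 / epsilon) * (math.log(size_H) + math.log(1.0 / delta)))

def m_vc(vc_dim, epsilon, delta):
    """VC-based bound: m >= (1/eps) * (4 log2(2/delta) + 8 VC(H) log2(13/eps))."""
    return math.ceil((1.0 / epsilon) *
                     (4 * math.log2(2.0 / delta) +
                      8 * vc_dim * math.log2(13.0 / epsilon)))

if __name__ == "__main__":
    eps, delta = 0.1, 0.05
    # Illustrative case: conjunctions over n=10 boolean features, |H| = 3^10
    print("finite-H bound:", m_finite_H(3 ** 10, eps, delta))
    # Illustrative case: linear separators in n=10 dimensions, VC(H) = n + 1 = 11
    print("VC-based bound:", m_vc(11, eps, delta))
```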
VC dimension: examples
Consider X = ℝ; we want to learn c: X → {0,1}. What is the VC dimension of
• Open intervals:
  H1 = { h | h(x)=1 iff x > a }                                  VC(H1) = 1
  H2 = { h | h(x)=1 iff x > a, or h(x)=0 iff x > a }             VC(H2) = 2
• Closed intervals:
  H3 = { h | h(x)=1 iff a < x < b }                              VC(H3) = 2
  H4 = { h | h(x)=1 iff a < x < b, or h(x)=0 iff a < x < b }     VC(H4) = 3

VC dimension: examples
What is the VC dimension of lines in a plane?
• H2 = { ((w0 + w1x1 + w2x2) > 0 → y=1) }
  – VC(H2) = 3
• For Hn = linear separating hyperplanes in n dimensions, VC(Hn) = n+1

For any finite hypothesis space H, can you give an upper bound on VC(H) in terms of |H|? (hint: yes)

More VC Dimension Examples to Think About
• Logistic regression over n continuous features
  – Over n boolean features?
• Linear SVM over n continuous features
• Decision trees defined over n boolean features F: <X1, ..., Xn> → Y
• Decision trees of depth 2 defined over n features
• How about 1-nearest neighbor?

Tightness of Bounds on Sample Complexity
How tight is this bound? How many examples m suffice to assure that any hypothesis that fits the training data perfectly is probably (1-δ) approximately (ε) correct?
   m ≥ (1/ε)(4 log2(2/δ) + 8 VC(H) log2(13/ε))

Lower bound on sample complexity (Ehrenfeucht et al., 1989):
Consider any class C of concepts such that VC(C) > 1, any learner L, any 0 < ε < 1/8, and any 0 < δ < 0.01. Then there exists a distribution and a target concept in C such that if L observes fewer examples than
   max[ (1/ε) log(1/δ), (VC(C)-1)/(32ε) ]
then with probability at least δ, L outputs a hypothesis h having error_D(h) > ε.

Agnostic Learning: VC Bounds [Schölkopf and Smola, 2002]
With probability at least (1-δ), every h ∈ H satisfies
   error_true(h) ≤ error_train(h) + sqrt( (VC(H)(ln(2m/VC(H)) + 1) + ln(4/δ)) / m )

Structural Risk Minimization [Vapnik]
Which hypothesis space should we choose?
• Bias / variance tradeoff
(Figure: nested hypothesis spaces H1 ⊂ H2 ⊂ H3 ⊂ H4)
SRM: choose H to minimize the bound on expected true error! (A small numerical sketch appears at the end of these notes.)
* unfortunately a somewhat loose bound...

PAC Learning: What You Should Know
• PAC learning: Probably (1-δ) Approximately (error ε) Correct
• The PAC learning problem setting
• Finite H, perfectly consistent learner result
• If the target function is not in H: agnostic learning
• If |H| = ∞, can use VC dimension to characterize H
• Most important:
  – Sample complexity grows with the complexity of H
  – Quantitative characterization of overfitting
• Much more: see Prof. Blum's course on Computational Learning Theory
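Returning to the agnostic VC bound and the SRM idea above, the following is a minimal illustrative sketch (not from the slides): it evaluates error_train(h) plus the VC confidence term for a few nested hypothesis spaces and picks the one whose bound is smallest. The VC dimensions and training errors are made-up numbers chosen only to show the bias/variance tradeoff.

```python
import math

def vc_bound(train_error, vc_dim, m, delta):
    """Agnostic VC bound: training error plus the confidence term
    sqrt((VC(H) * (ln(2m / VC(H)) + 1) + ln(4 / delta)) / m)."""
    confidence = math.sqrt(
        (vc_dim * (math.log(2.0 * m / vc_dim) + 1.0) + math.log(4.0 / delta)) / m
    )
    return train_error + confidence

if __name__ == "__main__":
    m, delta = 1000, 0.05
    # Hypothetical nested spaces H1 ⊂ H2 ⊂ H3 ⊂ H4: richer spaces fit the
    # training data better but pay a larger VC confidence term.
    spaces = [("H1", 3, 0.20), ("H2", 10, 0.12), ("H3", 50, 0.08), ("H4", 200, 0.05)]
    for name, vc, tr_err in spaces:
        print(name, "bound on true error:", round(vc_bound(tr_err, vc, m, delta), 3))
    # SRM: choose the space whose *bound* (not its training error) is smallest.
    best = min(spaces, key=lambda s: vc_bound(s[2], s[1], m, delta))
    print("SRM picks:", best[0])
```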