Computational Learning Theory – Part 2 Machine Learning 10-701 Tom M. Mitchell Machine Learning Department Carnegie Mellon University November 3, 2010 Reading: • Mitchell chapter 7 Suggested exercises: • 7.1, 7.2, 7.5, 7.7What it means [Haussler, 1988]: probability that the version space is not ε-exhausted after m training examples is at most 1. How many training examples suffice?"Suppose we want this probability to be at most δ"2. If then with probability at least (1-δ):"Sufficient condition: Holds if L requires only a polynomial number of training examples, and processing per example is polynomialQuestion: If H = {h | h: X Y} is infinite, what measure of complexity should we use in place of |H| ?Question: If H = {h | h: X Y} is infinite, what measure of complexity should we use in place of |H| ? Answer: The largest subset of X for which H can guarantee zero training error (regardless of the target function c)Question: If H = {h | h: X Y} is infinite, what measure of complexity should we use in place of |H| ? Answer: The largest subset of X for which H can guarantee zero training error (regardless of the target function c) VC dimension of H is the size of this subsetQuestion: If H = {h | h: X Y} is infinite, what measure of complexity should we use in place of |H| ? Answer: The largest subset of X for which H can guarantee zero training error (regardless of the target function c) Informal intuition:a labeling of each member of S as positive or negativeVC(H)=3!Compare to our earlier results based on |H|: How many randomly drawn examples suffice to ε-exhaust VSH,D with probability at least (1-δ)? ie., to guarantee that any hypothesis that perfectly fits the training data is probably (1-δ) approximately (ε) correct Sample Complexity based on VC dimensionVC dimension: examples Consider X = <, want to learn c:X{0,1} What is VC dimension of • Open intervals: • Closed intervals: xVC dimension: examples Consider X = <, want to learn c:X{0,1} What is VC dimension of • Open intervals: • Closed intervals: xVC(H1)=1 VC(H2)=2 VC(H3)=2 VC(H4)=3VC dimension: examples What is VC dimension of lines in a plane? • H2 = { ((w0 + w1x1 + w2x2)>0 y=1) }VC dimension: examples What is VC dimension of • H2 = { ((w0 + w1x1 + w2x2)>0 y=1) } – VC(H2)=3 • For Hn = linear separating hyperplanes in n dimensions, VC(Hn)=n+1For any finite hypothesis space H, can you give an upper bound on VC(H) in terms of |H| ? (hint: yes)More VC Dimension Examples to Think About • Logistic regression over n continuous features – Over n boolean features? • Linear SVM over n continuous features • Decision trees defined over n boolean features F: <X1, ... Xn> Y!• Decision trees of depth 2 defined over n features • How about 1-nearest neighbor?How tight is this bound?!How many examples m suffice to assure that any hypothesis that fits the training data perfectly is probably (1-δ) approximately (ε) correct?!Tightness of Bounds on Sample ComplexityHow tight is this bound?!How many examples m suffice to assure that any hypothesis that fits the training data perfectly is probably (1-δ) approximately (ε) correct?!Tightness of Bounds on Sample Complexity Lower bound on sample complexity (Ehrenfeucht et al., 1989):!Consider any class C of concepts such that VC(C) > 1, any learner L, any 0 < ε < 1/8, and any 0 < δ < 0.01. Then there exists a distribution and a target concept in C, such that if L observes fewer examples than !Then with probability at least δ, L outputs a hypothesis with !Agnostic Learning: VC Bounds With probability at least (1-δ) every h ∈ H satisfies [Schölkopf and Smola, 2002]Structural Risk Minimization Which hypothesis space should we choose? • Bias / variance tradeoff H1 H2 H3 H4 [Vapnik] SRM: choose H to minimize bound on true error! * unfortunately a somewhat loose bound...What You Should Know • Sample complexity varies with the learning setting – Learner actively queries trainer – Examples arrive at random – … • Within the PAC learning setting, we can bound the probability that learner will output hypothesis with given error – For ANY consistent learner (case where c ∈ H) – For ANY “best fit” hypothesis (agnostic learning, where perhaps c not in H) • VC dimension as measure of complexity of H • Mistake bounds • Conference on Learning Theory: http://www.learningtheory.org • Avrim Blum’s course on Machine Learning Theory: –
View Full Document