Machine Learning 10-601
Tom M. Mitchell
Machine Learning Department, Carnegie Mellon University
October 20, 2011

Today:
• Computational Learning Theory
• PAC learning theorem
• VC dimension

Recommended reading:
• Mitchell: Ch. 7
• Suggested exercises: 7.1, 7.2, 7.7

Computational Learning Theory
• What general laws constrain inductive learning?
• We want a theory that relates:
  – Probability of successful learning
  – Number of training examples
  – Complexity of the hypothesis space
  – Accuracy to which the target function is approximated
  – Manner in which training examples are presented
* See the annual Conference on Computational Learning Theory (COLT)

Sample Complexity
How many training examples suffice to learn the target concept?
1. If the learner proposes instances as queries to the teacher
   – learner proposes x, teacher provides f(x)
2. If the teacher (who knows f(x)) proposes training examples
   – teacher proposes the sequence {<x1, f(x1)>, ..., <xn, f(xn)>}
3. If some random process (e.g., nature) proposes instances, and the teacher labels them
   – instances drawn according to P(X)

Function Approximation: The Big Picture
[Figure: instances x are drawn at random from the probability distribution P(X) and labeled by the teacher, producing the training examples D]

Sample Complexity (setting 3)
Problem setting:
• Set of instances X
• Set of hypotheses H
• Set of possible target functions C = {c : X → {0,1}}
• Sequence of training instances drawn at random from P(X); the teacher provides the noise-free label c(x)
Learner outputs a hypothesis h such that h(x) = c(x) for every training example.
The true error of h is the probability that it will misclassify an example drawn at random from P(X):
  error_true(h) ≡ Pr_{x ~ P(X)} [ h(x) ≠ c(x) ]

Overfitting
Consider a hypothesis h and its
• Error rate over the training data D:  error_train(h) ≡ (1/|D|) Σ_{x ∈ D} 1[ h(x) ≠ c(x) ]
• True error rate over all data:        error_true(h) ≡ Pr_{x ~ P(X)} [ h(x) ≠ c(x) ]
We say h overfits the training data if error_true(h) > error_train(h).
Amount of overfitting = error_true(h) − error_train(h)

Can we bound error_true(h) in terms of error_train(h)?
If D were a set of examples drawn from P(X) and independent of h, then we could use standard statistical confidence intervals to conclude that, with 95% probability, error_true(h) lies in the interval
  error_train(h) ± 1.96 · sqrt( error_train(h) (1 − error_train(h)) / |D| )
But D is the training data for h ...

Function Approximation: The Big Picture
[Figure: target function c : X → {0,1}, hypothesis space H, training data D]

Theorem [Haussler, 1988]: If the hypothesis space H is finite, and D is a sequence of m ≥ 1 independently drawn examples of some target concept c, then for any 0 ≤ ε ≤ 1, the probability that the version space VS_{H,D} contains a hypothesis with true error greater than ε is at most |H| e^{−εm}.
This applies to any(!) learner that outputs a hypothesis consistent with all training examples (i.e., an h contained in VS_{H,D}).

What it means [Haussler, 1988]: the probability that the version space is not ε-exhausted after m training examples is at most |H| e^{−εm}.
1. How many training examples suffice? Suppose we want this probability to be at most δ:
   |H| e^{−εm} ≤ δ    ⇒    m ≥ (1/ε) ( ln|H| + ln(1/δ) )
2. If error_train(h) = 0, then with probability at least (1 − δ):
   error_true(h) ≤ (1/m) ( ln|H| + ln(1/δ) )
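As a quick sanity check on this bound, here is a minimal Python sketch (the function name and the hypothesis-space count are illustrative assumptions, not taken from the slides) that plugs in the numbers used in the two examples that follow (confidence 0.99, so δ = 0.01, and ε = 0.05):

```python
import math

def consistent_learner_bound(h_size, epsilon, delta):
    """Smallest m satisfying m >= (1/epsilon) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / epsilon)

# Probability of success at least 0.99 (delta = 0.01), true error at most 0.05.
delta, epsilon = 0.01, 0.05

# Illustrative hypothesis-space size: if each of 4 boolean attributes may be
# constrained to 0, constrained to 1, or left unconstrained ("?"), there are
# 3**4 = 81 rules. (This count is an assumption for the sketch, not an answer
# given on the slides.)
print(consistent_learner_bound(3 ** 4, epsilon, delta))  # -> 180
```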
Example: Simple decision trees
Consider a Boolean classification problem:
• instances: X = <X1 ... XN>, where each Xi is boolean
• each hypothesis in H is a decision tree of depth 1
How many training examples m suffice to assure that, with probability at least 0.99, any consistent learner using H will output a hypothesis with true error at most 0.05?

Example: H is a Conjunction of up to N Boolean Literals
Consider the classification problem f : X → Y:
• instances: X = <X1 X2 X3 X4>, where each Xi is boolean
• each hypothesis in H is a rule of the form:
  – IF <X1 X2 X3 X4> = <0, ?, 1, ?> THEN Y = 1, ELSE Y = 0
  – i.e., a rule may constrain any subset of the Xi
How many training examples m suffice to assure that, with probability at least 0.99, any consistent learner using H will output a hypothesis with true error at most 0.05?

The PAC learning model
Definition: C is PAC-learnable by learner L using H if, for any target concept c in C, any distribution P(X), and any ε and δ, L outputs with probability at least (1 − δ) a hypothesis h with error_true(h) ≤ ε, in time polynomial in 1/ε, 1/δ, and the size of the instances and of c.
Sufficient condition: this holds if learner L requires only a polynomial number of training examples, and the processing per example is polynomial.

Agnostic learning: the learner outputs the hypothesis in H with the lowest training error. Here ε is the difference between the training error and the true error of that output hypothesis.

Additive Hoeffding Bounds – Agnostic Learning
• Given m independent flips of a coin with true Pr(heads) = θ, we can bound the probability that the maximum likelihood estimate θ̂ underestimates θ by more than ε:
  Pr[ θ > θ̂ + ε ] ≤ e^{−2mε²}
• Relevance to agnostic learning: for any single hypothesis h,
  Pr[ error_true(h) > error_train(h) + ε ] ≤ e^{−2mε²}
• But we must consider all hypotheses in H:
  Pr[ (∃ h ∈ H) error_true(h) > error_train(h) + ε ] ≤ |H| e^{−2mε²}
• So, with probability at least (1 − δ), every h in H satisfies
  error_true(h) ≤ error_train(h) + sqrt( ( ln|H| + ln(1/δ) ) / (2m) )

General Hoeffding Bounds
• When estimating a parameter θ inside [a, b] from m examples:
  Pr[ |θ̂ − θ| > ε ] ≤ 2 e^{−2mε² / (b − a)²}
• When estimating a probability, θ lies inside [0, 1], so:
  Pr[ |θ̂ − θ| > ε ] ≤ 2 e^{−2mε²}
• And if we are interested in only one-sided error, then:
  Pr[ θ̂ − θ > ε ] ≤ e^{−2mε²}
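A small Python sketch of the agnostic bound above (the function and variable names, and the example numbers, are illustrative assumptions): given |H|, m, and δ, it computes the additive slack ε = sqrt((ln|H| + ln(1/δ)) / (2m)) by which the true error can exceed the training error.

```python
import math

def agnostic_slack(h_size, m, delta):
    """Solve |H| * exp(-2 m eps^2) = delta for eps: with probability at least
    1 - delta, every h in H has error_true(h) <= error_train(h) + eps."""
    return math.sqrt((math.log(h_size) + math.log(1.0 / delta)) / (2.0 * m))

# Illustrative numbers (assumptions, not from the slides): |H| = 81 hypotheses,
# m = 1000 training examples, 99% confidence.
eps = agnostic_slack(h_size=81, m=1000, delta=0.01)
print(round(eps, 3))  # ~0.067
```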
Question: If H = {h | h : X → Y} is infinite, what measure of complexity should we use in place of |H|?
Answer: The largest subset of X for which H can guarantee zero training error (regardless of the target function c).
The VC dimension of H is the size of this subset.
Informal intuition: [figure]

Shattering
• A dichotomy of a set S is a labeling of each member of S as positive or negative.
• A set of instances S is shattered by the hypothesis space H if and only if, for every dichotomy of S, there exists some hypothesis in H consistent with this dichotomy.
• VC(H) is the size of the largest finite subset of X shattered by H.

[Figure: an example hypothesis space with VC(H) = 3]

Sample Complexity based on VC dimension
How many randomly drawn examples suffice to ε-exhaust VS_{H,D} with probability at least (1 − δ), i.e., to guarantee that any hypothesis that perfectly fits the training data is probably (1 − δ) approximately (ε) correct?
  m ≥ (1/ε) ( 4 log2(2/δ) + 8 VC(H) log2(13/ε) )
Compare to our earlier result based on |H|:
  m ≥ (1/ε) ( ln|H| + ln(1/δ) )

VC dimension: examples
Consider X = ℜ (the real line); we want to learn c : X → {0,1}. What is the VC dimension of
• Open intervals:
  – H1: if x > a then y = 1 else y = 0                                         VC(H1) = 1
  – H2: if x > a then y = 1 else y = 0, or if x > a then y = 0 else y = 1      VC(H2) = 2
• Closed intervals:
  – H3: if a ≤ x ≤ b then y = 1 else y = 0                                     VC(H3) = 2
  – H4: if a ≤ x ≤ b then y = 1 else y = 0, or if a ≤ x ≤ b then y = 0 else y = 1   VC(H4) = 3

VC dimension: examples
What is the VC dimension of lines in a plane?
• H2 = { ( (w0 + w1x1 + w2x2) > 0 → y = 1 ) }
  – VC(H2) = 3
• For Hn = linear separating hyperplanes in n dimensions, VC(Hn) = n + 1

For any finite hypothesis space H, can you give an upper bound on VC(H) in terms of |H|? (hint: yes)
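To make the shattering definition concrete, here is a brute-force check in Python (a sketch; the helper names are mine) that closed intervals on the real line, the class H3 above, shatter some 2-point set but no 3-point set, matching VC(H3) = 2:

```python
from itertools import product

def interval_consistent(points, labels):
    """Is some closed interval [a, b] consistent with this dichotomy
    (label 1 inside the interval, 0 outside)?"""
    pos = [x for x, y in zip(points, labels) if y == 1]
    neg = [x for x, y in zip(points, labels) if y == 0]
    if not pos:
        return True  # choose an interval containing none of the points
    a, b = min(pos), max(pos)          # tightest interval covering the positives
    return not any(a <= x <= b for x in neg)

def shattered(points):
    """True iff closed intervals realize every dichotomy of the points."""
    return all(interval_consistent(points, labels)
               for labels in product([0, 1], repeat=len(points)))

print(shattered([1.0, 2.0]))       # True:  some 2-point set is shattered
print(shattered([1.0, 2.0, 3.0]))  # False: the dichotomy (1, 0, 1) cannot be
                                   # realized, and the same holds for any 3 points
```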