Machine Learning 10-701/15-781, Spring 2008
Computational Learning Theory
Eric Xing
Lecture 11, February 20, 2008
Reading: Chap. 7, T. M. book

Complexity of Learning
- The complexity of learning is measured mainly along two axes: information and computation.
- The information complexity is concerned with the generalization performance of learning:
  - How many training examples are needed?
  - How fast do the learner's estimates converge to the true population parameters? Etc.
- The computational complexity concerns the computation resources applied to the training data in order to extract the learner's predictions from it.
- It seems that when an algorithm improves with respect to one of these measures, it deteriorates with respect to the other.

What General Laws constrain Inductive Learning?
- Sample complexity: how many training examples are sufficient to learn the target concept?
- Computational complexity: what resources are required to learn the target concept?
- We want a theory that relates:
  - the training examples (quantity, quality, how presented),
  - the complexity of the hypothesis/concept space H,
  - the accuracy of the approximation to the target concept,
  - the probability of successful learning.

Prototypical concept learning task (binary classification)
- Everything we say here generalizes to other problems, including regression and multi-class classification.
- Given:
  - Instances X: possible days, each described by the attributes Sky, AirTemp, Humidity, Wind, Water, Forecast.
  - Target function c: EnjoySport: X -> {0, 1}.
  - Hypothesis space H: conjunctions of literals, e.g. <?, Cold, High, ?, ?, ?>.
  - Training examples S: iid positive and negative examples of the target function, <x1, c(x1)>, ..., <xm, c(xm)>.
- Determine:
  - a hypothesis h in H such that h(x) is good w.r.t. c(x) for all x in S?
  - a hypothesis h in H such that h(x) is good w.r.t. c(x) for all x in the true distribution D?

Two Basic Competing Models
- PAC framework: the sample labels are consistent with some h in H; the learner's hypothesis is required to meet an absolute upper bound on its error.
- Agnostic framework: no prior restriction on the sample labels; the required upper bound on the hypothesis error is only relative to the best hypothesis in the class.

Sample Complexity
- How many training examples are sufficient to learn the target concept?
- Training scenarios:
  1. The learner proposes instances as queries to the teacher: the learner proposes an instance x, the teacher provides c(x).
  2. The teacher (who knows c) provides training examples: the teacher provides a sequence of examples of the form <x, c(x)>.
  3. Some random process (e.g., nature) proposes instances: the instance x is generated randomly, the teacher provides c(x).

Protocol
[Slide figure: a small table of labeled examples with attributes Temp, Press, Colour, SoreThroat and label diseaseX (values such as Temp 35/22/10, Press 95/110/87, Colour Pale/Clear/Pale, SoreThroat Y/N/N) is given to the learner; a new instance (Temp 32, Press 90, SoreThroat N, Colour Pale) is fed to the resulting classifier, which predicts diseaseX = No.]
- Given:
  - a set of examples X,
  - a fixed (unknown) distribution D over X,
  - a set of hypotheses H,
  - a set of possible target concepts C.
- The learner observes a sample S = {<xi, c(xi)>}:
  - instances xi drawn from the distribution D,
  - labeled by a target concept c in C,
  - the learner does NOT know c(.) or D.
- The learner outputs h in H estimating c:
  - h is evaluated by its performance on subsequent instances drawn from D.
- For now:
  - C = H (so c is in H),
  - noise-free data.

True error of a hypothesis
- Definition: The true error (denoted error_D(h)) of hypothesis h with respect to target concept c and distribution D is the probability that h will misclassify an instance drawn at random according to D:
  error_D(h) = Pr_{x ~ D}[ c(x) != h(x) ].

Two notions of error
- Training error of hypothesis h with respect to target concept c: how often h(x) != c(x) over the training instances from S,
  error_S(h) = Pr_{x in S}[ c(x) != h(x) ].
- True error of hypothesis h with respect to c: how often h(x) != c(x) over future random instances drawn iid from D,
  error_D(h) = Pr_{x ~ D}[ c(x) != h(x) ].
- Can we bound error_D(h) in terms of error_S(h)?
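To make the distinction concrete, here is a minimal sketch (not from the lecture; the threshold concept, the slightly-off hypothesis, and D = Uniform(0, 1) are illustrative assumptions) that computes the training error on a sample S and approximates the true error with a large fresh sample from D:

```python
# A minimal sketch (not from the lecture): a toy 1-D threshold concept under
# D = Uniform(0, 1). The concept c, the hypothesis h, and the sample sizes
# are illustrative assumptions.
import random

random.seed(0)

def c(x):                  # target concept: label 1 iff x >= 0.5
    return int(x >= 0.5)

def h(x):                  # learner's hypothesis: threshold slightly off
    return int(x >= 0.55)

def error(hyp, xs):
    """Fraction of points on which hyp disagrees with the target concept c."""
    return sum(hyp(x) != c(x) for x in xs) / len(xs)

# Training sample S drawn iid from D
S = [random.random() for _ in range(50)]
print("training error error_S(h) =", error(h, S))

# True error error_D(h) = Pr_{x~D}[h(x) != c(x)] = 0.05 for this pair;
# a large fresh sample from D approximates it.
fresh = [random.random() for _ in range(200_000)]
print("estimated true error error_D(h) ~", error(h, fresh))
```

The small training sample typically gives an error estimate that fluctuates around the true value 0.05; the question the next slides answer is how tightly, and with what probability, the two quantities can be related.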
The Union Bound
- Lemma (union bound): Let A1, A2, ..., Ak be k different events (that may not be independent). Then
  P(A1 or A2 or ... or Ak) <= P(A1) + P(A2) + ... + P(Ak).
- In probability theory, the union bound is usually stated as an axiom (and thus we won't try to prove it), but it also makes intuitive sense: the probability of any one of k events happening is at most the sum of the probabilities of the k different events.

Hoeffding inequality
- Lemma (Hoeffding inequality): Let Z1, ..., Zm be m independent and identically distributed (iid) random variables drawn from a Bernoulli(phi) distribution, i.e., P(Zi = 1) = phi and P(Zi = 0) = 1 - phi. Let phi_hat = (1/m) sum_i Zi be the mean of these random variables, and let any gamma > 0 be fixed. Then
  P( |phi - phi_hat| > gamma ) <= 2 exp(-2 gamma^2 m).
- This lemma (which in learning theory is also called the Chernoff bound) says that if we take phi_hat, the average of m Bernoulli(phi) random variables, to be our estimate of phi, then the probability of being far from the true value is small, so long as m is large. (A numerical check of this bound appears in the sketch at the end of these notes.)

Version Space
- A hypothesis h is consistent with a set of training examples S of target concept c if and only if h(x) = c(x) for each training example <xi, c(xi)> in S.
- The version space VS_{H,S} with respect to hypothesis space H and training examples S is the subset of hypotheses from H consistent with all training examples in S.

Exhausting the version space
- Definition: The version space VS_{H,S} is said to be eps-exhausted with respect to c and D if every hypothesis h in VS_{H,S} has true error less than eps with respect to c and D.

Probably Approximately Correct
- Goal: a PAC learner produces a hypothesis h_hat that is approximately correct (error_D(h_hat) ~ 0) with high probability (P(error_D(h_hat) ~ 0) ~ 1).
- Double "hedging":
  - approximately,
  - probably.
- Need both!

How many examples will eps-exhaust the VS?
- Theorem [Haussler, 1988]: If the hypothesis space H is finite, and S is a sequence of m >= 1 independent random examples of some target concept c, then for any 0 <= eps <= 1, the probability that the version space with respect to H and S is not eps-exhausted (with respect to c) is less than
  |H| e^{-eps m}.
- This bounds the probability that any consistent learner will output a hypothesis h with error_D(h) >= eps.
- If we want this probability to be below delta,
  |H| e^{-eps m} <= delta,
  then
  m >= (1/eps) ( ln|H| + ln(1/delta) ).
- Proof sketch: let h1, ..., hk be the hypotheses in H whose true error is at least eps. The version space fails to be eps-exhausted only if at least one hi is consistent with all m examples. Each such hi is consistent with a single random example with probability at most (1 - eps), hence with all m examples with probability at most (1 - eps)^m. By the union bound, the probability that any of them survives is at most k (1 - eps)^m <= |H| (1 - eps)^m <= |H| e^{-eps m}.

What it means
- [Haussler, 1988]: the probability that the version space is not eps-exhausted after m training examples is at most |H| e^{-eps m}.
- Suppose we want this probability to be at most delta.
  1. How many training examples suffice? m >= (1/eps) ( ln|H| + ln(1/delta) ).
  2. If error_train(h) = 0, then with probability at least (1 - delta), error_D(h) <= (1/m) ( ln|H| + ln(1/delta) ).

Learning Conjunctions of Boolean Literals
- How many examples are sufficient to assure with probability at least (1 - delta) that every h in VS_{H,S} satisfies error_D(h) <= eps?
- Use our theorem: m >= (1/eps) ( ln|H| + ln(1/delta) ).
- Suppose H contains conjunctions of constraints on up to n boolean attributes (i.e., n boolean literals):
  m >= (1/eps) ( ... )
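As a quick illustration, here is a minimal sketch (not from the lecture) that evaluates the bound above for this example. The hypothesis-space count |H| = 3^n for conjunctions over n boolean attributes (each attribute required true, required false, or omitted) is a standard count assumed here, not taken from the truncated slide text; the numeric values of n, eps, and delta are illustrative.

```python
# A minimal sketch evaluating m >= (1/eps) * (ln|H| + ln(1/delta)).
# Assumption: |H| = 3**n for conjunctions over n boolean attributes
# (each attribute required true, required false, or omitted).
import math

def sample_bound(size_H, eps, delta):
    """Smallest integer m satisfying m >= (1/eps) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(size_H) + math.log(1.0 / delta)) / eps)

n, eps, delta = 10, 0.1, 0.05          # illustrative values
m = sample_bound(3 ** n, eps, delta)
print(f"n = {n}, eps = {eps}, delta = {delta}  ->  m >= {m} examples suffice")
```

With these values the bound asks for about 140 examples; note it grows only logarithmically in |H| and in 1/delta, but linearly in 1/eps.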
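Finally, the numerical check of the Hoeffding inequality promised earlier: a minimal sketch (illustrative assumptions: phi = 0.3, m = 100, gamma = 0.1) comparing the empirical frequency of a large deviation against the bound 2 exp(-2 gamma^2 m).

```python
# A minimal numerical check of the Hoeffding inequality (illustrative, not
# from the lecture): estimate P(|phi_hat - phi| > gamma) by simulation and
# compare it with the bound 2 * exp(-2 * gamma**2 * m).
import math
import random

random.seed(0)

phi, m, gamma, trials = 0.3, 100, 0.1, 20_000

def deviates_once():
    """Draw m Bernoulli(phi) samples; test whether |phi_hat - phi| > gamma."""
    phi_hat = sum(random.random() < phi for _ in range(m)) / m
    return abs(phi_hat - phi) > gamma

freq = sum(deviates_once() for _ in range(trials)) / trials
bound = 2 * math.exp(-2 * gamma ** 2 * m)
print(f"empirical P(|phi_hat - phi| > {gamma}) ~ {freq:.4f}")
print(f"Hoeffding bound 2*exp(-2*gamma^2*m)   = {bound:.4f}")
```

The empirical deviation frequency comes out well below the bound, as expected: Hoeffding's inequality is a worst-case guarantee, not a tight estimate.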