Learning Theory

Nicholas Ruozzi
University of Texas at Dallas

Based on the slides of Vibhav Gogate and David Sontag

Learning Theory

• So far, we have focused only on algorithms for finding the best hypothesis in the hypothesis space
• How do we know that the learned hypothesis will perform well on the test set?
• How many samples do we need to make sure that we learn a good hypothesis?
• In what situations is learning possible?

Learning Theory

• If the training data is linearly separable, we saw that the perceptron/SVMs will always perfectly classify the training data
• This does not mean that they will perfectly classify the test data
• Intuitively, if the true distribution of samples is linearly separable, then seeing more data should help us do better

Problem Complexity

• The complexity of a learning problem depends on
  • the size/expressiveness of the hypothesis space
  • the accuracy to which the target concept must be approximated
  • the probability with which the learner must produce a successful hypothesis
  • the manner in which training examples are presented, e.g., randomly or by query to an oracle

Problem Complexity

• Measures of complexity
  • Sample complexity: how much data you need in order to learn a good hypothesis (with high probability)
  • Computational complexity: the amount of time and space required to accurately solve the learning problem (with high probability)
• Higher sample complexity means higher computational complexity

PAC Learning

• Probably approximately correct (PAC) learning
• The only reasonable expectation of a learner is that, with high probability, it learns a close approximation to the target concept
• Specify two small parameters, ε and δ, and require that, with probability at least (1 − δ), the system learns a concept with error at most ε

Consistent Learners

• Imagine a simple setting
  • the hypothesis space is finite (i.e., |H| = c for some constant c)
  • the true distribution of the data is p(x), and there are no noisy labels
  • we learned a perfect classifier on the training set; call it h ∈ H
• A learner is said to be consistent if it always outputs a hypothesis that perfectly classifies the training data (assuming that one exists)
• We want to compute the (expected) error of such a classifier

Notions of Error

• Training error of h ∈ H
  • the error on the training data
  • the number of incorrectly classified samples divided by the total number of samples
• True error of h ∈ H
  • the error over all possible future random samples
  • the probability, with respect to the data-generating distribution, that h misclassifies a random data point: p(h(x) ≠ y)

Learning Theory

• Assume that there exists a hypothesis in H that perfectly classifies all data points and that |H| is finite
• The version space (the set of consistent hypotheses) is said to be ε-exhausted if and only if every consistent hypothesis has true error less than ε
• We want enough samples to guarantee that every consistent hypothesis has error at most ε
• We'll show that, given enough samples, w.h.p. every hypothesis with true error at least ε is not consistent with the data
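To make the two notions of error concrete, here is a minimal numerical sketch (not from the slides; the uniform data distribution, the labeling rule y = 1[x ≥ 0.5], and the finite hypothesis space of threshold classifiers are all assumptions chosen for illustration). It compares training error on M samples against a Monte Carlo estimate of the true error p(h(x) ≠ y), and prints the version space, i.e., the hypotheses consistent with the training set:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy setup: x ~ Uniform[0, 1], true label y = 1[x >= 0.5],
# and a finite hypothesis space H of threshold classifiers
# h_t(x) = 1[x >= t] for 21 evenly spaced thresholds t.
def true_label(x):
    return (x >= 0.5).astype(int)

def h(t, x):
    return (x >= t).astype(int)

thresholds = np.linspace(0.0, 1.0, 21)  # the finite hypothesis space H

# Training error: fraction of the M training samples that h misclassifies.
M = 50
x_train = rng.random(M)
y_train = true_label(x_train)
train_err = np.array([np.mean(h(t, x_train) != y_train) for t in thresholds])

# True error p(h(x) != y), estimated by Monte Carlo on fresh samples.
x_fresh = rng.random(200_000)
y_fresh = true_label(x_fresh)
true_err = np.array([np.mean(h(t, x_fresh) != y_fresh) for t in thresholds])

for t, e_tr, e_true in zip(thresholds, train_err, true_err):
    print(f"t = {t:.2f}   training error = {e_tr:.3f}   true error ≈ {e_true:.3f}")

# The version space: hypotheses with zero training error.
print("version space:", [f"{t:.2f}" for t in thresholds[train_err == 0]])
```

With only M = 50 samples, several thresholds near 0.5 may be consistent with the training data even though their true error is nonzero; the argument on the following slides bounds how likely such bad-but-consistent hypotheses are.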
Learning Theory

• Let $(x^{(1)}, y^{(1)}), \dots, (x^{(M)}, y^{(M)})$ be M labeled data points sampled independently according to p
• Let $C_m^h$ be a random variable that indicates whether or not the m-th data point is correctly classified by h
• The probability that h misclassifies the m-th data point is

  $p(C_m^h = 0) = \sum_{(x,y)} p(x, y) \, 1[h(x) \neq y] = \epsilon_h$

  i.e., the probability that a randomly sampled pair (x, y) is incorrectly classified by h; this is exactly the true error of hypothesis h

Learning Theory

• What is the probability that all M data points are classified correctly?

  $p(C_1^h = 1, \dots, C_M^h = 1) = \prod_{m=1}^{M} p(C_m^h = 1) = (1 - \epsilon_h)^M$

• The probability that a hypothesis h ∈ H whose true error is at least ε correctly classifies all M data points is then

  $p(C_1^h = 1, \dots, C_M^h = 1) \leq (1 - \epsilon)^M \leq e^{-\epsilon M}$, for ε ≤ 1
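The chain of inequalities above is easy to sanity-check numerically. The sketch below (the example values ε = 0.1 and M = 50 are assumptions chosen only for illustration) computes $(1 - \epsilon)^M$ and the bound $e^{-\epsilon M}$, and also simulates the probability that a hypothesis with true error exactly ε correctly classifies all M independent samples:

```python
import numpy as np

rng = np.random.default_rng(1)

eps, M = 0.1, 50  # example values, chosen only for illustration

# Analytic probability that an eps-bad hypothesis classifies all M
# i.i.d. samples correctly, and the exponential upper bound.
exact = (1 - eps) ** M
bound = np.exp(-eps * M)
print(f"(1 - eps)^M = {exact:.4e}  <=  e^(-eps M) = {bound:.4e}")

# Monte Carlo check: in each trial, each of the M points is classified
# correctly with probability 1 - eps; the trial "survives" only if all
# M points are classified correctly.
trials = 200_000
survived = np.all(rng.random((trials, M)) > eps, axis=1)
print(f"simulated survival probability ≈ {survived.mean():.4e}")
```

Because the bound decays exponentially in M, any single hypothesis with true error at least ε is very unlikely to remain consistent with the data once M is moderately large, which is exactly the claim the slides set out to show.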