Learning Theory

Nicholas Ruozzi
University of Texas at Dallas

Based on the slides of Vibhav Gogate and David Sontag

Learning Theory

• So far, we have focused only on algorithms for finding the best hypothesis in the hypothesis space
• How do we know that the learned hypothesis will perform well on the test set?
• How many samples do we need to make sure that we learn a good hypothesis?
• In what situations is learning possible?

Learning Theory

• If the training data is linearly separable, we saw that the perceptron/SVMs will always perfectly classify the training data
• This does not mean that they will perfectly classify the test data
• Intuitively, if the true distribution of samples is linearly separable, then seeing more data should help us do better

Problem Complexity

• The complexity of a learning problem depends on
  • the size/expressiveness of the hypothesis space
  • the accuracy to which the target concept must be approximated
  • the probability with which the learner must produce a successful hypothesis
  • the manner in which training examples are presented, e.g., randomly or by query to an oracle

Problem Complexity

• Measures of complexity
  • Sample complexity: how much data you need in order to learn a good hypothesis (with high probability)
  • Computational complexity: the amount of time and space required to accurately solve the learning problem (with high probability)
• Higher sample complexity means higher computational complexity

PAC Learning

• Probably approximately correct (PAC) learning
• The only reasonable expectation of a learner is that, with high probability, it learns a close approximation to the target concept
• Specify two small parameters, ε and δ, and require that, with probability at least (1 − δ), the system learns a concept with error at most ε

Consistent Learners

• Imagine a simple setting
  • the hypothesis space is finite (i.e., |H| = c for some constant c)
  • the true distribution of the data is p(x), and there are no noisy labels
  • we learned a perfect classifier on the training set; call it h ∈ H
• A learner is said to be consistent if it always outputs a hypothesis that perfectly classifies the training data (assuming that one exists)
• We want to compute the (expected) error of such a classifier

Notions of Error

• Training error of h ∈ H
  • the error on the training data
  • the number of incorrectly classified samples divided by the total number of samples
• True error of h ∈ H
  • the error over all possible future random samples
  • the probability, with respect to the data-generating distribution, that h misclassifies a random data point: p(h(x) ≠ y)

Learning Theory

• Assume that there exists a hypothesis in H that perfectly classifies all data points and that |H| is finite
• The version space (the set of consistent hypotheses) is said to be ε-exhausted if and only if every consistent hypothesis has true error less than ε
• We want enough samples to guarantee that every consistent hypothesis has error at most ε
• We'll show that, given enough samples, w.h.p. every hypothesis with true error at least ε is not consistent with the data
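To make the two notions of error concrete, here is a minimal numerical sketch (not from the slides; the uniform data distribution, the labeling rule y = 1[x ≥ 0.5], and the finite hypothesis space of threshold classifiers are all assumptions chosen for illustration). It compares training error on M samples against a Monte Carlo estimate of the true error p(h(x) ≠ y), and prints the version space, i.e., the hypotheses consistent with the training set:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy setup: x ~ Uniform[0, 1], true label y = 1[x >= 0.5],
# and a finite hypothesis space H of threshold classifiers
# h_t(x) = 1[x >= t] for 21 evenly spaced thresholds t.
def true_label(x):
    return (x >= 0.5).astype(int)

def h(t, x):
    return (x >= t).astype(int)

thresholds = np.linspace(0.0, 1.0, 21)  # the finite hypothesis space H

# Training error: fraction of the M training samples that h misclassifies.
M = 50
x_train = rng.random(M)
y_train = true_label(x_train)
train_err = np.array([np.mean(h(t, x_train) != y_train) for t in thresholds])

# True error p(h(x) != y), estimated by Monte Carlo on fresh samples.
x_fresh = rng.random(200_000)
y_fresh = true_label(x_fresh)
true_err = np.array([np.mean(h(t, x_fresh) != y_fresh) for t in thresholds])

for t, e_tr, e_true in zip(thresholds, train_err, true_err):
    print(f"t = {t:.2f}   training error = {e_tr:.3f}   true error ≈ {e_true:.3f}")

# The version space: hypotheses with zero training error.
print("version space:", [f"{t:.2f}" for t in thresholds[train_err == 0]])
```

With only M = 50 samples, several thresholds near 0.5 may be consistent with the training data even though their true error is nonzero; the argument on the following slides bounds how likely such bad-but-consistent hypotheses are.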
Learning Theory

• Let $(x^{(1)}, y^{(1)}), \dots, (x^{(M)}, y^{(M)})$ be M labeled data points sampled independently according to p
• Let $C_m^h$ be a random variable that indicates whether or not the m-th data point is correctly classified by h
• The probability that h misclassifies the m-th data point is

  $p(C_m^h = 0) = \sum_{(x,y)} p(x, y) \, 1[h(x) \neq y] = \epsilon_h$

  i.e., the probability that a randomly sampled pair (x, y) is incorrectly classified by h; this is exactly the true error of hypothesis h

Learning Theory

• What is the probability that all M data points are classified correctly?

  $p(C_1^h = 1, \dots, C_M^h = 1) = \prod_{m=1}^{M} p(C_m^h = 1) = (1 - \epsilon_h)^M$

• The probability that a hypothesis h ∈ H whose true error is at least ε correctly classifies all M data points is then

  $p(C_1^h = 1, \dots, C_M^h = 1) \leq (1 - \epsilon)^M \leq e^{-\epsilon M}$, for ε ≤ 1
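The chain of inequalities above is easy to sanity-check numerically. The sketch below (the example values ε = 0.1 and M = 50 are assumptions chosen only for illustration) computes $(1 - \epsilon)^M$ and the bound $e^{-\epsilon M}$, and also simulates the probability that a hypothesis with true error exactly ε correctly classifies all M independent samples:

```python
import numpy as np

rng = np.random.default_rng(1)

eps, M = 0.1, 50  # example values, chosen only for illustration

# Analytic probability that an eps-bad hypothesis classifies all M
# i.i.d. samples correctly, and the exponential upper bound.
exact = (1 - eps) ** M
bound = np.exp(-eps * M)
print(f"(1 - eps)^M = {exact:.4e}  <=  e^(-eps M) = {bound:.4e}")

# Monte Carlo check: in each trial, each of the M points is classified
# correctly with probability 1 - eps; the trial "survives" only if all
# M points are classified correctly.
trials = 200_000
survived = np.all(rng.random((trials, M)) > eps, axis=1)
print(f"simulated survival probability ≈ {survived.mean():.4e}")
```

Because the bound decays exponentially in M, any single hypothesis with true error at least ε is very unlikely to remain consistent with the data once M is moderately large, which is exactly the claim the slides set out to show.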