More Learning Theory
Nicholas Ruozzi, University of Texas at Dallas
Based on the slides of Vibhav Gogate and David Sontag

Last Time
Probably approximately correct (PAC) learning: the only reasonable expectation of a learner is that, with high probability, it learns a close approximation to the target concept.
Specify two small parameters, $\epsilon$ and $\delta$:
- $\epsilon \in (0, 1)$ is the error of the approximation.
- $1 - \delta$ is the probability that, given $m$ i.i.d. samples, our learning algorithm produces a classifier with error at most $\epsilon$, where $\delta \in (0, 1/2)$.

Learning Theory
We use the observed data in order to learn a classifier, and we want to know how far the learned classifier deviates from the unknown underlying distribution.
- With too few samples, we will, with high probability, learn a classifier whose true error is quite high, even though it may be a perfect classifier for the observed data.
- As we see more samples, we pick a classifier from the hypothesis space with low training error and hope that it also has low true error.
- We want this to hold with high probability: can we bound how many samples we need?

Haussler, 1988
What we proved last time.
Theorem: for a finite hypothesis space $H$, $m$ i.i.d. samples, and $\epsilon \in (0, 1)$, the probability that any consistent classifier has true error larger than $\epsilon$ is at most $|H| e^{-\epsilon m}$.
We can turn this into a sample complexity bound.

Sample Complexity
Let $\delta$ be an upper bound on the desired probability of not exhausting the version space. The probability that the version space is not exhausted is at most $|H| e^{-\epsilon m} \le \delta$. Solving for $m$ yields
$m \ge \frac{1}{\epsilon}\left(\ln |H| + \ln \frac{1}{\delta}\right)$.

PAC Bounds
Theorem: for a finite hypothesis space $H$, $m$ i.i.d. samples, and $\epsilon \in (0, 1)$, the probability that the true error of any of the best classifiers (i.e., those with lowest training error) exceeds its training error plus $\epsilon$ is at most $2|H| e^{-2m\epsilon^2}$.
Sample complexity for desired $\epsilon$ and $\delta$:
$m \ge \frac{1}{2\epsilon^2}\left(\ln |H| + \ln \frac{2}{\delta}\right)$.

PAC Bounds
If we require that the previous error is bounded above by a fixed $\delta$, then with probability $1 - \delta$, for all $h \in H$,
$\mathrm{error}_{\mathrm{true}}(h) \le \underbrace{\mathrm{error}_{\mathrm{train}}(h)}_{\text{bias}} + \underbrace{\sqrt{\frac{\ln |H| + \ln \frac{1}{\delta}}{2m}}}_{\text{variance}}.$
This follows from the Chernoff (Hoeffding) bound, $\Pr(|\hat{p} - p| > \epsilon) \le 2 e^{-2m\epsilon^2}$, by setting the failure probability to $\delta$ and solving for $\epsilon$.
- For small $|H|$: high bias (there may not be enough hypotheses to choose from), but low variance.
- For large $|H|$: low bias (lots of good hypotheses), but high variance.

PAC Learning
Given:
- a set of data,
- a hypothesis space $H$,
- a set of target concepts $C$,
- training instances of the form $\langle x, c(x) \rangle$ drawn from an unknown probability distribution $D$ over $X$.
Goal: learn the target concept $c \in C$.

PAC Learning
Given a concept class $C$ over $X$, a learner $L$ with hypothesis space $H$, and two constants $\epsilon, \delta$ with $0 < \epsilon, \delta < 1/2$: $C$ is said to be PAC learnable by $L$ using $H$ iff, for every $c \in C$ and every distribution $D$ over $X$, the learner $L$, by sampling $m$ instances, outputs with probability at least $1 - \delta$ a hypothesis $h \in H$ such that $\mathrm{error}_D(h) \le \epsilon$, with running time polynomial in $1/\epsilon$, $1/\delta$, and $|H|$.

VC Dimension
Our analysis for the finite case was based on $|H|$; if $H$ isn't finite, this translates into infinite sample complexity. We can derive a different notion of complexity for infinite hypothesis spaces by considering only the number of points that can be correctly classified by some member of $H$. We will only consider the binary classification case for now.
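Before turning to infinite hypothesis spaces, it may help to see the two finite-$|H|$ sample complexity formulas above evaluated numerically. Below is a minimal Python sketch (not part of the original slides; the function names are illustrative, and the example class size of $3^{10}$ is the usual count of conjunctions over 10 Boolean features).

```python
import math

def sample_complexity_consistent(h_size, eps, delta):
    """Samples sufficient so that, with probability >= 1 - delta, every
    hypothesis consistent with the data has true error <= eps (Haussler)."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / eps)

def sample_complexity_agnostic(h_size, eps, delta):
    """Samples sufficient so that, with probability >= 1 - delta, every
    hypothesis's true error is within eps of its training error
    (Hoeffding plus a union bound over H)."""
    return math.ceil((math.log(h_size) + math.log(2.0 / delta)) / (2.0 * eps ** 2))

# Illustrative example: |H| = 3^10 (conjunctions over 10 Boolean features)
print(sample_complexity_consistent(3 ** 10, eps=0.1, delta=0.05))  # 140
print(sample_complexity_agnostic(3 ** 10, eps=0.1, delta=0.05))    # 734
```

Note how the $1/\epsilon^2$ dependence of the agnostic bound makes it markedly larger than the consistent-learner bound for the same $\epsilon$ and $\delta$.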
VC Dimension
What is the largest number of data points in 1-D that can be correctly classified by a linear separator, regardless of their assigned labels?
- 2 points: yes, every labeling of two points in one dimension can be separated.
- 3 points: no. For any collection of three or more points in one dimension, there is always some choice of pluses and minuses such that the points cannot be classified with a linear separator.

VC Dimension
A set of points is shattered by a hypothesis space $H$ if and only if, for every partition of the set of points into positive and negative examples, there exists some $h \in H$ consistent with that partition.
The Vapnik-Chervonenkis (VC) dimension of $H$ is the size of the largest finite subset of inputs from $X$ shattered by $H$.

VC Dimension
Common misconception: the VC dimension is determined by the largest shatterable set of points, not by the highest number $n$ such that all sets of $n$ points can be shattered. For example, three collinear points in the plane cannot be shattered by a line, yet three points in general position can be shattered by a line no matter the labels, so the VC dimension of linear separators in 2-D is at least 3.

VC Dimension
What is the VC dimension of 2-D space under linear separators? It is at least three, from the last slide. Can some set of four points be shattered? No: no set of four points can be shattered, so the VC dimension is at most 3, and therefore exactly 3.

VC Dimension
If the VC dimension of $H$ is $d$, there exists a set of size $d$ in $X$ that can be shattered by $H$, but no set of size $d + 1$ can be shattered. The larger the subset of $X$ that can be shattered, the more expressive the hypothesis space is. If arbitrarily large finite subsets of $X$ can be shattered, then the VC dimension of $H$ is infinite.

Axis-Parallel Rectangles
Let $X$ be the set of all points in $\mathbb{R}^2$, and let $H$ be the set of all axis-parallel rectangles in 2-D, where a point is labeled positive if it lies inside the rectangle and negative if it lies outside. What is the VC dimension of this hypothesis space?
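The shattering definition above lends itself to a brute-force check for small point sets: enumerate every labeling and ask whether some hypothesis in the class is consistent with it. The sketch below is not from the slides; the helper names are illustrative, and the rectangle test relies on the standard observation that an axis-parallel rectangle (inside = positive) can realize a labeling exactly when the bounding box of the positive points contains no negative point. It can be used to probe the closing question.

```python
from itertools import product

def rectangle_consistent(points, labels):
    """Can some axis-parallel rectangle (inside = +1, outside = -1) realize
    this labeling?  True iff there are no positives, or the bounding box of
    the positive points contains no negative point."""
    pos = [p for p, y in zip(points, labels) if y == 1]
    neg = [p for p, y in zip(points, labels) if y == -1]
    if not pos:
        return True  # a rectangle placed away from all points labels everything -1
    xlo, xhi = min(x for x, _ in pos), max(x for x, _ in pos)
    ylo, yhi = min(y for _, y in pos), max(y for _, y in pos)
    return not any(xlo <= x <= xhi and ylo <= y <= yhi for x, y in neg)

def is_shattered(points, consistent_fn):
    """Brute-force shattering check: every one of the 2^n labelings of the
    points must be realizable by some hypothesis in the class."""
    return all(consistent_fn(points, labels)
               for labels in product((-1, 1), repeat=len(points)))

# Try a candidate 4-point "diamond" configuration for the rectangle question
diamond = [(0, 1), (0, -1), (1, 0), (-1, 0)]
print(is_shattered(diamond, rectangle_consistent))  # True for this configuration
```

The same driver works for other hypothesis classes by swapping in a different consistency test, for example a 1-D threshold test that reproduces the two-points-yes, three-points-no argument earlier in the lecture.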