Statistical Learning Theory

The specific question we address here is: what are the theoretical bounds on the error rate of h on new data points?

General Assumptions (Noise-Free Case)

1. Assumptions on the data S: Examples are i.i.d. (independently, identically distributed), generated according to a probability distribution D(x) and labeled according to an unknown function y = f(x) (classification).
2. Assumptions on the learning algorithm: The learning algorithm is given the m examples in S and outputs a hypothesis h ∈ H that is consistent with S (i.e., it correctly classifies them all).
3. Goal: h should fit new examples well (low error rate) that are drawn according to the same distribution D(x):

   \text{error}(h, f) = P_{D(x)}[ f(x) \neq h(x) ]

PAC (Probably Approximately Correct) Learning

We allow algorithms to fail with probability δ. Procedure:

1. Draw m random samples S.
2. Run a learning algorithm and generate h.

Since S is only an i.i.d. sample, we cannot guarantee that the data will be representative. We want to ensure that (1 − δ) of the time the hypothesis error is less than ε:

   P_{D^m}[ \text{error}(f, h) > \epsilon ] < \delta

Example: we want to obtain a 90% correct hypothesis (ε = 0.1) 95% of the time (δ = 0.05).

Case 1: Finite Hypothesis Spaces

Assume H is finite. Consider h_1 ∈ H such that error(h_1, f) > ε ("bad"). Given one training example (x_1, y_1), the probability that h_1 classifies it correctly is

   P[ h_1(x_1) = y_1 ] \le 1 - \epsilon

Given m training examples (x_1, y_1), ..., (x_m, y_m), the probability that h_1 classifies all of them correctly is

   P[ h_1(x_1) = y_1 \wedge \cdots \wedge h_1(x_m) = y_m ] \le (1 - \epsilon)^m

Now assume we have a second hypothesis h_2 that is also bad. What is the probability that either h_1 or h_2 will be consistent with all m examples?

   P_{D^m}[ h_1\ \text{correct} \vee h_2\ \text{correct} ]
     = P_{D^m}[ h_1\ \text{correct} ] + P_{D^m}[ h_2\ \text{correct} ] - P_{D^m}[ h_1\ \text{correct} \wedge h_2\ \text{correct} ]
     \le P_{D^m}[ h_1\ \text{correct} ] + P_{D^m}[ h_2\ \text{correct} ]
     \le 2 (1 - \epsilon)^m

Therefore, for k bad hypotheses, the probability that any one of them is consistent is at most k (1 - ε)^m. Since k ≤ |H|, this is at most |H| (1 - ε)^m. Using the inequality (1 - ε) ≤ e^{-ε} for 0 ≤ ε ≤ 1,

   |H| (1 - \epsilon)^m \le |H| e^{-\epsilon m}

Lemma: For a finite hypothesis space H, given a set of m training examples drawn independently according to D, the probability that there exists a hypothesis h ∈ H with true error greater than ε that is consistent with the training examples is less than or equal to |H| e^{-εm}.

Therefore, to make this failure probability less than δ it suffices that |H| e^{-εm} ≤ δ, which is true whenever

   \epsilon \ge \frac{1}{m} \left( \ln|H| + \ln\frac{1}{\delta} \right)

(the Blumer bound; Blumer, Ehrenfeucht, Haussler, and Warmuth, 1987). Therefore, if h ∈ H is consistent with S and all m samples are independently drawn according to D, then with probability 1 − δ the error rate on new data points is bounded by

   \epsilon \le \frac{1}{m} \left( \ln|H| + \ln\frac{1}{\delta} \right)

Example application: Boolean conjunctions over n features
o Each feature x_j has three possibilities: it appears as x_j, as ¬x_j, or is not present. Therefore, for n features, |H| = 3^n.
o The bound becomes \epsilon \le \frac{1}{m} \left( n \ln 3 + \ln\frac{1}{\delta} \right).

Finite Hypothesis Spaces: Inconsistent Hypotheses

If h does not perfectly fit the training data but has error rate ε_S on S, then with probability 1 − δ

   \epsilon \le \epsilon_S + \sqrt{ \frac{\ln|H| + \ln\frac{1}{\delta}}{2m} }

that is, the bound on new data is larger than the error rate observed on S.
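As a quick numeric illustration of the finite-hypothesis-space results above, the following Python sketch evaluates the Blumer bound for the Boolean-conjunction example and also inverts it to get a sufficient number of training examples. The function names and the particular values of n, m, and delta are illustrative assumptions, not part of the original notes.

import math

def blumer_error_bound(m, hypothesis_space_size, delta):
    # Blumer bound: with probability at least 1 - delta, a hypothesis from a
    # finite space H that is consistent with m i.i.d. examples has true error
    # at most (ln|H| + ln(1/delta)) / m.
    return (math.log(hypothesis_space_size) + math.log(1.0 / delta)) / m

def blumer_sample_size(epsilon, hypothesis_space_size, delta):
    # Invert the bound: number of i.i.d. examples sufficient to guarantee
    # error <= epsilon with probability at least 1 - delta.
    return math.ceil((math.log(hypothesis_space_size) + math.log(1.0 / delta)) / epsilon)

if __name__ == "__main__":
    n = 20                 # number of Boolean features (illustrative choice)
    size_H = 3 ** n        # conjunctions: each feature appears, appears negated, or is absent
    delta = 0.05           # allowed failure probability (95% confidence)

    # Error guarantee for a consistent conjunction learned from 1,000 examples:
    # (20 ln 3 + ln 20) / 1000, roughly 0.025.
    print(blumer_error_bound(m=1000, hypothesis_space_size=size_H, delta=delta))

    # Examples sufficient for a 90% correct hypothesis (epsilon = 0.1).
    print(blumer_sample_size(epsilon=0.1, hypothesis_space_size=size_H, delta=delta))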
Case 2: Infinite Hypothesis Spaces

Even if |H| = ∞, H has limited expressive power, so we should still be able to obtain bounds.

Definition: Let S = {x_1, ..., x_m} be a set of m examples. A hypothesis space H can trivially fit S if, for every possible labeling of the examples in S, there exists an h ∈ H that gives this labeling. If so, then H is said to shatter S.

Definition: The Vapnik-Chervonenkis dimension, VC(H), of a hypothesis space H is the size of the largest set of examples that can be trivially fit (shattered) by H. Note: if H is finite, VC(H) ≤ log_2 |H|.

VC Dimension Example

Let H be the set of all intervals on the real line: h(x) = 1 if x is in the interval, and h(x) = 0 if x is NOT in the interval. H can trivially fit (shatter) any set of two points. However, can H trivially fit (shatter) three points? No: no single interval can label the two outer points 1 and the middle point 0. Therefore the VC dimension of H is 2.

Error Bound for Infinite Hypothesis Spaces

Theorem: Suppose that VC(H) = d. Assume that there are m training examples in S and that a learning algorithm finds an h ∈ H with error rate ε_S on S. Then with probability 1 − δ, the error rate ε on new data is

   \epsilon \le \epsilon_S + 2 \sqrt{ \frac{4}{m} \left( d \log\frac{2em}{d} + \log\frac{4}{\delta} \right) }

Choosing h to minimize ε_S is called the Empirical Risk Minimization principle (Vapnik). However, this does not work well for a fixed hypothesis space, because the learning algorithm minimizes only ε_S and ignores the second term of the bound:

Underfitting: every hypothesis h ∈ H has high error ε_S on S. We want to consider a larger hypothesis space.
Overfitting: the learned hypothesis has ε_S ≈ 0, but the second (VC penalty) term of the bound is large. We want to consider smaller hypothesis spaces that have lower d.

Suppose we have a nested series of hypothesis spaces H_1 ⊆ H_2 ⊆ ... ⊆ H_k with corresponding VC dimensions d_1 ≤ d_2 ≤ ... ≤ d_k and training errors ε_S^1 ≥ ε_S^2 ≥ ... ≥ ε_S^k. Then you should use the Structural Risk Minimization principle (Vapnik): choose the hypothesis space H_k that minimizes the combined error bound

   \epsilon_S^k + 2 \sqrt{ \frac{4}{m} \left( d_k \log\frac{2em}{d_k} + \log\frac{4}{\delta} \right) }
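As a rough sketch of how Structural Risk Minimization could be applied, the Python code below scores a nested family of hypothesis spaces by training error plus the VC penalty term from the bound above and picks the space with the smallest combined bound. The function names and the particular d_k, ε_S^k, m, and δ values are hypothetical, chosen only to illustrate the underfitting/overfitting trade-off.

import math

def vc_penalty(m, d, delta):
    # Second term of the VC-dimension bound above:
    # 2 * sqrt((4/m) * (d * log(2*e*m/d) + log(4/delta))).
    return 2.0 * math.sqrt((4.0 / m) * (d * math.log(2 * math.e * m / d)
                                        + math.log(4.0 / delta)))

def structural_risk_minimization(train_errors, vc_dims, m, delta):
    # Combined bound for each nested space H_k, then pick the minimizer.
    bounds = [eps_s + vc_penalty(m, d, delta)
              for eps_s, d in zip(train_errors, vc_dims)]
    best_k = min(range(len(bounds)), key=lambda k: bounds[k])
    return best_k, bounds

if __name__ == "__main__":
    # Hypothetical nested spaces H_1 through H_4: training error falls as the
    # VC dimension grows, while the penalty term rises.
    train_errors = [0.30, 0.12, 0.05, 0.01]   # epsilon_S^k (made up)
    vc_dims = [2, 10, 50, 500]                # d_k (made up)
    best_k, bounds = structural_risk_minimization(train_errors, vc_dims,
                                                  m=100000, delta=0.05)
    print(f"chosen space: H_{best_k + 1}")
    print("combined bounds:", [round(b, 3) for b in bounds])

With these made-up numbers an intermediate space gives the smallest combined bound: the larger spaces drive ε_S down but push the VC penalty up, which is exactly the trade-off SRM is meant to balance.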

