MIT 6 867 - Lecture Notes - D1721828

Home> Schools> Massachusetts Institute of Technology> Electrical Engineering and Computer Science (6) > 6 867> Lecture Notes

MIT 6 867 - Lecture Notes

School name Massachusetts Institute of Technology

Course 6 867- Machine Learning

Pages 25

Download Save

Unformatted text preview:

Machine learning lecture 11 Tommi S Jaakkola MIT AI Lab tommi ai mit edu Topics Complexity and model selection learning and VC dimension structural risk minimization Complexity compression and model selection description length minimum description length principle Tommi Jaakkola MIT AI Lab 2 VC dimension review The complexity of a set of classifiers depends on how many different ways we can label n training points x1 xn with classifiers h F In other words this is the number of distinct binary vectors h x1 h x2 h xn 1 1 1 1 1 h1 1 h2 we get by trying out each h F in turn the training points are chosen to maximize this number VC dimension is the largest number of points we can shatter i e generate all possible labelings of the points Tommi Jaakkola MIT AI Lab 3 Learning and VC dimension We don t really learn anything until after we have more than dV C training examples 160 140 log2 of labelings 120 100 80 60 40 20 0 0 50 100 150 of training examples 200 The number of labelings that the set of classifiers can generate over n points increases sub exponentially after n dV C in this case dV C 100 Tommi Jaakkola MIT AI Lab 4 Learning and VC dimension When the VC dimension is finite the probability over the choice of the training set that we would find any h F for which the difference Empirical loss z Expected loss n X z 1 Loss yi h xi E Loss y h x n i 1 is large goes down exponentially fast as a function of the size of the training set n Here Loss y h x 1 if y 6 h x and zero otherwise so called zero one loss This result holds for any underlying probability distribution from which the examples and the labels are generated Tommi Jaakkola MIT AI Lab 5 Extensions complexity and margin The number of possible labelings of points with large margin can be dramatically less than the basic VC dimension o o o o o o o x x x o o x x x x x The set of separating hyperplaces which attain margin or better for examples within a sphere of radius R has VC dimension bounded by dV C R2 2 Tommi Jaakkola MIT AI Lab 6 Model selection We try to find the model with the best balance of complexity and the fit to the training data Ideally we would select a model from a nested sequence of models of increasing complexity Model 1 d1 Model 2 d2 Model 3 d3 where d1 d2 d3 Basic model selection criterion Criterion empirical score Complexity penalty Tommi Jaakkola MIT AI Lab 7 Structural risk minimization In structural risk minimization we define the models in terms of VC dimension or refinements Model 1 dV C d1 Model 2 dV C d2 Model 3 dV C d3 where d1 d2 d3 The selection criterion lowest upper bound on the expected loss Expected loss Empirical loss Complexity penalty Tommi Jaakkola MIT AI Lab 8 Example Models of increasing complexity Model 1 K x1 x2 1 xT1 x2 Model 2 K x1 x2 1 xT1 x2 2 Model 3 K x1 x2 1 xT1 x2 3 These are nested i e F1 F2 F3 where Fk refers to the set of possible decision boundaries that the model k can represent Still need to derive the criterion Tommi Jaakkola MIT AI Lab 9 Bounds on expected loss For simplicity let s look at a single fixed classifier h x and n training points n Expected loss With probability at least 1 over the choice of the training set Empirical loss z sampling penalty Expected loss n z z 1X E Loss y h x Loss yi h xi n n i 1 For the bound to be valid uniformly for all classifiers in the set F we have to include the VC dim Tommi Jaakkola MIT AI Lab 10 Structural risk minimization Finite VC dimension gives us some guarantees about how close the empirical loss is to the expected loss With probability at least 1 over the choice of the training set for all h Fk Empirical loss z Complexity penalty Expected loss n z z 1X E Loss y h x Loss yi h xi n dk n i 1 where dk VC dimension of model set of hypothesis k Confidence parameter probability of failure We find model k that has the lowest bound on the expected loss Tommi Jaakkola MIT AI Lab 11 Structural risk minimization cont d For our zero one loss classification error we can derive the following complexity penalty Vapnik 1995 r dV C log 2n dV C 1 log 1 4 n d n 1 This is an increasing function of dV C 2 Increases as decreases 3 Decreases as a function of n this is not the only choice Tommi Jaakkola MIT AI Lab 12 Structural risk minimization cont d Competition of terms 1 Empirical loss decreases with increasing dV C 2 Complexity penalty increases with increasing dV C 1 0 9 0 8 0 7 0 6 0 5 0 4 Model score 0 3 Complexity penalty 0 2 Empirical fit 0 1 0 0 10 20 30 40 50 We find the minimum of the model score bound Tommi Jaakkola MIT AI Lab 13 Structural risk minimization example 2 2 1 5 1 5 1 1 0 5 0 5 0 0 0 5 0 5 1 1 5 1 0 5 0 0 5 1 1 5 2 1 1 5 2 2 1 5 1 5 1 1 0 5 0 5 0 0 0 5 0 5 1 0 5 0 0 5 1 1 5 4th order polynomial Tommi Jaakkola MIT AI Lab 0 5 0 0 5 1 1 5 2 2nd order polynomial linear 1 1 5 1 2 1 1 5 1 0 5 0 0 5 1 1 5 2 8th order polynomial 14 Structural risk minimization example cont d Number of 0 05 Model 1st order 2nd order 4th order 8th order training examples n 50 confidence parameter dV C 3 6 15 45 Empirical fit Complexity penalty n dV C 0 06 0 5501 0 06 0 6999 0 04 0 9494 0 02 1 2849 Structural risk minimization would select the simplest linear model in this case Tommi Jaakkola MIT AI Lab 15 Topics Complexity compression and model selection description length minimum description length principle Tommi Jaakkola MIT AI Lab 16 Model selection and data compression We can alternatively view model selection as a problem of finding the best way of communicating the available data We have to communicate both the data and the method that we used to compress the data The communication cost in bits depends on how well the model can predict the data as well as how hard it is to describe the model itself complexity Total of bits bits to describe the data given the model bits to describe the model Tommi Jaakkola MIT AI Lab 17 Description length How many …

View Full Document


School:
Email:
New Password:
Confirm Password:

MIT 6 867 - Lecture Notes

Sign up for free to view:

Please select your school