Machine learning lecture 12 Topics Complexity and model selection structural risk minimization Tommi S Jaakkola MIT CSAIL tommi csail mit edu Complexity compression and model selection description length minimum description length principle Tommi Jaakkola MIT CSAIL VC dimension review 2 The number of labelings Shattering A set of classifiers F e g linear classifiers is said to shatter n points x1 xn if for any possible configuration of labels y1 yn we can find h F that reproduces those labels 160 140 log2 of labelings 120 VC dimension The VC dimension of a set of classifiers F is the largest number of points that F can shatter maximized over the choice of the n points Learning We don t expect to learn anything until we have more than dV C training examples and labels this statement will be refined later on 100 80 60 40 20 0 0 n dV C n dV C Tommi Jaakkola MIT CSAIL 3 50 100 150 of training examples 200 of labelings 2n dV C en of labelings dV C Tommi Jaakkola MIT CSAIL 4 Learning and VC dimension Model selection By essentially replacing log M in the finite case with the log of the number of possible labelings by the set of classifiers over n really 2n points we get an analogous result We try to find the model with the best balance of complexity and fit to the training data Theorem With probability at least 1 over the choice of the training set for all h F E h E n h n dV C where n dV C Ideally we would select a model from a nested sequence of models of increasing complexity VC dimension Model 1 F1 VC dim d1 Model 2 F2 VC dim d2 Model 3 F3 VC dim d3 where F1 F2 F3 dV C log 2n dV C 1 log 1 4 n Model selection criterion find the model set of classifiers that achieves the lowest upper bound on the expected loss generalization error Expected error Training error Complexity penalty Tommi Jaakkola MIT CSAIL 5 Tommi Jaakkola MIT CSAIL 6 Structural risk minimization Example We choose the model class Fi that minimizes the upper bound on the expected error di log 2n di 1 log 1 4 E h i E n h i n where h i is the classifier from Fi that minimizes the training error 1 Models of increasing complexity Model 1 K x1 x2 1 xT1 x2 Model 2 K x1 x2 1 xT1 x2 2 Model 3 K x1 x2 1 xT1 x2 3 These are nested i e F1 F2 F3 0 9 0 8 where Fk refers to the set of possible decision boundaries that the model k can represent 0 7 0 6 0 5 0 4 Bound 0 3 Complexity penalty 0 2 0 1 Training error 0 0 10 20 30 VC dimension 40 50 Tommi Jaakkola MIT CSAIL 7 Structural risk minimization example 2 2 1 5 1 5 1 1 0 5 0 5 0 0 0 5 0 5 1 1 5 1 0 5 0 0 5 1 1 5 2 1 1 5 2 2 1 5 1 5 1 1 0 5 0 5 0 0 0 5 0 5 1 1 5 4 th 1 0 5 0 1 0 5 0 0 5 1 1 5 0 5 1 1 5 2 order polynomial 1 1 5 2 training examples n 50 confidence parameter 1 0 5 0 0 5 1 1 5 2 Tommi Jaakkola MIT CSAIL Complexity and margin o 10 Topics The number of possible labelings of points with large margin can be dramatically less than the basic VC dimension would imply o Empirical fit n dV C 0 06 0 5501 0 06 0 6999 0 04 0 9494 0 02 1 2849 8th order polynomial 9 o dV C 3 6 15 45 Structural risk minimization would select the simplest linear model in this case Tommi Jaakkola MIT CSAIL o 8 Structural risk minimization example cont d Number of 0 05 Model 1st order 2nd order 4th order 8th order 2nd order polynomial linear Tommi Jaakkola MIT CSAIL Complexity and model selection structural risk minimization Complexity compression and model selection description length minimum description length principle o o o x x x o o x x x x x The set of separating hyperplaces which attain margin or better for examples within a sphere of radius R has VC dimension bounded by dV C R2 2 Tommi Jaakkola MIT CSAIL 11 Tommi Jaakkola MIT CSAIL 12 Data compression and model selection Data compression and model selection We can alternatively view model selection as a problem of finding the best way of communicating the available data We can alternatively view model selection as a problem of finding the best way of communicating the available data Compression and learning Compression and learning y1 y2 yn y1 y2 x1 x2 xn x1 x2 xn x1 x2 yn xn x1 x2 xn h x Tommi Jaakkola MIT CSAIL 13 Tommi Jaakkola MIT CSAIL 14 Data compression and model selection Data compression and model selection We can alternatively view model selection as a problem of finding the best way of communicating the available data We can alternatively view model selection as a problem of finding the best way of communicating the available data Compression and learning Compression and learning y1 y2 x2 y1 y2 h x x1 yn xn yn h x x1 x2 xn Tommi Jaakkola MIT CSAIL x1 15 x2 x1 x2 xn h x xn Tommi Jaakkola MIT CSAIL 16 Data compression and model selection Data compression and model selection We can alternatively view model selection as a problem of finding the best way of communicating the available data We can alternatively view model selection as a problem of finding the best way of communicating the available data Compression and learning Compression and learning y1 y2 yn x2 h x x1 errors xn y1 y2 h x x1 x2 yn x1 x2 x1 x2 h x xn errors xn h x xn The receiver already knows input examples models we consider Need to communicate model class parameter estimates prediction errors Tommi Jaakkola MIT CSAIL 17 Tommi Jaakkola MIT CSAIL 18 Compression and sequential estimation Compression and sequential estimation We don t have to communicate any real valued parameters if we setup the learning problem sequentially We don t have to communicate any real valued parameters if we setup the learning problem sequentially y1 y2 yn y1 y2 yn xn x1 x2 xn h x 0 x1 x2 xn x1 x2 xn x1 x2 0 default parameter values Tommi Jaakkola MIT CSAIL 19 Tommi Jaakkola MIT CSAIL 20 Compression and sequential estimation Compression and sequential estimation We don t have to communicate any real valued parameters if we setup the learning problem sequentially We don t have to communicate any real valued parameters if we setup the learning problem sequentially y1 y2 yn y1 y2 h x 1 x1 x2 yn xn x1 x2 xn h x 1 xn x1 x2 xn x1 0 default parameter …
View Full Document
Unlocking...