6 867 Machine learning lecture 14 Jaakkola 1 Lecture topics margin and generalization linear classi ers ensembles mixture models Margin and generalization linear classi ers As we increase the number of data points any set of classi ers we are considering may no longer be able to label the points in all possible ways Such emerging constraints are critical to be able to predict labels for new points This motivates a key measure of complexity of the set of classi ers the Vapnik Chervonenkis dimension The VC dimension is de ned as the maximum number of points that a classi er can shatter The VC dimension of linear classi ers on the plane is three see previous lecture Note that the de nition involves a maximum over the possible points In other words we may nd less than dV C points that the set of classi ers cannot shatter e g linear classi ers with points exactly on a line in 2 d but there cannot be any set of more than dV C points that the classi er can shatter The VC dimension of the set of linear classi ers in d dimensions is d 1 i e the number of parameters This relation to the number of parameters is typical albeit certainly not always true e g the VC dimension may be in nite for a classi er with a single real parameter The VC dimension immediately generalizes our previous results for bounding the expected error from a nite number of classi ers There are a number of technical steps involved that we won t get into however Loosely speaking dV C takes the place of the logarithm of the number of classi ers in our set In other words we are counting the number of classi ers on the basis of how they can label points not based on their identities in the set More precisely we have for any set of classi ers F with probability at least 1 over the choice of the training set R f Rn f n dV C uniformly for all f F where the complexity penalty is now a function of dV C dV C F dV C log 2n dV C 1 log 1 4 n dV C n 1 2 Cite as Tommi Jaakkola course materials for 6 867 Machine Learning Fall 2006 MIT OpenCourseWare http ocw mit edu Massachusetts Institute of Technology Downloaded on DD Month YYYY 6 867 Machine learning lecture 14 Jaakkola 2 The result is problematic for kernel methods For example the VC dimension of kernel classi ers with the radial basis kernel is We can however incorporate the notion of margin in the classi er dimension One such de nition is V dimension that measures the VC dimension with the constraint that distinct labelings have to be obtained with a xed margin Suppose all the examples fall within an enclosing sphere of radius R Then as we increase there will be very few examples we can classify in all possible ways with this constraint especially when R cf Figure 1 Put another way the VC dimension of a set of linear classi ers required to attain a prescribed margin can be much lower decreasing as a function of the margin In fact V dimension for linear classi ers is bounded by R2 2 i e inversely proportional to the squared margin Note that this result is independent of the dimension of input examples and is exactly the mistake bound for the perceptron algorithm o o o o o o o x x x o o x x x x x Figure 1 The set of linear classi ers required to obtain a speci c geometric margin has a lower VC dimension when the examples remain within an enclosing sphere The previous generalization guarantees can be used with V dimension as well so long as we replace the training error with margin violations i e we count the fraction of examples that cannot be separated with margin at least Margin and generalization ensembles An ensemble classi er can be written as a convex combination of simpler base classi ers hm x m j h x j 3 j 1 Cite as Tommi Jaakkola course materials for 6 867 Machine Learning Fall 2006 MIT OpenCourseWare http ocw mit edu Massachusetts Institute of Technology Downloaded on DD Month YYYY 6 867 Machine learning lecture 14 Jaakkola 3 where j 0 and m j 1 j 1 Boosting generates such ensembles but does not normalized the coe cients j to sum to one The normalization can be easily performed after the fact We are interested in understanding the complexity of such ensembles and how generalization guarantees depends on the voting margin achieved e g through boosting algorithm Note that our discussion here will not refer to how such ensembles are generated Let s start by de ning what the ensembles are not They are not decision trees A decision classi cation tree is a method of recursively partitioning the set of examples into regions such that within each region the examples would have as uniform labels as possible The partitioning in a decision tree could be based on the same type of decision stumps as we have used for the ensemble In the ensemble however the domain for all the stumps is the whole space In other words you cannot restrict the application of the stump within a speci c partition In the ensemble each stump contributes to the classi cation of all the examples How powerful are ensembles based on the decision stumps To understand this further let s show how we can shatter any n points with 2n stumps even in one dimensions It su ces to show that we can nd an ensemble with 2n stumps that reproduces any speci c labeling y1 yn of n points x1 xn now real numbers To do so we will construct an ensemble of two stumps to reproduce the label yt for xt but without a ecting the classi cation of other training examples If is less than the smallest distance between any two training examples then 1 1 hpair x xt yt sign yt x xt sign yt x xt 2 2 4 has value yt within interval xt xt is zero everywhere else Thus setting t 1 n h2n x t hpair x xt yt 5 t 1 has the correct sign for all the training examples The ensemble of 2n components therefore has VC dimension at least n Ensembles are powerful as classi ers in this sense and their VC dimension poorly explains their success in practice Each example in the above construction only has a very low voting margin 1 n however Perhaps we can similarly re ne the analysis to incorporate the voting margin as we did above with linear classi ers and the geometric margin The key idea is to reduce an ensemble with many components to a coarse ensemble with few components but one that nevertheless classi es the examples in the same way When the original ensemble achieves a large voting margin this is indeed possible and the size …
View Full Document
Unlocking...