6.867 Machine Learning, lecture 14 (Jaakkola)

Cite as: Tommi Jaakkola, course materials for 6.867 Machine Learning, Fall 2006. MIT OpenCourseWare (http://ocw.mit.edu/), Massachusetts Institute of Technology. Downloaded on [DD Month YYYY].

Lecture topics:
• margin and generalization
  – linear classifiers
  – ensembles
• mixture models

Margin and generalization: linear classifiers

As we increase the number of data points, any set of classifiers we are considering may no longer be able to label the points in all possible ways. Such emerging constraints are critical to being able to predict labels for new points. This motivates a key measure of the complexity of a set of classifiers, the Vapnik-Chervonenkis dimension. The VC-dimension is defined as the maximum number of points that the set of classifiers can shatter, i.e., label in all possible ways. The VC-dimension of linear classifiers in the plane is three (see the previous lecture). Note that the definition involves a maximum over the possible points. In other words, we may find fewer than d_VC points that the set of classifiers cannot shatter (e.g., linear classifiers with points lying exactly on a line in 2-d), but there cannot be any set of more than d_VC points that the classifiers can shatter. The VC-dimension of the set of linear classifiers in d dimensions is d + 1, i.e., the number of parameters. This relation to the number of parameters is typical albeit certainly not always true (e.g., the VC-dimension may be infinite for a classifier with a single real parameter!).

The VC-dimension immediately generalizes our previous results for bounding the expected error on the basis of a finite number of classifiers. There are a number of technical steps involved that we won't get into, however. Loosely speaking, d_VC takes the place of the logarithm of the number of classifiers in our set. In other words, we are counting classifiers on the basis of how they can label points, not on their identities in the set. More precisely, we have for any set of classifiers F: with probability at least 1 − δ over the choice of the training set,

    R(f) \le R_n(f) + \epsilon(n, d_{VC}, \delta)   uniformly for all f \in F    (1)

where the complexity penalty is now a function of d_VC = d_VC(F):

    \epsilon(n, d_{VC}, \delta) = \sqrt{ \frac{ d_{VC}\,(\log(2n/d_{VC}) + 1) + \log(1/(4\delta)) }{ n } }    (2)

The result is problematic for kernel methods. For example, the VC-dimension of kernel classifiers with the radial basis kernel is infinite. We can, however, incorporate the notion of margin into the classifier "dimension". One such definition is the V_γ dimension, which measures the VC-dimension under the constraint that distinct labelings have to be obtained with a fixed margin γ. Suppose all the examples fall within an enclosing sphere of radius R. Then, as we increase γ, there will be very few examples we can classify in all possible ways with this constraint (especially when γ → R; cf. Figure 1). Put another way, the VC-dimension of the set of linear classifiers required to attain a prescribed margin can be much lower (decreasing as a function of the margin). In fact, the V_γ dimension for linear classifiers is bounded by R^2/γ^2, i.e., inversely proportional to the squared margin. Note that this result is independent of the dimension of the input examples, and is exactly the mistake bound for the perceptron algorithm!

Figure 1: The set of linear classifiers required to obtain a specific geometric margin has a lower VC-dimension when the examples remain within an enclosing sphere.
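To make the shattering definition concrete, here is a minimal sketch (not from the notes; plain Python with numpy and scipy assumed available) that enumerates every ±1 labeling of a small point set in the plane and tests each labeling for linear separability with a linear-programming feasibility check. Three points in general position pass every labeling, while four points fail at least one, which is the statement that linear classifiers in the plane have VC-dimension three. The margin-1 constraint in the LP is just a convenient encoding of strict separability, since any strictly separating (w, b) can be rescaled to satisfy it.

# Minimal sketch (illustration only): can linear classifiers sign(w . x + b)
# shatter a given 2-d point set?  Test every labeling with an LP feasibility check.
from itertools import product

import numpy as np
from scipy.optimize import linprog


def linearly_separable(X, y):
    """True if some (w, b) satisfies y_i (w . x_i + b) >= 1 for all i."""
    n, d = X.shape
    # Variables (w_1, ..., w_d, b); constraints -y_i (w . x_i + b) <= -1.
    A_ub = -y[:, None] * np.hstack([X, np.ones((n, 1))])
    b_ub = -np.ones(n)
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (d + 1))
    return res.status == 0  # status 0: a feasible point was found


def shattered(X):
    """True if every +/-1 labeling of the rows of X is linearly separable."""
    return all(linearly_separable(X, np.array(labels))
               for labels in product([-1.0, 1.0], repeat=X.shape[0]))


three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # general position
four = np.vstack([three, [[1.0, 1.0]]])                 # add a fourth point
print("3 points shattered:", shattered(three))  # expected: True
print("4 points shattered:", shattered(four))   # expected: False (e.g. the XOR labeling)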
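The complexity penalty (2) is easy to evaluate numerically. The sketch below (an illustration, not part of the notes; the particular values of n, d, R, γ and δ are made up) computes ε(n, d_VC, δ) once with the linear-classifier value d_VC = d + 1 and once with the margin-based quantity R^2/γ^2 plugged in for d_VC, showing how the margin-based quantity can be much smaller in a high-dimensional feature space.

# Sketch: evaluate the complexity penalty epsilon(n, d_VC, delta) of eq. (2).
import math


def complexity_penalty(n, d_vc, delta):
    """sqrt( (d_VC (log(2n/d_VC) + 1) + log(1/(4 delta))) / n ), as in (2)."""
    return math.sqrt((d_vc * (math.log(2 * n / d_vc) + 1)
                      + math.log(1.0 / (4 * delta))) / n)


n, delta = 10_000, 0.05
d = 1_000                     # dimension of the feature space (illustrative)
R, gamma = 1.0, 0.1           # enclosing radius and margin (illustrative)

print("d_VC = d + 1:        ", complexity_penalty(n, d + 1, delta))
print("d_VC ~ R^2 / gamma^2:", complexity_penalty(n, (R / gamma) ** 2, delta))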
The previous generalization guarantees can be used with the V_γ dimension as well, so long as we replace the training error with margin violations, i.e., we count the fraction of examples that cannot be separated with margin at least γ (a short counting sketch is given after the ensemble discussion below).

Margin and generalization: ensembles

An ensemble classifier can be written as a convex combination of simpler base classifiers,

    h_m(x) = \sum_{j=1}^{m} \alpha_j\, h(x; \theta_j)    (3)

where \alpha_j \ge 0 and \sum_{j=1}^{m} \alpha_j = 1. Boosting generates such ensembles but does not normalize the coefficients α_j to sum to one; the normalization can easily be performed after the fact. We are interested in understanding the complexity of such ensembles and how the generalization guarantees depend on the voting margin achieved, e.g., through the boosting algorithm. Note that our discussion here will not refer to how such ensembles are generated.

Let's start by defining what the ensembles are not. They are not decision trees. A decision (classification) tree is a method of recursively partitioning the set of examples into regions such that within each region the examples have labels that are as uniform as possible. The partitioning in a decision tree could be based on the same type of decision stumps as we have used for the ensemble. In the ensemble, however, the domain of every stump is the whole space. In other words, you cannot restrict the application of a stump to a specific partition; in the ensemble, each stump contributes to the classification of all the examples.

How powerful are ensembles based on decision stumps? To understand this further, let's show how we can shatter any n points with 2n stumps, even in one dimension. It suffices to show that we can find an ensemble with 2n stumps that reproduces any specific labeling y_1, ..., y_n of n points x_1, ..., x_n (now real numbers). To do so, we will construct an ensemble of two stumps that reproduces the label y_t for x_t without affecting the classification of the other training examples. If ε is less than the smallest distance between any two training examples, then

    h_{pair}(x; x_t, y_t) = \frac{1}{2}\,\mathrm{sign}\big( y_t (x - x_t + \epsilon) \big) + \frac{1}{2}\,\mathrm{sign}\big( -y_t (x - x_t - \epsilon) \big)    (4)

has value y_t within the interval [x_t − ε, x_t + ε] and is zero everywhere else. Thus, setting α_t = 1/n,

    h_{2n}(x) = \sum_{t=1}^{n} \alpha_t\, h_{pair}(x; x_t, y_t)    (5)

has the correct sign for all the training examples. The ensemble of 2n components therefore has VC-dimension at least n. Ensembles are powerful as classifiers in this sense, and their VC-dimension poorly explains their success in practice. Each example in the above construction only has a very low voting margin ...
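For the margin-violation count mentioned at the start of this section, here is a small sketch (illustrative only; the data, classifier, and thresholds are made up, not taken from the notes). It computes, for a linear classifier (w, b), the fraction of training examples whose geometric margin y (w·x + b)/‖w‖ falls below γ; with γ = 0 this reduces to the training error (up to points lying exactly on the boundary).

# Sketch: fraction of examples violating a geometric margin gamma for a linear
# classifier sign(w . x + b); this count replaces the training error in the V_gamma bound.
import numpy as np


def margin_violation_rate(X, y, w, b, gamma):
    """Fraction of examples with y (w . x + b) / ||w|| < gamma."""
    margins = y * (X @ w + b) / np.linalg.norm(w)
    return float(np.mean(margins < gamma))


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
w_true = np.array([1.0, -1.0])
y = np.sign(X @ w_true + 0.1 * rng.normal(size=200))  # slightly noisy labels (made up)

w, b = w_true, 0.0
for gamma in (0.0, 0.1, 0.5):
    print(f"gamma = {gamma}: violation rate = {margin_violation_rate(X, y, w, b, gamma):.3f}")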
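As a concrete reading of (3), the following sketch evaluates a convex combination of decision stumps and shows the after-the-fact normalization of boosting coefficients mentioned above. The stump parameterization h(x; θ) = sign(s (x_k − b)) and all numerical values are assumptions chosen for illustration, not definitions taken from the notes.

# Sketch: an ensemble as a convex combination of decision stumps, eq. (3).
import numpy as np


def stump(x, k, b, s):
    # Illustrative stump h(x; theta) with theta = (coordinate k, threshold b, sign s).
    return float(np.sign(s * (x[k] - b)))


def ensemble(x, alphas, stumps):
    """h_m(x) = sum_j alpha_j h(x; theta_j), with alpha_j >= 0 summing to one."""
    return sum(a * stump(x, *theta) for a, theta in zip(alphas, stumps))


raw_alphas = np.array([0.8, 0.5, 1.2])     # e.g. unnormalized boosting coefficients
alphas = raw_alphas / raw_alphas.sum()     # normalize after the fact

stumps = [(0, 0.0, +1), (1, 0.5, -1), (0, -1.0, +1)]   # (k, threshold, sign), made up
x = np.array([0.3, 0.7])
vote = ensemble(x, alphas, stumps)
print("ensemble output:", vote, "-> predicted label:", int(np.sign(vote)))

Because the coefficients sum to one, h_m(x) lies in [−1, 1], and y h_m(x) is the voting margin of a labeled example (x, y) referred to in the text.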
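The construction in (4)–(5) can also be checked directly. The sketch below (an illustration of the argument, not code from the notes) builds h_pair for each training point, forms the 2n-stump ensemble with α_t = 1/n, and verifies that its sign reproduces every possible labeling of n distinct points on the real line.

# Sketch: verify that the 2n-stump construction of eqs. (4)-(5) reproduces any
# labeling of n distinct real-valued points.
from itertools import product

import numpy as np


def h_pair(x, x_t, y_t, eps):
    """Eq. (4): value y_t on [x_t - eps, x_t + eps], zero elsewhere."""
    return 0.5 * np.sign(y_t * (x - x_t + eps)) + 0.5 * np.sign(-y_t * (x - x_t - eps))


def h_2n(x, xs, ys, eps):
    """Eq. (5) with alpha_t = 1/n."""
    n = len(xs)
    return sum(h_pair(x, x_t, y_t, eps) / n for x_t, y_t in zip(xs, ys))


rng = np.random.default_rng(0)
xs = np.sort(rng.uniform(0.0, 10.0, size=8))   # n = 8 distinct points
eps = 0.5 * np.min(np.diff(xs))                # less than the smallest gap

ok = all(
    all(np.sign(h_2n(x_t, xs, labels, eps)) == y_t for x_t, y_t in zip(xs, labels))
    for labels in product([-1.0, 1.0], repeat=len(xs))
)
print("all 2^n labelings reproduced:", ok)     # expected: True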

