CS260: Machine Learning Theory
Lecture 5: Infinite Function Classes and VC Dimension
October 10, 2011
Lecturer: Jennifer Wortman Vaughan

(All CS260 lecture notes build on the scribes' notes written by UCLA students in the Fall 2010 offering of this course. Although they have been carefully reviewed, it is entirely possible that some of them contain errors. If you spot an error, please email Jenn.)

1 Infinite Function Classes

The general learning bounds that we proved in the last few lectures hold for any finite class H. In this lecture, we will consider the case in which H is infinite. We have already seen a couple of examples of infinite function classes: the class of one-dimensional threshold functions we analyzed in class, and the class of axis-aligned rectangles that you analyzed in the first homework assignment.

To handle infinite function classes, we need to replace the ln |H| term in our bounds with some other notion of complexity. To gain some intuition, we begin with a "bad" argument for how this might be done. Suppose we have an H in which each function is parameterized by k real numbers. For example, an n-dimensional axis-aligned rectangle has k = 2n parameters. An n-dimensional linear threshold function has k = n + 1 parameters. Suppose we would like to store a representation of such a function on a computer with finite memory. If we use b bits to represent each real-valued parameter, it takes a total of kb bits to represent a function. Our hypothesis class would then effectively have 2^{kb} different hypotheses.

If we substitute this value into our sample complexity bound, we observe that the number of training examples we need to guarantee low error with high probability is O((1/ε)(k + ln(1/δ))), treating the number of bits b as a constant. In this case, the number of training samples needed is linear in the number of parameters.

While this argument provides a decent heuristic (the complexity of a class is roughly equal to the number of parameters), the argument we have used here is not completely satisfying. In this lecture, we will discuss more accurate ways to measure the complexity of a concept class.

In everything that follows, we will assume that we are working in a model of computation in which we can store and manipulate real numbers in constant space and time. This is crucial if we want to say anything about efficient algorithms in this setting.

1.1 The Growth Function

Let S be a vector of m (arbitrary) examples x1, . . . , xm. Given h ∈ H, we define h(S) = ⟨h(x1), . . . , h(xm)⟩ to be the behavior of h on the examples. There might be other h′ ∈ H with identical behavior on these points, that is, other h′ such that h(S) = h′(S). The behavior set of H on S, denoted ΠH(S), is the set of all possible behaviors of functions h ∈ H on the set S, that is

    ΠH(S) = {h(S) | h ∈ H}.

For binary classification, we have |ΠH(S)| ≤ 2^m for all sets S of size m.

The growth function of H is defined as

    ΠH(m) = max_{S : |S| = m} |ΠH(S)|.

For any H, it measures the maximum number of different ways that functions in H can behave on a set of points of a particular size. Note that ΠH(m) ≤ 2^m.

Let's look at some examples to get intuition.

Example 1. Let H be the class of one-dimensional threshold functions. Given a single point in (0, 1), there are two ways that functions in H can label the point, so ΠH(1) = 2. Given two points, there are three ways that functions in H can label them:

    − −
    − +
    + +

so ΠH(2) = 3. Given 3 distinct points in (0, 1), there are 4 different possibilities for h(S):

    − − −
    − − +
    − + +
    + + +

so ΠH(3) = 4. In general, if we have m points, ΠH(m) = m + 1. Notice that this is significantly smaller than the general bound 2^m.
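To make Example 1 concrete, here is a small brute-force check (our own sketch, not part of the original notes) that enumerates the behavior set of one-dimensional thresholds on a given sample. It assumes the convention h_θ(x) = + if x ≥ θ and − otherwise; the function name threshold_behaviors is ours.

```python
def threshold_behaviors(points):
    """Enumerate the behavior set Pi_H(S) for 1-D threshold functions.

    Assumes h_theta(x) = '+' if x >= theta, '-' otherwise. Since only the
    position of theta relative to the points matters, it suffices to try
    one threshold per region: below all points, between consecutive
    points, and above all points.
    """
    xs = sorted(points)
    candidates = [xs[0] - 1.0]                               # below everything
    candidates += [(a + b) / 2 for a, b in zip(xs, xs[1:])]  # between neighbors
    candidates += [xs[-1] + 1.0]                             # above everything
    return {tuple('+' if x >= t else '-' for x in points) for t in candidates}

behaviors = threshold_behaviors([0.2, 0.5, 0.9])
print(len(behaviors))  # 4, i.e., m + 1 for m = 3, matching Example 1
```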
Example 2. Let H be the class of one-dimensional interval functions. Each function in the class is parameterized by two threshold values, a lower threshold (call this l) and an upper threshold (call this u). A point x is labeled positive if x ∈ [l, u] and negative otherwise.

Figure 1: Intervals

Suppose we are given m distinct points in (0, 1). How many different behaviors can we observe? To answer this question, it doesn't matter where exactly the interval boundaries lie; what matters is which pairs of points they lie between. Our m points define m + 1 regions in (0, 1). The number of different behaviors then equals the (m+1 choose 2) = m(m+1)/2 different ways we can choose a pair of these regions, plus 1 (the case obtained when both boundaries fall in the same region, so all the points are labeled negative), which is O(m^2). This is again far less than the pessimistic exponential bound.

It turns out that the growth function can be used to generate an error bound for infinite function classes. Note that we are back to considering the realizable setting in which there is a perfect target function. We will not discuss the proof of this result in class, but it is given in Chapter 3 of Kearns and Vazirani.

Theorem 1. Consider any concept classes C and H for input space X. Suppose we have an algorithm A that, for any target function c ∈ C, given a sample x1, . . . , xm ∈ X labeled by c, will return a function h ∈ H consistent with this data. Then for any distribution D on X, for any c ∈ C, and for any δ ∈ (0, 1/2), if A is run on a sample of m points drawn i.i.d. from D and labeled by c, then with probability at least 1 − δ,

    err(h) ≤ k · (ln(ΠH(2m)) + ln(2/δ)) / m

for a small constant k.

This bound is meaningless if the growth function is 2^m. However, as we have seen, it is often much smaller. The problem with this bound is that the growth function is a tricky quantity to calculate in general. For this reason, it is useful to introduce another notion of complexity, the VC dimension.

2 VC Dimension

We say that a set S = {x1, . . . , xm} of size m is shattered by the class H if all possible labelings of S are achievable by some h ∈ H. Formally, H shatters S if |ΠH(S)| = 2^m. The VC dimension of H is the cardinality of the largest set S that can be shattered by H.

The VC dimension is convenient because it can be calculated for many classes of interest. To prove that the VC dimension of a class H is d, it is necessary to 1) give an example of a set of d points that can be shattered, and 2) prove that no set of d + 1 points can be shattered.

Example 3. Once again, let H be the class of one-dimensional thresholds. It is easy to see that 1 point can be shattered, but 2 cannot. Thus, VC(H) = 1.

Example 4. Let H be the class of one-dimensional intervals. We observe that 2 points can be shattered, but 3 cannot. (See Figure 2.) Thus, VC(H) = 2.
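The two-step recipe described above (exhibit d points that are shattered, rule out every set of d + 1) can also be sanity-checked by brute force for simple classes. Below is a short sketch of our own (not part of the original notes) for the interval class of Example 4: interval_realizes tests whether some [l, u] produces a given labeling, and shattered tries all 2^m labelings; both helper names are ours.

```python
from itertools import product

def interval_realizes(points, labeling):
    """Return True iff some interval [l, u] produces this '+'/'-' labeling."""
    positives = [x for x, y in zip(points, labeling) if y == '+']
    if not positives:
        return True  # an interval lying below all points labels everything '-'
    l, u = min(positives), max(positives)
    # Any consistent interval must contain [l, u], so the labeling is
    # realizable iff no negative point falls inside that minimal interval.
    return all(not (l <= x <= u)
               for x, y in zip(points, labeling) if y == '-')

def shattered(points):
    """Return True iff intervals realize all 2^m labelings of the points."""
    return all(interval_realizes(points, lab)
               for lab in product('+-', repeat=len(points)))

print(shattered([0.3, 0.6]))       # True: 2 points can be shattered
print(shattered([0.2, 0.5, 0.8]))  # False: ('+', '-', '+') is unrealizable
```

Consistent with Example 4, the check succeeds on every pair of points and fails on every triple of distinct points, matching VC(H) = 2 for intervals.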