EECS 281B / STAT 241B: Advanced Topics in Statistical Learning, Spring 2009

Lecture 14: March 9
Lecturer: Martin Wainwright    Scribe: Nicholas Hay

Note: These lecture notes are still rough, and have only been mildly proofread.

14.1 More on shattering and VC dimension

Given a class $\mathcal{A}$ of subsets, its shattering coefficients are given by
\[
s(\mathcal{A}, n) = \max_{z_1, \dots, z_n} \operatorname{card}\bigl\{ A \cap \{z_1, \dots, z_n\} \mid A \in \mathcal{A} \bigr\},
\]
and its VC dimension by $V_{\mathcal{A}} = \sup\{ n \mid s(\mathcal{A}, n) = 2^n \}$.

Example: The class of one-dimensional half-lines $\mathcal{A}_1 = \{ (-\infty, a] \mid a \in \mathbb{R} \}$ has $s(\mathcal{A}_1, n) = n + 1$, and so $V_{\mathcal{A}_1} = 1$. The class of half-open intervals $\mathcal{A}_2 = \{ (b, a] \mid b < a \in \mathbb{R} \}$ has $s(\mathcal{A}_2, n) = \frac{n(n+1)}{2} + 1$, and so $V_{\mathcal{A}_2} = 2$.

Recall from previous lectures:

Theorem 14.1 (GC). Given any class of sets $\mathcal{A}$,
\[
\mathbb{P}\Bigl[ \sup_{A \in \mathcal{A}} |\hat{P}_n(A) - P(A)| > \epsilon \Bigr] \le 8\, s(\mathcal{A}, n) \exp\Bigl( -\frac{n\epsilon^2}{32} \Bigr),
\]
where $\hat{P}_n(A) = \frac{1}{n} \sum_{i=1}^n I(Z^{(i)} \in A)$ for i.i.d. samples $Z^{(i)}$, $i = 1, \dots, n$.

VC dimension and shattering coefficients are closely connected:

1. If $V_{\mathcal{A}} = \infty$ then $s(\mathcal{A}, n) = 2^n$ for all $n$.
2. If $V_{\mathcal{A}} < \infty$ then $s(\mathcal{A}, n) \le (n + 1)^{V_{\mathcal{A}}}$ for all $n$.

The first holds by definition; the second is a corollary of the following lemma.

Lemma 14.2 (Sauer). If $\mathcal{A}$ is a class with finite VC dimension $V_{\mathcal{A}}$, then
\[
s(\mathcal{A}, n) \le \sum_{i=0}^{V_{\mathcal{A}}} \binom{n}{i}.
\]

Given this, we can derive the (weak) upper bound
\[
s(\mathcal{A}, n) \le \sum_{i=0}^{V_{\mathcal{A}}} \frac{n!}{i!\,(n-i)!} \le \sum_{i=0}^{V_{\mathcal{A}}} n^i \frac{1}{i!} \le \sum_{i=0}^{V_{\mathcal{A}}} \binom{V_{\mathcal{A}}}{i} n^i = (n + 1)^{V_{\mathcal{A}}},
\]
where the last step uses $1/i! \le \binom{V_{\mathcal{A}}}{i}$ for $i \le V_{\mathcal{A}}$ and the binomial theorem.

So far we have computed the VC dimension of classes case-by-case. We want systematic ways to upper bound the VC dimension. The following proposition is the first.

Proposition 14.3. Let $\mathcal{G}$ be a finite-dimensional vector space of functions on $\mathbb{R}^d$. Then the class of sets
\[
\mathcal{A}_{\mathcal{G}} = \bigl\{ \{ x \mid g(x) \ge 0 \} \mid g \in \mathcal{G} \bigr\}
\]
has VC dimension at most $\dim \mathcal{G}$.

Proof: We will show that no subset of $\mathbb{R}^d$ of size $n = \dim \mathcal{G} + 1$ can be shattered by $\mathcal{A}_{\mathcal{G}}$. Fix $n$ points $x^{(1)}, \dots, x^{(n)} \in \mathbb{R}^d$. Consider the map $L \colon \mathcal{G} \to \mathbb{R}^n$ defined by
\[
L(g) = \bigl( g(x^{(1)}), \dots, g(x^{(n)}) \bigr).
\]
This map is linear, and so its range is a linear subspace of $\mathbb{R}^n$ of dimension at most $\dim \mathcal{G}$. Since $n > \dim \mathcal{G}$, there must exist a nonzero vector $\gamma \in \mathbb{R}^n$ orthogonal to this subspace, i.e. such that
\[
\sum_{i=1}^n \gamma_i \, g(x^{(i)}) = 0 \tag{14.1}
\]
for all $g \in \mathcal{G}$. Without loss of generality suppose $\gamma_i < 0$ for some $i$, and observe that equation (14.1) is equivalent to
\[
\sum_{\{i \mid \gamma_i \ge 0\}} \gamma_i \, g(x^{(i)}) = \sum_{\{i \mid \gamma_i < 0\}} (-\gamma_i) \, g(x^{(i)}) \tag{14.2}
\]
for all $g \in \mathcal{G}$.

Now proceed by contradiction: suppose that $x^{(1)}, \dots, x^{(n)}$ can be shattered by $\mathcal{A}_{\mathcal{G}}$. Then there must exist $g \in \mathcal{G}$ such that
\[
\{ i \mid g(x^{(i)}) \ge 0 \} = \{ i \mid \gamma_i \ge 0 \}.
\]
But with this choice of $g$, the LHS of equation (14.2) must be nonnegative, whilst the RHS must be strictly negative (since $\gamma_i < 0$, and hence $g(x^{(i)}) < 0$, for some $i$), which is a contradiction. So we conclude that no subset of $\mathbb{R}^d$ of size $n$ can be shattered. $\square$

Example: Consider the set of half-spaces
\[
\mathcal{A} = \bigl\{ \{ x \in \mathbb{R}^d \mid a^T x \ge b \} \mid a \in \mathbb{R}^d, \ b \in \mathbb{R} \bigr\}.
\]
This class is of the form required for Proposition 14.3, so we need only compute the dimension of the underlying vector space of functions. This is seen to be $d + 1$ by exhibiting the basis
\[
g_0(x) = 1, \qquad g_i(x) = x_i \quad \text{for } i = 1, \dots, d.
\]
So $V_{\mathcal{A}} \le d + 1$.

14.2 Application to binary classification

Suppose we are learning binary classifiers $f \colon \mathbb{R}^d \to \{-1, +1\}$ of the form
\[
\mathcal{F} = \Bigl\{ f = \operatorname{sgn}(g) \,\Bigm|\, g(x) = a_0 + \sum_{i=1}^d a_i x_i, \ a_i \in \mathbb{R} \Bigr\}.
\]
From the previous example we have $V_{\mathcal{F}} \le d + 1$. Define the optimal linear risk to be
\[
R^*_{\mathcal{F}} = \inf_{f \in \mathcal{F}} R(f) = \inf_{f \in \mathcal{F}} \mathbb{P}\bigl( Y \ne f(X) \bigr).
\]
Suppose $\hat{f}_n$ is selected to minimize the empirical risk given i.i.d. samples $(x^{(i)}, y^{(i)})$, $i = 1, \dots, n$:
\[
\hat{f}_n \in \operatorname*{argmin}_{f \in \mathcal{F}} \hat{R}_n(f) = \operatorname*{argmin}_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n I\bigl( y^{(i)} \ne f(x^{(i)}) \bigr).
\]

Corollary 14.4. For all $n \in \mathbb{N}$ and $\epsilon > 0$ with $n\epsilon^2 > 2$, the error probability of the empirically optimal classifier $\hat{f}_n$ satisfies
\[
\mathbb{P}\bigl[ R(\hat{f}_n) - R^*_{\mathcal{F}} > \epsilon \bigr] \le 8 (n + 1)^{d+1} \exp\Bigl( -\frac{n\epsilon^2}{128} \Bigr).
\]

Note that $\hat{f}_n$ is a random classifier: it depends on the particular $n$ i.i.d. samples used to train it.
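As a quick sanity check on the shattering calculations underlying such VC bounds, the two examples from Section 14.1 can be verified by brute force for small $n$. The sketch below is mine, not part of the notes: it enumerates every subset of $n$ distinct real points that each class can pick out, and checks the counts $s(\mathcal{A}_1, n) = n + 1$ and $s(\mathcal{A}_2, n) = n(n+1)/2 + 1$ against Sauer's bound $(n+1)^{V_{\mathcal{A}}}$.

```python
# Brute-force check of the shattering coefficients from Section 14.1.
# All function names here are illustrative, not from the notes.

def halfline_subsets(points):
    """Subsets of `points` realized by A1 = {(-inf, a] : a in R}.

    Each realizable subset is a prefix of the sorted points (possibly empty).
    """
    pts = sorted(points)
    subsets = {frozenset()}
    for a in pts:
        subsets.add(frozenset(p for p in pts if p <= a))
    return subsets

def interval_subsets(points):
    """Subsets of `points` realized by A2 = {(b, a] : b < a}.

    Each realizable subset is a contiguous run of the sorted points.
    """
    pts = sorted(points)
    subsets = {frozenset()}
    for i in range(len(pts)):
        for j in range(i, len(pts)):
            subsets.add(frozenset(pts[i:j + 1]))
    return subsets

for n in range(1, 8):
    pts = list(range(n))  # any n distinct points maximize the count
    assert len(halfline_subsets(pts)) == n + 1                  # s(A1, n)
    assert len(interval_subsets(pts)) == n * (n + 1) // 2 + 1   # s(A2, n)
    # Consistent with Sauer's lemma: s(A, n) <= (n + 1)^{V_A}
    assert len(halfline_subsets(pts)) <= (n + 1) ** 1           # V_{A1} = 1
    assert len(interval_subsets(pts)) <= (n + 1) ** 2           # V_{A2} = 2
```

Note how the polynomial growth of both counts, in contrast to $2^n$, is exactly what makes the GC tail bound nontrivial for large $n$.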
The mild condition $n\epsilon^2 > 2$ is required by the GC theorem that we use in proving this corollary (see Step 1 [symmetrization] in the proof of the GC theorem). This is no real restriction, since we care about the behaviour of this bound as $n$ tends to infinity for fixed $\epsilon$.

Proof: Observe that we can decompose the error into two terms:
\[
R(\hat{f}_n) - R^*_{\mathcal{F}} = R(\hat{f}_n) - \inf_{f \in \mathcal{F}} R(f) = \bigl[ R(\hat{f}_n) - \hat{R}_n(\hat{f}_n) \bigr] + \bigl[ \hat{R}_n(\hat{f}_n) - \inf_{f \in \mathcal{F}} R(f) \bigr].
\]
The first term is easily bounded:
\[
R(\hat{f}_n) - \hat{R}_n(\hat{f}_n) \le \sup_{f \in \mathcal{F}} \bigl| R(f) - \hat{R}_n(f) \bigr|.
\]
For the second, observe that $\hat{R}_n(\hat{f}_n) \le \hat{R}_n(f)$ for any $f \in \mathcal{F}$, since $\hat{f}_n$ minimizes the empirical risk, and hence uniformly
\[
\hat{R}_n(\hat{f}_n) - R(f) \le \hat{R}_n(f) - R(f) \le \sup_{f \in \mathcal{F}} \bigl| \hat{R}_n(f) - R(f) \bigr|.
\]
Taking the supremum over $f$ bounds the second term:
\[
\hat{R}_n(\hat{f}_n) - \inf_{f \in \mathcal{F}} R(f) = \sup_{f \in \mathcal{F}} \bigl[ \hat{R}_n(\hat{f}_n) - R(f) \bigr] \le \sup_{f \in \mathcal{F}} \bigl| \hat{R}_n(f) - R(f) \bigr|.
\]
Combining the above with Theorem 14.1, Lemma 14.2, and the bound on the VC dimension of $\mathcal{F}$, we have
\begin{align*}
\mathbb{P}\bigl[ R(\hat{f}_n) - R^*_{\mathcal{F}} > \epsilon \bigr]
&\le \mathbb{P}\Bigl[ \sup_{f \in \mathcal{F}} \bigl| \hat{R}_n(f) - R(f) \bigr| > \epsilon/2 \Bigr] \\
&\le 8\, s(\mathcal{F}, n) \exp\Bigl( -\frac{n\epsilon^2}{128} \Bigr) \\
&\le 8 (n + 1)^{V_{\mathcal{F}}} \exp\Bigl( -\frac{n\epsilon^2}{128} \Bigr) \\
&\le 8 (n + 1)^{d+1} \exp\Bigl( -\frac{n\epsilon^2}{128} \Bigr). \qquad \square
\end{align*}

The above result can equivalently be stated in terms of bounds on expectations:

Corollary 14.5. Under the same conditions as Corollary 14.4,
\[
\mathbb{E}\bigl[ R(\hat{f}_n) - R^*_{\mathcal{F}} \bigr] \le 16 \sqrt{ \frac{\log\bigl( 8 e\, s(\mathcal{F}, n) \bigr)}{2n} } = O\Biggl( \sqrt{ \frac{\log s(\mathcal{F}, n)}{n} } \Biggr).
\]
If $V_{\mathcal{F}} < +\infty$, then
\[
\mathbb{E}\bigl[ R(\hat{f}_n) - R^*_{\mathcal{F}} \bigr] = O\Biggl( \sqrt{ \frac{V_{\mathcal{F}} \log n}{n} } \Biggr).
\]
This corollary follows by a careful integration of the tail bound, as we will discuss in the next lecture.
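To get a quantitative feel for these rates, the two bounds can be evaluated numerically. The sketch below is mine (the function names are not from the notes): it computes the logarithm of the tail bound from Corollary 14.4, working in log space to avoid overflow of $(n+1)^{d+1}$, and the expectation bound from Corollary 14.5 with the Sauer-type estimate $s(\mathcal{F}, n) \le (n+1)^{d+1}$ substituted in.

```python
import math

def log_tail_bound(n, d, eps):
    """log of the RHS of Corollary 14.4: 8 (n+1)^(d+1) exp(-n eps^2 / 128).

    Log space avoids overflow of (n+1)^(d+1) and underflow of the
    exponential factor for large n.
    """
    return math.log(8.0) + (d + 1) * math.log(n + 1) - n * eps ** 2 / 128.0

def expectation_bound(n, d):
    """RHS of Corollary 14.5 with s(F, n) replaced by (n+1)^(d+1):
    16 * sqrt(log(8 e s(F, n)) / (2 n)), i.e. O(sqrt(d log n / n)).
    """
    log_s = (d + 1) * math.log(n + 1)
    return 16.0 * math.sqrt((math.log(8.0 * math.e) + log_s) / (2.0 * n))

d, eps = 5, 0.1
# The tail bound is vacuous (log > 0, i.e. bound > 1) for small n,
# then decays exponentially once n eps^2 dominates (d+1) log(n+1).
assert log_tail_bound(10**3, d, eps) > 0
assert log_tail_bound(10**7, d, eps) < -500
# The expectation bound shrinks roughly like sqrt(d log n / n).
assert expectation_bound(10**5, d) < expectation_bound(10**3, d)
```

The crossover from a vacuous to an exponentially small tail bound illustrates the remark above: the condition $n\epsilon^2 > 2$ costs nothing in the regime where the bound is informative.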