EECS 281B / STAT 241B: Advanced Topics in Statistical Learning        Spring 2009
Lecture 14 — March 9
Lecturer: Martin Wainwright        Scribe: Nicholas Hay

Note: These lecture notes are still rough, and have only been mildly proofread.

14.1 More on shattering and VC dimension

Given a class \mathcal{A} of subsets, its shattering coefficients are given by

    s(\mathcal{A}, n) = \max_{z_1, \dots, z_n} \operatorname{card}\{ A \cap \{z_1, \dots, z_n\} \mid A \in \mathcal{A} \},

and its VC dimension by V_{\mathcal{A}} = \sup\{ n \mid s(\mathcal{A}, n) = 2^n \}.

Example: The class of one-dimensional half-spaces \mathcal{A}_1 = \{ (-\infty, a] \mid a \in \mathbb{R} \} has s(\mathcal{A}_1, n) = n + 1, and so V_{\mathcal{A}_1} = 1. The class of half-open intervals \mathcal{A}_2 = \{ (b, a] \mid b < a \in \mathbb{R} \} has s(\mathcal{A}_2, n) = n(n+1)/2 + 1, and so V_{\mathcal{A}_2} = 2.

Recall from previous lectures:

Theorem 14.1 (Glivenko–Cantelli). For any class of sets \mathcal{A},

    \mathbb{P}\Big[ \sup_{A \in \mathcal{A}} |\hat{P}_n(A) - P(A)| > \epsilon \Big] \le 8\, s(\mathcal{A}, n) \exp\Big( -\frac{n \epsilon^2}{32} \Big),

where \hat{P}_n(A) = \frac{1}{n} \sum_{i=1}^n I(Z^{(i)} \in A) for iid samples Z^{(i)}, i = 1, \dots, n.

VC dimension and shattering coefficients are closely connected:

1. If V_{\mathcal{A}} = \infty, then s(\mathcal{A}, n) = 2^n for all n;
2. If V_{\mathcal{A}} < \infty, then s(\mathcal{A}, n) \le (n+1)^{V_{\mathcal{A}}} for all n.

The first holds by definition; the second follows as a corollary of the following lemma.

Lemma 14.2 (Sauer). If \mathcal{A} is a class with finite VC dimension V_{\mathcal{A}}, then

    s(\mathcal{A}, n) \le \sum_{i=0}^{V_{\mathcal{A}}} \binom{n}{i}.

Given this, we can derive the (weak) upper bound

    s(\mathcal{A}, n) \le \sum_{i=0}^{V_{\mathcal{A}}} \frac{n!}{i!\,(n-i)!} \le \sum_{i=0}^{V_{\mathcal{A}}} n^i \frac{1}{i!} \le \sum_{i=0}^{V_{\mathcal{A}}} n^i \binom{V_{\mathcal{A}}}{i} = (n+1)^{V_{\mathcal{A}}}.

So far we have computed the VC dimension of classes case by case. We want systematic ways to upper-bound the VC dimension. The following proposition is a first step.

Proposition 14.3. Let \mathcal{G} be a finite-dimensional vector space of functions on \mathbb{R}^d. Then the class of sets

    \mathcal{A}_{\mathcal{G}} = \big\{ \{ x \mid g(x) \ge 0 \} \mid g \in \mathcal{G} \big\}

has VC dimension at most \dim \mathcal{G}.

Proof: We will show that no subset of \mathbb{R}^d of size n = \dim \mathcal{G} + 1 can be shattered by \mathcal{A}_{\mathcal{G}}. Fix n points x^{(1)}, \dots, x^{(n)} \in \mathbb{R}^d, and consider the map L \colon \mathcal{G} \to \mathbb{R}^n defined by

    L(g) = \big( g(x^{(1)}), \dots
, g(x^{(n)}) \big).

This map is linear, so its range is a linear subspace of \mathbb{R}^n of dimension at most \dim \mathcal{G}. Since n > \dim \mathcal{G}, there must exist a nonzero vector \gamma \in \mathbb{R}^n orthogonal to this subspace, i.e. such that

    \sum_{i=1}^n \gamma_i\, g(x^{(i)}) = 0        (14.1)

for all g \in \mathcal{G}. Without loss of generality (replacing \gamma by -\gamma if necessary), suppose \gamma_i < 0 for some i, and observe that equation (14.1) is equivalent to

    \sum_{\{i \mid \gamma_i \ge 0\}} \gamma_i\, g(x^{(i)}) = \sum_{\{i \mid \gamma_i < 0\}} (-\gamma_i)\, g(x^{(i)})        (14.2)

for all g \in \mathcal{G}.

Now proceed by contradiction: suppose that x^{(1)}, \dots, x^{(n)} can be shattered by \mathcal{A}_{\mathcal{G}}. Then there must exist g \in \mathcal{G} such that

    \{ x^{(i)} \mid g(x^{(i)}) \ge 0 \} = \{ x^{(i)} \mid \gamma_i \ge 0 \},

i.e. g(x^{(i)}) \ge 0 exactly when \gamma_i \ge 0. But with this choice of g, the left-hand side of equation (14.2) must be nonnegative, whilst the right-hand side must be strictly negative (since -\gamma_i > 0 and g(x^{(i)}) < 0 for at least one i), which is a contradiction. So we conclude that no subset of \mathbb{R}^d of size n can be shattered. □

Example: Consider the set of half-spaces

    \mathcal{A} = \big\{ \{ x \in \mathbb{R}^d \mid a^T x \ge b \} \mid a \in \mathbb{R}^d, \; b \in \mathbb{R} \big\}.

This class is of the form required by Proposition 14.3, so we need only compute the dimension of the underlying vector space of functions. This dimension is d + 1, as seen from the basis

    g_0(x) = 1, \qquad g_i(x) = x_i \quad \text{for } i = 1, \dots, d.

So V_{\mathcal{A}} \le d + 1.

14.2 Application to binary classification

Suppose we are learning binary classifiers f \colon \mathbb{R}^d \to \{-1, +1\} of the form

    \mathcal{F} = \Big\{ f = \operatorname{sgn}(g) \;\Big|\; g(x) = a_0 + \sum_{i=1}^d a_i x_i, \; a_i \in \mathbb{R} \Big\}.

From the previous example we have V_{\mathcal{F}} \le d + 1. Define the optimal linear risk to be

    R^*_{\mathcal{F}} = \inf_{f \in \mathcal{F}} R(f) = \inf_{f \in \mathcal{F}} \mathbb{P}(Y \ne f(X)).

Suppose \hat{f}_n is selected to minimize the empirical risk given iid samples (x^{(i)}, y^{(i)}), i = 1, \dots, n:

    \hat{f}_n \in \operatorname{argmin}_{f \in \mathcal{F}} \hat{R}_n(f) = \operatorname{argmin}_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n I\big( y^{(i)} \ne f(x^{(i)}) \big).

Corollary 14.4. For all n \in \mathbb{N} and \epsilon > 0 with n \epsilon^2 > 2, the error probability of the empirically optimal classifier \hat{f}_n satisfies

    \mathbb{P}\big[ R(\hat{f}_n) - R^*_{\mathcal{F}} > \epsilon \big] \le 8 (n+1)^{d+1} \exp\Big( -\frac{n \epsilon^2}{128} \Big).

Note that \hat{f}_n is a random classifier: it depends on the particular n iid samples used to train it.
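As a concrete illustration (a minimal sketch, not from the notes; the function name and data are hypothetical), the empirical risk minimizer can be computed exactly in the case d = 1. The classifiers sgn(a_0 + a_1 x) are exactly the two orientations of half-lines, so only the ordering of the samples matters: scanning one candidate threshold per gap between sorted points, for each orientation, searches the whole class.

```python
import numpy as np

def erm_threshold(x, y):
    """Exact empirical risk minimization over 1-d classifiers
    sgn(a_0 + a_1 x), i.e. half-lines {x >= t} and {x <= t}.
    The empirical 0-1 risk is constant between consecutive sorted
    sample points, so scanning one threshold per gap is exhaustive."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    # Candidate thresholds: below all points, each midpoint, above all points.
    cands = np.concatenate(([xs[0] - 1.0], (xs[:-1] + xs[1:]) / 2, [xs[-1] + 1.0]))
    best = None
    for t in cands:
        for sgn in (+1, -1):  # orientation of the half-line
            pred = np.where(sgn * (xs - t) >= 0, 1, -1)
            risk = np.mean(pred != ys)
            if best is None or risk < best[0]:
                best = (risk, t, sgn)
    return best  # (empirical risk, threshold, orientation)

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = np.where(x > 0.3, 1, -1)  # labels realizable by a threshold at 0.3
risk, t, sgn = erm_threshold(x, y)
print(risk, t, sgn)
```

Since the labels here are realizable within the class, the minimizer achieves zero empirical risk with a threshold in the gap around 0.3; with noisy labels the same scan returns the best achievable empirical risk.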
The mild condition n \epsilon^2 > 2 is required by the GC theorem used in proving this corollary (see Step 1 [symmetrization] in the proof of the GC theorem). This is no real restriction, since we care about the behaviour of this bound as n tends to infinity for fixed \epsilon.

Proof: Observe that we can decompose the error into two terms:

    R(\hat{f}_n) - R^*_{\mathcal{F}} = R(\hat{f}_n) - \inf_{f \in \mathcal{F}} R(f)
                                    = \big[ R(\hat{f}_n) - \hat{R}_n(\hat{f}_n) \big] + \big[ \hat{R}_n(\hat{f}_n) - \inf_{f \in \mathcal{F}} R(f) \big].

The first term is easily bounded:

    R(\hat{f}_n) - \hat{R}_n(\hat{f}_n) \le \sup_{f \in \mathcal{F}} \big| R(f) - \hat{R}_n(f) \big|.

For the second, observe that for any f \in \mathcal{F}, since \hat{f}_n minimizes the empirical risk,

    \hat{R}_n(\hat{f}_n) - R(f) \le \hat{R}_n(f) - R(f) \le \sup_{f \in \mathcal{F}} \big| \hat{R}_n(f) - R(f) \big|.

This bounds the second term:

    \hat{R}_n(\hat{f}_n) - \inf_{f \in \mathcal{F}} R(f) = \sup_{f \in \mathcal{F}} \big[ \hat{R}_n(\hat{f}_n) - R(f) \big] \le \sup_{f \in \mathcal{F}} \big| \hat{R}_n(f) - R(f) \big|.

Combining the above with Theorem 14.1, Lemma 14.2, and the bound on the VC dimension of \mathcal{F}, we have

    \mathbb{P}\big[ R(\hat{f}_n) - R^*_{\mathcal{F}} > \epsilon \big] \le \mathbb{P}\Big[ \sup_{f \in \mathcal{F}} \big| \hat{R}_n(f) - R(f) \big| > \epsilon/2 \Big]
        \le 8\, s(\mathcal{F}, n) \exp\Big( -\frac{n \epsilon^2}{128} \Big)
        \le 8 (n+1)^{V_{\mathcal{F}}} \exp\Big( -\frac{n \epsilon^2}{128} \Big)
        \le 8 (n+1)^{d+1} \exp\Big( -\frac{n \epsilon^2}{128} \Big). □

The above result can equivalently be stated in terms of bounds on expectations:

Corollary 14.5. Under the same conditions as Corollary 14.4,

    \mathbb{E}\big[ R(\hat{f}_n) - R^*_{\mathcal{F}} \big] \le 16 \sqrt{ \frac{\log\big( 8e\, s(\mathcal{F}, n) \big)}{2n} } = O\Bigg( \sqrt{ \frac{\log s(\mathcal{F}, n)}{n} } \Bigg).

If V_{\mathcal{F}} < +\infty, then

    \mathbb{E}\big[ R(\hat{f}_n) - R^*_{\mathcal{F}} \big] = O\Bigg( \sqrt{ \frac{V_{\mathcal{F}} \log n}{n} } \Bigg).

This corollary follows by a careful integration of the tail bound, as we will discuss in the next lecture.
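To get a quantitative feel for Corollary 14.4, the tail bound 8(n+1)^{d+1} exp(-n\epsilon^2/128) can be evaluated directly. The following sketch (with arbitrarily chosen d and \epsilon, not values from the notes) finds the smallest sample size n at which the bound becomes nontrivial, i.e. drops below 1:

```python
import math

def vc_bound(n, d, eps):
    # Tail bound from Corollary 14.4: 8 (n+1)^(d+1) exp(-n eps^2 / 128).
    return 8.0 * (n + 1) ** (d + 1) * math.exp(-n * eps ** 2 / 128.0)

d, eps = 2, 0.1  # hypothetical choices: V_F <= d + 1 = 3, accuracy eps = 0.1

# Exponential search for some n with bound < 1, then binary search for the
# smallest such n: the polynomial factor (n+1)^(d+1) is eventually
# overwhelmed by the exponential decay in n.
hi = 1
while vc_bound(hi, d, eps) >= 1.0:
    hi *= 2
lo = hi // 2
while lo + 1 < hi:
    mid = (lo + hi) // 2
    if vc_bound(mid, d, eps) < 1.0:
        hi = mid
    else:
        lo = mid

print(hi, vc_bound(hi, d, eps))  # crossover in the hundreds of thousands
```

The dominant dependence is the 1/\epsilon^2 inside the exponent; varying d only shifts the crossover through the polynomial factor, consistent with the O(\sqrt{V_F \log n / n}) rate of Corollary 14.5.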

