CS281B/Stat241B (Spring 2008) Statistical Learning Theory Lecture: 18ΠF(n) for parameterized F , covering numbers, Rn(F )Lecturer: Peter Bartlett Scribe: Eilyan Bitar1 ΠF(n) for parameterized FF = {x → sign(f(a, x)) | a ∈ Rd, f : Rd× R → R}For a family of classifiers F , linear in a, we have dv c(F ) = d, where d = the number of parameters.Example. f(w, x) = sin(wx) ⇒ dv c(F ) = ∞, even though f is smooth.Set w = πc, where c has a binary representation 0.b1b2....bn1.Set xi= 2ifor i = 1, ..., n. Thensin(wxi) = sin(2i× π × 0.b1....bn1)= sin(π × b1....bi.bi+1....bn1)= sin(π × bi.b1+1....bn1)which implies that sign(sin(wxi)) = bi. Hence, we can always find a set of size n ∀n.Example (Neural Nets).f(θ, x) =kXi=1αiσ(βTix)|{z }squashingfunction+αoFor what σ : R → [0, 1] is dv c(F ) < ∞ ?For instance, if σ(α ) =11 + e−α+ cα3e−α2sin(α)| {z }Looks like a sigmoid buthas a sinusoid hidden in it, we have dv c(F ) = ∞. Take note that σ is convex left ofzero and concave right of zero.12 ΠF(n) for parameterized F , covering numbers, Rn(F )Consider the function h : Rd|{z}a× Rm|{z}x→ {+ − 1} that can be computed by an algorithm that takes as input,(a, x) ∈ Rd× Rm, and returns as h(a, x) after ≤ t operations:• arithmetic, (+, −, ×, ÷)• conditionals (<, >, ≤, ≥)• outputs ±1Definition. For a class, F , of real valued functions oncont.in az}|{Rd×X , we say h is a k − combination of sign(F )if:H = {x → g(sign(f1(a, x)), ...., sign(fk(a, x))) | a ∈ Rd} for fixed g : {±1}k→ {±1} and f1, ...., fk∈ F .E.g. For a t − step computable h, we have a 2t-combination of sign(F ) for F = polynomials of degree ≤ 2t.Theorem 1.1. For H a k − combination of sign(F ),ΠH(n) ≤dXi=0knimax{fj}∈F,{xj}∈XCCi\j=1{a | fj(a, xj) = 0}|{z }number of connected componentsin the solution setExample. Linear threshold function (1 − combination of sign(F ))• fjis linear in a.• CCi\j=1{a | fj(a, xj) = 0}| {z }defines a subspace= 0 or 1Corollary 1.2. For F , polynomialy parameterized, with degree ≤ m, we haveΠH(n) ≤ 22enkmdddv c(H) ≤ 2d log (2ekm)ΠF(n) for parameterized F , covering numbers, Rn(F ) 3• Hence, t − step computable, H has dv c(H) ≤ 4d(t + 2)(using ΠH(n) < 2n⇒ dv c(H) < n).Note: With the addition of exponentials in the model of computation, we have dv c(H) = O(t2d2).Proof. Proof idea of previous theorem.• ΠH(n) = max{|H|S| : S ⊆ X , |S| = n}• Zij= {a | fi(a, xj) = 0}, assume regular intersections between these subspaces.Lemma 1.3 (Warren 1960).CCRd−[i,jZij≤XI⊆{(i,j)}CC \i∈IZi!Summarize: dv c(H) = O(dt) for t − step computeable h: 2t− combination of sign(F ) for F = polynomialwith degree ≤ 2t.2 Covering NumbersDefinition. For a metric space, (S, ρ), and a subspace, T ⊆ S, we say thatˆT is an ε − coverof T if∀ t ∈ T, ∃ˆt ∈ˆT such that ρ(t,ˆt) < ε.Definition. The ε − covering numberof (T, ρ):N(ε, T, ρ) = min{|ˆT | :ˆT is an ε − cover of T }.Note: Entrop y := log N (ε, T, ρ)Example. T ⊆ [0, 1]nis a d-dimensional subspace. A bound on the covering number for this subspace canbe found in terms of a uniform grid of ε-balls over the subspace, i.e.N(ε, T, L2(Pn)) ≤1εd4 ΠF(n) for parameterized F , covering numbers, Rn(F )Consider,• F ⊆ [−1, 1]X• S = {x1, ...., xn} ⊆ X .• F|s= {(f(x1), ...., f(xn) | f ∈ F } ⊆ [−1, 1]n.• L2(ˆP ), ρ(u, v) =1nPi(ui− vi)21/2.Theorem 2.1.ˆRn(F ) ≤ infα>0s2 log(N(α, F, L2(ˆP ))n+ αProof. Fix α, α-coverˆF ofF .ˆRn(F ) = Eεsupf∈F1nnXi=1εif(xi)= E supˆf∈ˆFsupf∈F ∩Bα(ˆf)1nXεiˆf(xi) +1nXεi(f(xi) −ˆf(xi))| {z }h ε|{z}||ε||=1,f −ˆf|{z}k·k≤αiL2( ˆp)≤ E"supˆf∈ˆF1nXεiˆf(xi)+ α#Note:• F =Sˆf∈ˆF(F ∩ Bα(ˆf))• |ˆF | = N(α, F, L2(ˆP ))⇒ log N(α, F ) = d log(1/α) for the linear case.Set α =1√n, ⇒ Rn(F ) ≤q2d
View Full Document