CHAPTER 12 CONSISTENCY OF INDIRECT LEARNING METHODS

Intuition of proof ideas

• The decision rule is not explicit.
• However, we know that the best classifier minimizes some loss function or regularized loss function.
• Thus,
$$P(L(g_n) - L(g^*) > \epsilon) \le P(L(g_n) - L_n(g_n) - L(g^*) + L_n(g^*) > \epsilon) \le 2\,P\Big(\sup_{g \in F} |L_n(g) - L(g)| > \epsilon/2\Big).$$
• We therefore need to control the stochastic error of such loss functions over the model space,
$$\sup_{g \in F} |L_n(g) - L(g)|.$$
• This uses concentration inequalities from empirical process theory and relies on the size of the model space F.

Example 1: Maximum likelihood principle

• Let F be the class of decision functions with values in [0, 1].
• The estimated decision function $\eta_n \in F$ maximizes
$$l_n(\eta) = n^{-1} \sum_{i=1}^n \big[Y_i \log \eta(X_i) + (1 - Y_i) \log(1 - \eta(X_i))\big],$$
so the decision rule is $g_n(x) = I(\eta_n(x) > 1/2)$.
• Consistency result: suppose the true $\eta \in F$ and the bracketing number satisfies $N_{[\,]}(\epsilon, F, L_2(P)) \le (K/\epsilon)^V$. Then $L(g_n) \to L(g^*)$ with probability one.

Steps of the proof

• Define
$$F_\epsilon = \{\tilde\eta \in F : L(\tilde g) > L(g^*) + \epsilon\}, \quad \text{where } \tilde g = I(\tilde\eta > 1/2).$$
• Note
$$P(L(g_n) - L(g^*) > \epsilon) = P(\eta_n \in F_\epsilon) \le P\Big(\sup_{\tilde\eta \in F_\epsilon} \big(l_n(\tilde\eta) - l_n(\eta)\big) > 0\Big).$$
• On the other hand,
$$E[l_n(\eta)] - E[l_n(\tilde\eta)] = KL(\eta, \tilde\eta) \ge E\Big[\eta(X)\big(\sqrt{\eta(X)} - \sqrt{\tilde\eta(X)}\big)^2 + (1 - \eta(X))\big(\sqrt{1 - \eta(X)} - \sqrt{1 - \tilde\eta(X)}\big)^2\Big] \ge E\big[(\eta(X) - \tilde\eta(X))^2\big]/4,$$
and $\epsilon \le L(\tilde g) - L(g^*) \le 2\,E[(\eta(X) - \tilde\eta(X))^2]$, so $KL(\eta, \tilde\eta) \ge \epsilon/8$ on $F_\epsilon$.

Completing the proof

• Hence,
$$P(L(g_n) - L(g^*) > \epsilon) \le P(\eta_n \in F_\epsilon) \le P\Big(\sup_{\tilde\eta \in F_\epsilon} \big\{(l_n(\tilde\eta) - l_n(\eta)) - E[l_n(\tilde\eta) - l_n(\eta)]\big\} > \epsilon/8\Big).$$
• We then apply a concentration inequality from empirical process theory to obtain an upper bound of the form $n^a e^{-n\epsilon^2}$.

Example 2: Empirical risk minimization

• $g_n$ minimizes the empirical risk (the empirical analogue of the Bayes risk)
$$L_n(g) = n^{-1} \sum_{i=1}^n I(Y_i \ne g(X_i))$$
for $g \in C$.
• Strong consistency: if C has a finite VC dimension v, then
$$L(g_n) - \inf_{g \in C} L(g) \to_{a.s.} 0.$$
• Remark: if $g^* \in C$, then $L(g_n) \to L(g^*)$ with probability one.

Proof

• Note
$$L(g_n) - \inf_{g \in C} L(g) \le L(g_n) - L_n(g_n) + \sup_{g \in C} \big(L_n(g) - L(g)\big) \le 2 \sup_{g \in C} |L_n(g) - L(g)|.$$
• Thus,
$$P\Big(L(g_n) - \inf_{g \in C} L(g) > \epsilon\Big) \le P\Big(\sup_{g \in C} |L_n(g) - L(g)| > \epsilon/2\Big) \le 8 (en/v)^v e^{-n\epsilon^2/128}.$$
• The last step uses the concentration inequality for empirical processes.

Concentration inequalities for VC-classes

• For a class F with finite VC dimension v,
$$P\Big(\sup_{f \in F} |P_n f - P f| > \epsilon\Big) \le 8 (en/v)^v e^{-n\epsilon^2/32}.$$

Example 3: Empirical risk minimization with complexity regularization

• Consider a sequence of models $C^{(1)} \subset C^{(2)} \subset \cdots$.
• Each $C^{(k)}$ has a finite VC dimension $v_k$.
• Let $g_n^{(k)}$ be the best decision rule in model $C^{(k)}$, i.e., the minimizer of $L_n(g)$ over $g \in C^{(k)}$.
• The selected rule $\tilde g$ is the rule among $g_n^{(1)}, g_n^{(2)}, \ldots$ that minimizes the penalized criterion
$$L_n(g_n^{(k)}) + R(k, n), \qquad R(k, n) = \sqrt{32 v_k \log(en)/n}$$
(see the code sketch after this list).
• Consistency result: suppose $\lim_k \inf_{g \in C^{(k)}} L(g) = L(g^*)$ and
$$\sum_{k=1}^\infty e^{-v_k} < \infty;$$
then $L(\tilde g) \to L(g^*)$ with probability one.
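To make the selection rule concrete, here is a minimal sketch in Python. It assumes NumPy and classifiers exposing a scikit-learn-style `predict` method; `select_model`, `penalty`, and the `(classifier, vc_dim)` pairing are illustrative names, not part of the slides, and the user must supply the VC dimensions $v_k$.

```python
import numpy as np

def penalty(v_k, n):
    """Complexity penalty R(k, n) = sqrt(32 * v_k * log(e * n) / n) from the slides."""
    return np.sqrt(32.0 * v_k * np.log(np.e * n) / n)

def select_model(candidates, X, y):
    """Return the classifier minimizing L_n(g_n^(k)) + R(k, n) over the model sequence.

    `candidates` is a list of (classifier, vc_dim) pairs, one per model class
    C^(k); each classifier is assumed to already minimize the empirical risk
    within its own class.
    """
    n = len(y)
    best_score, best_clf = np.inf, None
    for clf, v_k in candidates:
        emp_risk = np.mean(clf.predict(X) != y)   # L_n(g_n^(k)), 0-1 empirical risk
        score = emp_risk + penalty(v_k, n)        # penalized criterion
        if score < best_score:
            best_score, best_clf = score, clf
    return best_clf
```

The penalty grows with $v_k$, so the criterion trades off empirical fit against model complexity, which is exactly what the consistency proof below exploits.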
Proof

• Note
$$L(\tilde g) - L(g^*) = \Big[L(\tilde g) - \inf_{k \ge 1} \{L_n(g_n^{(k)}) + R(k, n)\}\Big] + \Big[\inf_{k \ge 1} \{L_n(g_n^{(k)}) + R(k, n)\} - L(g^*)\Big].$$

First term on the RHS

• For the first term,
$$P\Big(L(\tilde g) - \inf_{k \ge 1} \{L_n(g_n^{(k)}) + R(k, n)\} > \epsilon\Big) = P\big(L(\tilde g) - L_n(\tilde g) - R(\tilde k, n) > \epsilon\big)$$
$$\le \sum_{k \ge 1} P\big(L(g_n^{(k)}) - L_n(g_n^{(k)}) > \epsilon + R(k, n)\big) \le \sum_{k \ge 1} P\Big(\sup_{g \in C^{(k)}} |L(g) - L_n(g)| > \epsilon + R(k, n)\Big)$$
$$\le \sum_{k \ge 1} 8 n^{v_k} e^{-n(\epsilon + R(k, n))^2/32} \le 8 e^{-n\epsilon^2/32} \sum_k e^{-v_k}.$$
• The last step uses the concentration inequality for empirical processes over VC-classes.

Second term on the RHS

• First, for n large enough there exists some m such that
$$\inf_{g \in C^{(m)}} L(g) < L(g^*) + \epsilon/4 \quad \text{and} \quad R(m, n) < \epsilon/4.$$
• Therefore,
$$P\Big(\inf_{k \ge 1} \{L_n(g_n^{(k)}) + R(k, n)\} - L(g^*) > \epsilon\Big) \le P\big(L_n(g_n^{(m)}) - L(g_n^{(m)}) > \epsilon/2\big) \le 8 n^{v_m} e^{-n\epsilon^2/128}.$$
• The result follows because $v_m < \infty$.

Remarks

• If $L(g^*) < \lim_k \inf_{g \in C^{(k)}} L(g)$, then the above proof implies
$$L(\tilde g) \to_{a.s.} \lim_k \inf_{g \in C^{(k)}} L(g)$$
instead of $L(g^*)$.

CHAPTER 13 CONVERGENCE RATES

Some results

• The goal is to find an inequality of the form
$$E[L(g_n)] - L(g^*) < c\, n^{-\alpha}.$$
• Unfortunately, this is impossible in general: let $a_n$ decrease to zero with $a_1 < 1/16$. For every sequence of classifiers $\{g_n\}$, there exists a distribution of $(X, Y)$ such that $L(g^*) = 0$ but $E[L(g_n)] \ge a_n$.
• However, such rates do exist if we impose restrictive (smoothness) assumptions on the distribution of $(X, Y)$.
• Many results on accurate convergence rates rely on concentration inequalities, as seen in the consistency proofs.
• Open question: inference on $E[L(g_n)]$?

CHAPTER 14 CLASSIFICATION ERROR ESTIMATION

Resubstitution estimate

• Apparent error: $n^{-1} \sum_{i=1}^n I(\hat g(X_i) \ne Y_i)$.
• It is biased towards zero.
• Using validation data instead gives
$$m^{-1} \sum_{j=1}^m I(\hat g(X_{n+j}) \ne Y_{n+j}).$$
• This requires a large validation set.

Leave-one-out cross-validation estimate

• Assume $g_n$ is a symmetric classifier.
• An upper bound result:
$$E\big[(\hat L_{CV} - L(g_n))^2\big] \le n^{-1} + 6\,P(g_n(X) \ne g_{n-1}(X)).$$
• For the k-NN rule, $P(g_n(X) \ne g_{n-1}(X)) \le k/n$.
• For a kernel rule with kernel $K(x)$ supported in the unit sphere and satisfying $K(x) \ge \beta$ for $\|x\| \le \rho$,
$$P(g_n(X) \ne g_{n-1}(X)) \le \frac{C_p}{\beta \sqrt{n}} (1 + \rho^{-1})^{p/2}.$$

Other error estimates

• Smooth error count:
$$n^{-1} \sum_{i=1}^n \big[Y_i (1 - r(\eta_n(X_i))) + (1 - Y_i)\, r(\eta_n(X_i))\big],$$
where r is an increasing function satisfying $r(1/2 - u) + r(1/2 + u) = 1$.
• It tends to have smaller variability than the resubstitution estimate, which corresponds to $r(u) = I(u \ge 1/2)$.
• Posterior probability estimate:
$$n^{-1} \sum_{i=1}^n \big[I(\eta_n(X_i) \le 1/2)\, \eta_n(X_i) + I(\eta_n(X_i) > 1/2)\, (1 - \eta_n(X_i))\big],$$
or its cross-validation version, which replaces $\eta_n(X_i)$ by $\eta_{n,-i}(X_i)$.
• K-fold cross-validation estimate.
• Bootstrap estimate: use a random sample to construct the classifier, then predict the errors on the observations not selected; repeat many times and take the average.

CHAPTER 15 CONCENTRATION INEQUALITIES

Concentration Inequalities

• What are concentration inequalities?
• They are essentially inequalities for the tail probability of the deviation of a random variable from its mean (or median):
$$P\big(\xi(X_1, \ldots, X_n) - E[\xi(X_1, \ldots, X_n)] \ge x\big) \le \exp\{-x^2/(2\nu)\}.$$
• They include large deviation and moderate deviation inequalities.

Concentration Inequalities

• The oldest one:
$$P(f(Z) - E[f(Z)] \ge x) \le \exp\{-x^2/(2L^2)\},$$
where $Z \sim N(0, I_d)$ and f is Lipschitz continuous with Lipschitz constant L. Also, $E[f(Z)]$ can be replaced by the median.

Technique 1 for obtaining concentration inequalities

• The Cramér-Chernoff method (Chernoff inequality):
$$P(Z \ge x) \le \exp\{-\psi^*_Z(x)\},$$
where $\psi^*_Z(x) = \sup_\lambda \{\lambda x - \log E[e^{\lambda Z}]\}$ (the Cramér transform).

Resulting inequalities

• Hoeffding's inequality: $X_1, \ldots, X_n$ are independent random variables with $a_i \le X_i \le b_i$; then
$$P\Big(\sum_{i=1}^n (X_i - E[X_i]) \ge x\Big) \le \exp\Big\{-\frac{2x^2}{\sum_{i=1}^n (b_i - a_i)^2}\Big\}$$
(a numerical check follows below).
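As a sanity check on Hoeffding's bound, the following sketch (assuming NumPy; the variable names and the Uniform[0,1] choice are illustrative, not from the slides) compares a Monte Carlo estimate of the tail probability of a centered sample mean with the bound. For $X_i \in [0, 1]$ and $x = nt$, the inequality reads $P(\bar X_n - 1/2 \ge t) \le e^{-2nt^2}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, t, reps = 200, 0.03, 20_000

# X_1, ..., X_n i.i.d. Uniform[0, 1], so a_i = 0, b_i = 1, E[X_i] = 1/2.
samples = rng.uniform(0.0, 1.0, size=(reps, n))
deviations = samples.mean(axis=1) - 0.5          # centered sample means
empirical_tail = np.mean(deviations >= t)        # Monte Carlo tail probability
hoeffding_bound = np.exp(-2.0 * n * t**2)        # exp(-2 n t^2)

print(f"empirical tail probability: {empirical_tail:.4f}")
print(f"Hoeffding bound:            {hoeffding_bound:.4f}")
```

For these settings the simulated tail comes out roughly an order of magnitude below the bound, illustrating that the inequality is valid but conservative, which is the price paid for its distribution-free generality.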