CS281B/Stat241B (Spring 2008)   Statistical Learning Theory   Lecture: 12
AdaBoost, Risk Bounds, Concentration Inequalities
Lecturer: Peter Bartlett   Scribe: Subhransu Maji

1 AdaBoost and Estimates of Conditional Probabilities

We continue our discussion of AdaBoost and derive risk bounds for the classifier. Recall that for a function $f$, we have the following relationship between the excess risk and the excess $\phi$-risk for a loss function $\phi$:

$$\Psi\bigl(R(f) - R^*\bigr) \le R_\phi(f) - R^*_\phi,$$

where $R^*$ is the optimal Bayes risk, $R^*_\phi$ is the risk of the optimal $f$, i.e. $R^*_\phi = \inf_f R_\phi(f)$, $H(\eta)$ is the function

$$H(\eta) = \inf_{\alpha \in \mathbb{R}} \bigl[\eta\,\phi(\alpha) + (1-\eta)\,\phi(-\alpha)\bigr],$$

and $\Psi(\theta)$ is the function

$$\Psi(\theta) = H^-\!\left(\frac{1+\theta}{2}\right) - H\!\left(\frac{1+\theta}{2}\right) = \phi(0) - \inf_{\alpha \in \mathbb{R}}\left[\frac{1+\theta}{2}\,\phi(\alpha) + \frac{1-\theta}{2}\,\phi(-\alpha)\right].$$

In the context of AdaBoost the loss function $\phi(\alpha) = e^{-\alpha}$ is convex and classification calibrated. Thus,

$$H(\eta) = \inf_{\alpha \in \mathbb{R}} \bigl[\eta e^{-\alpha} + (1-\eta) e^{\alpha}\bigr].$$

Differentiating with respect to $\alpha$ and setting the derivative to zero gives the optimal

$$\alpha^*(\eta) = \ln\sqrt{\frac{\eta}{1-\eta}}.$$

This suggests that if we could choose $f(x)$ separately for each $x$, it would be a monotonically transformed version of the conditional probability (see next section). Plugging this $\alpha^*$ into $H$ yields

$$H(\eta) = 2\sqrt{\eta(1-\eta)},$$

which is concave and symmetric around $1/2$. Then $\Psi(\theta)$ simplifies to

$$\Psi(\theta) = 1 - \sqrt{(1+\theta)(1-\theta)} = 1 - \sqrt{1-\theta^2}.$$

Finally, plugging this into the original inequality yields

$$1 - \sqrt{1 - (R(f) - R^*)^2} \le R_\phi(f) - R^*_\phi.$$

Examining the Taylor series of the left-hand side about 0 shows that this is equivalent, for some constant $c$, to

$$R(f) - R^* \le c\sqrt{R_\phi(f) - R^*_\phi}$$

when the excess $\phi$-risk is sufficiently small. Thus, driving the excess $\phi$-risk to zero drives the excess discrete (0-1) risk to zero as well, which justifies AdaBoost's use of this particular convex loss function.

2 Relationship to logistic regression

It turns out that we can interpret the value of $F(x)$ (where $F$ is the boosted classifier returned by AdaBoost) as a transformed estimate of $\Pr(Y = 1 \mid X = x)$. Consider a logistic model where

$$\Pr(Y = 1 \mid X = x) = \frac{1}{1 + e^{-2f(x)}} = \frac{e^{f(x)}}{e^{f(x)} + e^{-f(x)}},$$

a rescaled version of the logistic function. In this model, the log loss (negative log likelihood) takes the form

$$-\ln \prod_{i=1}^n \Pr(Y = y_i \mid X = x_i) = -\sum_{i:\,y_i=1} \ln\frac{1}{1+e^{-2f(x_i)}} - \sum_{i:\,y_i=-1} \ln\left(1 - \frac{1}{1+e^{-2f(x_i)}}\right)$$
$$= \sum_{i:\,y_i=1} \ln\bigl(1+e^{-2f(x_i)}\bigr) + \sum_{i:\,y_i=-1} \ln\bigl(1+e^{2f(x_i)}\bigr) = \sum_{i=1}^n \ln\bigl(1+e^{-2 y_i f(x_i)}\bigr).$$

Thus, the maximum likelihood logistic regression solution attempts to minimize the sample average of $\phi(\alpha) = \ln(1+e^{-2\alpha})$. This is closely related to AdaBoost, which minimizes the sample average of $\phi(\alpha) = e^{-\alpha}$. To see the connection, note that the first few terms of the Taylor expansion of $\ln(1+e^{-2\alpha}) + 1 - \ln 2$ about 0,

$$1 - \alpha + \frac{\alpha^2}{2} - \dots,$$

are identical to those of $e^{-\alpha}$.

While the two functions are very similar near zero, their asymptotic behavior is very different. In general we have

$$\ln(1+e^{-2\alpha}) \le e^{-\alpha};$$

furthermore, the former grows linearly as $\alpha \to -\infty$, whereas the latter grows exponentially. Thus, we can view AdaBoost as approximating the maximum likelihood logistic regression solution, except with (sometimes exponentially) larger penalties for mistakes. A further similarity between the methods is that the minimizing $\alpha^*(\eta)$ for $\phi(\alpha) = \ln(1+e^{-2\alpha})$ is the same as for AdaBoost's exponential loss.
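The closed forms above are easy to check numerically. The following Python sketch (an illustrative aside, not part of the scribed notes; the helper name H_and_argmin and the grid ranges are arbitrary choices) minimizes the conditional $\phi$-risk on a fine grid to verify $H(\eta)$, $\alpha^*(\eta)$, and $\Psi(\theta)$ for the exponential loss, and checks the two claims about the logistic surrogate: it shares the same minimizer $\alpha^*(\eta)$ and satisfies $\ln(1+e^{-2\alpha}) \le e^{-\alpha}$.

```python
# Numerical sanity check (illustrative sketch, not part of the original notes):
# verify the closed forms above for the exponential loss phi(a) = exp(-a), and
# compare it with the scaled logistic loss phi(a) = ln(1 + exp(-2a)).
import numpy as np

alphas = np.linspace(-10.0, 10.0, 200_001)  # dense grid over alpha

def exp_loss(a):
    return np.exp(-a)

def logistic_loss(a):
    return np.log1p(np.exp(-2.0 * a))

def H_and_argmin(eta, phi):
    """Conditional phi-risk eta*phi(a) + (1-eta)*phi(-a), minimized over the grid."""
    risk = eta * phi(alphas) + (1.0 - eta) * phi(-alphas)
    i = int(np.argmin(risk))
    return risk[i], alphas[i]

for eta in (0.1, 0.3, 0.7, 0.9):
    H_num, a_num = H_and_argmin(eta, exp_loss)
    H_closed = 2.0 * np.sqrt(eta * (1.0 - eta))     # H(eta) = 2 sqrt(eta (1 - eta))
    a_closed = 0.5 * np.log(eta / (1.0 - eta))      # alpha*(eta) = ln sqrt(eta / (1 - eta))
    _, a_log = H_and_argmin(eta, logistic_loss)     # logistic loss: same minimizer
    print(f"eta={eta:.1f}  H: {H_num:.4f} vs {H_closed:.4f}  "
          f"alpha*: {a_num:.3f} vs {a_closed:.3f} (logistic: {a_log:.3f})")

# Psi(theta) = phi(0) - H((1 + theta)/2) should equal 1 - sqrt(1 - theta^2).
theta = np.linspace(-0.99, 0.99, 9)
psi_num = np.array([1.0 - H_and_argmin((1.0 + t) / 2.0, exp_loss)[0] for t in theta])
print("Psi matches:", np.allclose(psi_num, 1.0 - np.sqrt(1.0 - theta**2), atol=1e-4))

# The two surrogates agree near 0 (up to the constant 1 - ln 2), but globally
# ln(1 + exp(-2a)) <= exp(-a): linear vs. exponential growth as a -> -infinity.
a = np.linspace(-5.0, 5.0, 1001)
print("logistic <= exponential:", bool(np.all(logistic_loss(a) <= exp_loss(a))))
```

The grid minimization is only there to confirm the calculus; in practice one would use the closed forms directly.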
3 Risk Bounds and Uniform Convergence

So far, we've looked at algorithms (including AdaBoost) that optimize over a set of training samples:

$$\min_{f \in F} \hat{R}(f) = \hat{E}\,l(y, f(x)) = \frac{1}{n}\sum_{i=1}^n l(y_i, f(x_i)).$$

If the empirical minimizer is $\hat{f}$, we are interested in bounding the true loss $R(\hat{f}) = E\,l(y, \hat{f}(x))$ of this function. In particular, we hope that $R(\hat{f})$ will converge to $\inf_{f \in F} R(f)$ as $n \to \infty$.

For the (trivial) case where our function class $F$ contains only a single function, we can simply appeal to the law of large numbers. For example, in the case of discrete loss, the Chernoff bound gives an upper bound on $\Pr(|\hat{R}(f) - R(f)| > \epsilon)$ that shrinks exponentially in $n$ for any given $\epsilon$.

This argument, however, fails when $F$ is not a singleton. We cannot simply apply the law of large numbers to each $f \in F$ and then argue that the desired property holds when minimizing over all of $F$. The problem is that we are considering $R(\operatorname{argmin}_{f \in F} \hat{R}(f))$, where the inner part depends on the data. In particular, if $F$ is such that for any $n$ and any data set there are functions $f \in F$ with small $\hat{R}(f)$ but large $R(f)$, then choosing an $f$ that minimizes $\hat{R}(f)$ need not tend to minimize $R(f)$.

Example. Let $F = F_+ \cup F_-$ with

$$F_+ = \{x \mapsto f(x) : |\{x : f(x) = -1\}| < \infty\},$$
$$F_- = \{x \mapsto f(x) : |\{x : f(x) = +1\}| < \infty\},$$

that is, functions in $F_+$ predict $+1$ everywhere except at finitely many points, and functions in $F_-$ predict $-1$ everywhere except at finitely many points. Note that for any finite sequence, we can choose an $f$ from either $F_+$ or $F_-$ that explains it. Now, suppose we have a distribution $P$ such that $P(Y = 1 \mid X) = 0.95$ almost surely, and for all $x$, $P(X = x) = 0$. Then we have

$$f \in F_+ \Rightarrow R(f) = 0.05 = R^*,$$
$$f \in F_- \Rightarrow R(f) = 0.95 > R^*,$$

where $R^*$ is the Bayes risk. However, for any finite sample there is an $f \in F_-$ with $\hat{R}(f) = 0$ but $R(f) - R^* = 0.9$. So, choosing a function from a class via empirical risk minimization does not guarantee risk minimization with such a rich class. Restated,

$$R\bigl(\operatorname{argmin}_{f \in F} \hat{R}(f)\bigr) \ne \inf_{f \in F} R(f).$$

Example. If the set of functions $F \subseteq \{+1,-1\}^X$ is finite, then we can say something about the true risk given that the empirical risk is zero. The following theorem makes this explicit.

Theorem 3.1.

$$\Pr\bigl(\exists f \in F : \hat{R}(f) = 0 \text{ and } R(f) \ge \epsilon\bigr) \le |F|\,e^{-n\epsilon},$$

i.e., with probability at least $1 - \delta$, if $\hat{R}(f) = 0$, then

$$R(f) \le \frac{\log |F|}{n} + \frac{\log(1/\delta)}{n}.$$

Proof. We use properties of the exponential function and the union bound. For any $f \in F$ with $R(f) \ge \epsilon$, we have

$$\Pr(\hat{R}(f) = 0) \le (1-\epsilon)^n = \exp(n \log(1-\epsilon)) \le \exp(-n\epsilon). \quad (1)$$

Using the union bound (Boole's inequality: the probability of a union of events is no more than the sum of their probabilities), we have

$$\Pr\bigl(\exists f \in F : R(f) \ge \epsilon \text{ and } \hat{R}(f) = 0\bigr) \le \sum_{f \in F} \Pr\bigl(R(f) \ge \epsilon \text{ and } \hat{R}(f) = 0\bigr) \le |F|\,e^{-n\epsilon}. \quad (2)$$

Example. (Decision Trees) Consider the class of decision trees with a finite number of nodes $N$ over $x \in \{+1,-1\}^d$. Then $|F| \le (d+2)^N$, because we can specify the tree by listing, in breadth-first order, the $N$ nodes of the tree, each of which is either one of the $d$ covariates or one of the outputs $\{+1,-1\}$. Thus, if $\hat{R}(f) = 0$, then with probability at least $1 - \delta$,

$$R(f) \le \frac{N \log(d+2)}{n} + \frac{\log(1/\delta)}{n}.$$
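To make the bound concrete, here is a small Python helper (an illustrative sketch, not from the notes; the function names and example numbers are invented for illustration) that evaluates the guarantee of Theorem 3.1 for a zero-empirical-risk classifier, together with its decision-tree instantiation $|F| \le (d+2)^N$.

```python
# Illustrative sketch (not from the notes): evaluate the Theorem 3.1 guarantee
#   R(f) <= log|F|/n + log(1/delta)/n   whenever   Rhat(f) = 0,
# and its instantiation for N-node decision trees over d binary covariates,
# where |F| <= (d + 2)^N.
import math

def finite_class_bound(log_card, n, delta):
    """Bound on R(f), holding with prob. >= 1 - delta, for any f with zero empirical risk.

    log_card is the natural log of |F| (passing the log avoids huge integers).
    """
    return (log_card + math.log(1.0 / delta)) / n

def decision_tree_bound(N, d, n, delta):
    """Same bound with log|F| replaced by the counting estimate N * log(d + 2)."""
    return finite_class_bound(N * math.log(d + 2), n, delta)

if __name__ == "__main__":
    # Arbitrary example numbers: 20-node trees over 50 binary features,
    # n = 10,000 samples, confidence 1 - delta = 0.95.
    print(decision_tree_bound(N=20, d=50, n=10_000, delta=0.05))  # about 0.0082
```

Note how the bound degrades only logarithmically in $|F|$ and $1/\delta$, but linearly in the number of tree nodes $N$.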


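Returning to the $F = F_+ \cup F_-$ example above, the failure of empirical risk minimization over that class is easy to see in simulation. The sketch below (not part of the notes; the uniform distribution for $X$ is an arbitrary choice, any continuous distribution works) builds a classifier in $F_-$ that memorizes the training points, achieves zero empirical risk, and still has true risk near $0.95$, far above $R^* = 0.05$.

```python
# Simulation sketch (not from the notes) of the F = F_+ u F_- example: a
# classifier from F_- that memorizes the training points has zero empirical
# risk, but its true risk stays near 0.95, far above the Bayes risk 0.05.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000
x_train = rng.uniform(size=n)                           # continuous X, so P(X = x) = 0
y_train = np.where(rng.uniform(size=n) < 0.95, 1, -1)   # P(Y = 1 | X) = 0.95

# f in F_-: predicts -1 everywhere except on the finitely many training
# points that were labeled +1 (a set we simply memorize).
positives = set(x_train[y_train == 1])

def f_minus(x):
    return 1 if x in positives else -1

train_preds = np.array([f_minus(x) for x in x_train])
print("empirical risk:", np.mean(train_preds != y_train))        # 0.0

# Fresh draw to estimate the true risk R(f_minus); new x's are almost surely
# unseen, so f_minus predicts -1 and errs exactly when y = +1.
x_test = rng.uniform(size=100_000)
y_test = np.where(rng.uniform(size=100_000) < 0.95, 1, -1)
test_preds = np.array([f_minus(x) for x in x_test])
print("estimated true risk:", np.mean(test_preds != y_test))     # ~0.95
```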