EECS 281B / STAT 241B: Advanced Topics in Statistical Learning        Spring 2009

Lecture 10 — February 23

Lecturer: Martin Wainwright    Scribe: Dave Golland

Note: These lecture notes are still rough, and have only been mildly proofread.

10.1 Announcements

• HW #2 due.
• HW #3: posted, due Monday, March 9th.

10.2 Outline

• Classification and surrogate losses
• Hinge loss and SVM
• Statistical consequences

10.3 Classification and Surrogate Losses

10.3.1 Classification

Classification Problem Statement

The standard classification problem is stated as follows: given a set of samples {(x^{(i)}, y^{(i)}) | i = 1, . . . , n}, a class of functions F, a loss function, and a regularization function Ω, find the function f ∈ F that minimizes the sum of the loss over the samples plus the regularization of f.

The regularizer Ω controls the size or complexity of f. It is necessary to avoid fitting noise in the data.

Binary Classification

For binary classification, each element x^{(i)} ∈ X of a pair should be thought of as a measurement, and y^{(i)} ∈ {−1, +1} as the corresponding label.

Recall that (one form of) binary classification involves the 0-1 loss:

\[
\mathbb{I}[y \neq \operatorname{sign}(f(x))] =
\begin{cases}
1 & \text{if } y \neq \operatorname{sign}(f(x)) \\
0 & \text{otherwise}
\end{cases}
\tag{10.1}
\]

Given some class of functions F and regularizer Ω, under this loss the classification problem becomes:

\[
\min_{f \in \mathcal{F}} \; \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}\big[y^{(i)} \neq \operatorname{sign}(f(x^{(i)}))\big] + \lambda_n \Omega(f)
\tag{10.2}
\]

Non-Convexity

This optimization with the 0-1 loss is unpleasant because the indicator function I is not convex.

To visualize the non-convexity, we write the 0-1 loss as a function of the margin t (associated with a specific pair (x, y)):

\[
t = y f(x)
\tag{10.3}
\]

Hence, the 0-1 loss becomes:

\[
\mathbb{I}[y \neq \operatorname{sign}(f(x))] = \mathbb{I}_{<0}[t] =
\begin{cases}
1 & \text{if } t < 0 \\
0 & \text{otherwise.}
\end{cases}
\tag{10.4}
\]

This form of the loss is illustrated in Figure 10.1.

Figure 10.1.
Plot of the 0-1 loss versus the margin.

In general, there are no guarantees about minimization with non-convex objective functions.

10.4 Surrogate Losses

Many practical methods replace the non-convex 0-1 loss, I_{<0}, with a convex surrogate loss function, φ.

10.4.1 Examples of Common Surrogate Loss Functions

The following are examples of common convex surrogate loss functions. As with I_{<0} above, these loss functions are defined in terms of the margin t (see (10.3)).

Hinge Loss

The hinge loss is defined as follows:

\[
\phi_{\mathrm{hinge}}(t) = \max(0, 1 - t) = (1 - t)_{+}
\tag{10.5}
\]

(Shown in Figure 10.2.)

Figure 10.2. Plot of hinge loss

Comments

• φ_hinge(t) is not differentiable at t = 1. However, φ_hinge can still be optimized despite this non-differentiable point because the function is still convex.
• The SVM solves the classification problem by using the hinge loss as a surrogate for the 0-1 loss.

Exponential Loss

The exponential loss function is defined as follows:

\[
\phi_{\mathrm{exp}}(t) = \exp(-t)
\tag{10.6}
\]

(Shown in Figure 10.3.)

Figure 10.3. Plot of exponential loss

Comments

Boosting algorithms solve the classification problem by using the exponential loss as a surrogate for the 0-1 loss.

Logistic Loss

The logistic loss function is defined as follows:

\[
\phi_{\mathrm{logi}}(t) = \log(1 + e^{-t})
\tag{10.7}
\]

(Shown in Figure 10.4.)

Figure 10.4.
Plot of logistic loss

10.4.2 Classification with Surrogate Loss

After picking some convex surrogate loss φ, the classification problem becomes:

\[
\min_{f \in \mathcal{F}} \left\{ \frac{1}{n} \sum_{i=1}^{n} \phi\big(y^{(i)} f(x^{(i)})\big) + \lambda_n \Omega(f) \right\}
\tag{10.8}
\]

Special Case: F ≡ RKHS

We can fix the class of functions F to be a reproducing kernel Hilbert space H, and the regularization function to be Ω(f) = (1/2) ‖f‖²_H. In this case, the representer theorem applies and we can solve the problem by duality.

10.5 Conjugate Duality

10.5.1 Useful Form of the Dual

Given a convex function φ : R → R ∪ {+∞}, we can define its conjugate function:

\[
\phi^{*}(s) = \sup_{t \in \mathbb{R}} \{ st - \phi(t) \}
\tag{10.9}
\]

Proposition: when φ is convex and Ω(f) = (1/2) ‖f‖²_H, problem (10.8) has the dual form:

\[
\max_{\alpha \in \mathbb{R}^{n}} \left\{ -\frac{1}{\lambda} \sum_{i=1}^{n} \phi^{*}(-\lambda \alpha_i) - \frac{1}{2} \alpha^{T} \big[ y y^{T} \odot K \big] \alpha \right\}
\tag{10.10}
\]

Note: (y y^T ⊙ K)_{ij} = y^{(i)} y^{(j)} K_{ij}. The operation ⊙ is known as the "Hadamard" product, y = (y^{(1)}, . . . , y^{(n)})^T is the label vector, and K_{ij} = K(x^{(i)}, x^{(j)}) is the kernel Gram matrix.

The optimal solution to (10.8) is:

\[
\hat{f}(\cdot) = \sum_{i=1}^{n} \hat{\alpha}_i K(\cdot, x^{(i)})
\tag{10.11}
\]

The proof is in Nguyen et al., as posted on the course webpage.

10.5.2 Geometry of Conjugacy

Given a convex function φ(t), consider −φ*(s), the y-intercept of the supporting hyperplane to φ(t) with normal (s, −1). (See Figure 10.5.)

Note: when s < 0, the supporting hyperplane has a negative slope. Since in this example φ is nondecreasing, the intercept −φ*(s) goes to −∞ when s < 0, meaning the dual objective takes the value −∞ when s < 0. When the dual takes the value −∞ at some points, it acts as a hard constraint.

Figure 10.5. Geometry of Conjugacy

10.5.3 Example: Hinge Loss

Recall that the hinge loss is defined as:

\[
\phi(t) = \max(0, 1 - t)
\tag{10.12}
\]

We compute the conjugate of the hinge loss:

\[
\phi^{*}_{\mathrm{hinge}}(s) = \sup_{t \in \mathbb{R}} \{ st - \max(0, 1 - t) \}
\tag{10.13}
\]

We do a change of variables, u ← 1 − t:

\[
\phi^{*}_{\mathrm{hinge}}(s) = s + \sup_{u \in \mathbb{R}} \{ -su - \max(0, u) \}
\tag{10.14}
\]

This allows us to easily plot φ and visually determine φ* for different values of s. (See
(SeeFigure 10.6)From the graph it is clear that the y-intercept of the supporting hyperplane is given by:φ∗hinge(s) =(+∞ if s > 0 or s < −1s otherwise(10.15)Note that the quantity inside the sup is differentiable everywhere except at the pointt = 1.10-6EECS 281B / STAT 241B Lecture 10 — February 23 Spring 2009Figure 10.6. Hinge Loss DualHence,−1λφ∗hinge(−λαi) =(αiif 0 ≤ αi≤1λ−∞ otherwise(10.16)Therefore, 10.10 has dual form:max0≤αi≤1λ|i=1,...,n(nXi=1αi−12αTyyT⊙ Kα)(10.17)The big picture is: keep this result in mind to skip the dirty, hands-on computation.10.6 Classification CalibrationWhat are t he statistical consequences of optimizing a φ-loss instead of 0-1 loss? In thissection, we pursue the question o f why classification still works with hinge loss.10.6.1 Motivation and DefinitionDefining HφWe define the Bayes’ errorto be:R∗= inffE[I[Y 6= f(X)]] (10.18)We define the optimal φ riskto be:R∗φ= inffE[φ(Y f(X))]
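As a concrete illustration of the hinge-loss dual (10.17), here is a small sketch (mine, not from the notes) that solves the box-constrained problem by projected gradient ascent on a toy one-dimensional dataset with the linear kernel K(x, x′) = x x′ and λ = 1. The step size and iteration count are ad hoc choices, and when recovering the classifier the labels are folded into the expansion coefficients of (10.11) (i.e. the representer coefficients are taken as α̂_i y^{(i)}, the standard SVM parameterization).

```python
import numpy as np

# Toy 1-D training set: two points per class, separable at the origin
x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([-1.0, -1.0, 1.0, 1.0])

K = np.outer(x, x)      # linear-kernel Gram matrix, K_ij = x_i * x_j
Q = np.outer(y, y) * K  # yy^T Hadamard K, the quadratic term in (10.17)

lam = 1.0               # regularization weight; box constraint is 0 <= alpha_i <= 1/lam
C = 1.0 / lam

# Projected gradient ascent on g(alpha) = sum(alpha) - 0.5 * alpha^T Q alpha:
# ascend along the gradient 1 - Q alpha, then project back onto the box [0, C]^n
alpha = np.zeros_like(x)
eta = 0.05              # step size below 1 / lambda_max(Q), so the ascent is stable
for _ in range(2000):
    alpha = np.clip(alpha + eta * (1.0 - Q @ alpha), 0.0, C)

# Decision values f_hat(x_j) = sum_i (alpha_i y_i) K(x_j, x_i), per (10.11)
f_hat = K @ (alpha * y)
assert np.all(np.sign(f_hat) == y)  # the dual solution separates the training set
```

For this symmetric dataset the iterates converge to α̂ = (0, 1/2, 1/2, 0), so f̂(x) = x: only the two points closest to the decision boundary end up as support vectors, while the outer points get zero dual weight.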