EECS 281B / STAT 241B: Advanced Topics in Statistical Learning        Spring 2009
Lecture 10 — February 23
Lecturer: Martin Wainwright        Scribe: Dave Golland

Note: These lecture notes are still rough, and have only been mildly proofread.

10.1 Announcements

• HW #2 due.
• HW #3: posted, due Monday, March 9th.

10.2 Outline

• Classification and surrogate losses
• Hinge loss and SVM
• Statistical consequences

10.3 Classification and Surrogate Losses

10.3.1 Classification

Classification Problem Statement

The standard classification problem is stated as follows: given a set of samples \{(x^{(i)}, y^{(i)}) \mid i = 1, \ldots, n\}, a class of functions \mathcal{F}, a loss function, and a regularization function \Omega, find the function f \in \mathcal{F} that minimizes the sum of the loss on each point in the set and the regularization of f.

The regularizer \Omega controls the size or complexity of f. It is necessary for avoiding fitting noise in the data.

Binary Classification

For binary classification, each element x^{(i)} \in \mathcal{X} in the pair should be thought of as a measurement, and y^{(i)} \in \{-1, +1\} as the corresponding label.

Recall that (one form of) binary classification involves the 0-1 loss:

    \mathbb{I}[y \neq \mathrm{sign}(f(x))] = \begin{cases} 1 & \text{if } y \neq \mathrm{sign}(f(x)) \\ 0 & \text{otherwise} \end{cases}        (10.1)

Given some class of functions \mathcal{F} and regularizer \Omega, under this loss the classification problem becomes:

    \min_{f \in \mathcal{F}} \; \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}[y^{(i)} \neq \mathrm{sign}(f(x^{(i)}))] + \lambda_n \Omega(f)        (10.2)

Non-Convexity

This optimization using the 0-1 loss is unpleasant because the indicator function \mathbb{I} is not convex.

To visualize the non-convexity, we write the 0-1 loss as a function of the margin t (associated with a specific pair (x, y)):

    t = y f(x)        (10.3)

Hence, the 0-1 loss becomes:

    \mathbb{I}[y \neq \mathrm{sign}(f(x))] = \mathbb{I}_{<0}[t] = \begin{cases} 1 & \text{if } t < 0 \\ 0 & \text{otherwise} \end{cases}        (10.4)

This form of the loss is illustrated in Figure 10.1.

[Figure 10.1: plot of the 0-1 loss versus the margin t.]

In general, there are no guarantees about minimization with non-convex objective functions.

10.4 Surrogate Losses

Many practical methods replace the non-convex 0-1 loss, \mathbb{I}_{<0}, with a convex surrogate loss function \phi.

10.4.1 Examples of Common Surrogate Loss Functions

The following are examples of common convex surrogate loss functions. As with \mathbb{I}_{<0} above, these loss functions are defined in terms of the margin t (see (10.3)).

Hinge Loss

The hinge loss is defined as follows:

    \phi_{\mathrm{hinge}}(t) = \max(0, 1 - t) = (1 - t)_+        (10.5)

[Figure 10.2: plot of the hinge loss.]

Comments

• \phi_{\mathrm{hinge}}(t) is not differentiable at t = 1. However, \phi_{\mathrm{hinge}} can still be optimized despite this non-differentiable point because the function is still convex.
• The SVM solves the classification problem by using the hinge loss as a surrogate for the 0-1 loss.

Exponential Loss

The exponential loss function is defined as follows:

    \phi_{\mathrm{exp}}(t) = \exp(-t)        (10.6)

[Figure 10.3: plot of the exponential loss.]

Comments

Boosting algorithms solve the classification problem by using the exponential loss as a surrogate for the 0-1 loss.

Logistic Loss

The logistic loss function is defined as follows:

    \phi_{\mathrm{logi}}(t) = \log(1 + e^{-t})        (10.7)

[Figure 10.4: plot of the logistic loss.]
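Since all four losses are simple functions of the margin, it is easy to tabulate them side by side. The following is a minimal sketch (mine, not part of the original notes) that evaluates (10.4)-(10.7) on a grid of margin values, mirroring Figures 10.1-10.4; the function names are my own.

```python
# Sketch: evaluating the 0-1 loss and the three convex surrogates
# of Section 10.4 as functions of the margin t = y * f(x).
import numpy as np

def zero_one(t):     # I_{<0}[t], Eq. (10.4)
    return (t < 0).astype(float)

def hinge(t):        # max(0, 1 - t), Eq. (10.5)
    return np.maximum(0.0, 1.0 - t)

def exponential(t):  # exp(-t), Eq. (10.6)
    return np.exp(-t)

def logistic(t):     # log(1 + exp(-t)), Eq. (10.7)
    return np.logaddexp(0.0, -t)  # numerically stable form

t = np.linspace(-2.0, 2.0, 9)
for name, phi in [("0-1", zero_one), ("hinge", hinge),
                  ("exp", exponential), ("logistic", logistic)]:
    print(f"{name:8s}", np.round(phi(t), 3))
```

Note how the hinge and exponential losses dominate the 0-1 loss pointwise, and all three surrogates are convex in t, which is exactly what makes the optimization in the next subsection tractable.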
10.4.2 Classification with Surrogate Loss

After picking some convex surrogate loss \phi, the classification problem becomes:

    \min_{f \in \mathcal{F}} \left\{ \frac{1}{n} \sum_{i=1}^{n} \phi\big(y^{(i)} f(x^{(i)})\big) + \lambda_n \Omega(f) \right\}        (10.8)

Special Case: \mathcal{F} \equiv RKHS

We can take the class of classification functions \mathcal{F} to be a reproducing kernel Hilbert space \mathcal{H}, and the regularization function \Omega to be \Omega(f) = \frac{1}{2}\|f\|_{\mathcal{H}}^2.

In this case, the representer theorem applies and we can solve the problem by duality.

10.5 Conjugate Duality

10.5.1 Useful Form of the Dual

Given a convex function \phi : \mathbb{R} \to \mathbb{R} \cup \{+\infty\}, we can define its conjugate function:

    \phi^*(s) = \sup_{t \in \mathbb{R}} \{ st - \phi(t) \}        (10.9)

Proposition: when \phi is convex and \Omega(f) = \frac{1}{2}\|f\|_{\mathcal{H}}^2, problem (10.8) has the dual form:

    \max_{\alpha \in \mathbb{R}^n} \left\{ -\frac{1}{\lambda} \sum_{i=1}^{n} \phi^*(-\lambda \alpha_i) - \frac{1}{2} \alpha^T [y y^T \odot K] \alpha \right\}        (10.10)

Note: (y y^T \odot K)_{ij} = y^{(i)} y^{(j)} K_{ij}, where the operation \odot is known as the Hadamard (elementwise) product, y = (y^{(1)}, \ldots, y^{(n)})^T is the label vector, and K_{ij} = K(x^{(i)}, x^{(j)}) is the kernel Gram matrix.

The optimal solution to (10.8) is:

    \hat{f}(\cdot) = \sum_{i=1}^{n} \hat{\alpha}_i K(\cdot, x^{(i)})        (10.11)

The proof is in Nguyen et al., as posted on the course webpage.

10.5.2 Geometry of Conjugacy

Given a convex function \phi(t), consider -\phi^*(s), the y-intercept of the supporting hyperplane to \phi(t) with normal (s, -1). (See Figure 10.5.)

Note: when s < 0, the supporting hyperplane has a negative slope. Since in this example \phi is never decreasing, no hyperplane with negative slope supports \phi, so the sup in (10.9) is unbounded: the intercept -\phi^*(s) goes to -\infty when s < 0, meaning the dual takes the value -\infty when s < 0. When the dual takes the value -\infty at some points, it acts as a hard constraint.

[Figure 10.5: geometry of conjugacy.]

10.5.3 Example: Hinge Loss

Recall that the hinge loss is defined as:

    \phi(t) = \max(0, 1 - t)        (10.12)

We compute the conjugate (dual) of the hinge loss:

    \phi^*_{\mathrm{hinge}}(s) = \sup_{t \in \mathbb{R}} \{ st - \max(0, 1 - t) \}        (10.13)

We do a change of variables, u \leftarrow 1 - t:

    \phi^*_{\mathrm{hinge}}(s) = s + \sup_{u \in \mathbb{R}} \{ -su - \max(0, u) \}        (10.14)

This allows us to easily plot \phi and visually determine \phi^* for different values of s. (See Figure 10.6.)

From the graph it is clear that the y-intercept of the supporting hyperplane is given by:

    \phi^*_{\mathrm{hinge}}(s) = \begin{cases} +\infty & \text{if } s > 0 \text{ or } s < -1 \\ s & \text{otherwise} \end{cases}        (10.15)

Note that the quantity inside the sup is differentiable everywhere except at the point t = 1.

[Figure 10.6: the hinge loss dual.]

Hence,

    -\frac{1}{\lambda} \phi^*_{\mathrm{hinge}}(-\lambda \alpha_i) = \begin{cases} \alpha_i & \text{if } 0 \le \alpha_i \le \frac{1}{\lambda} \\ -\infty & \text{otherwise} \end{cases}        (10.16)

Therefore, (10.10) takes the dual form:

    \max_{0 \le \alpha_i \le \frac{1}{\lambda}, \; i = 1, \ldots, n} \left\{ \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \alpha^T [y y^T \odot K] \alpha \right\}        (10.17)

The big picture: keep this result in mind, and you can skip the dirty, hands-on computation.

10.6 Classification Calibration

What are the statistical consequences of optimizing a \phi-loss instead of the 0-1 loss? In this section, we pursue the question of why classification still works with the hinge loss.

10.6.1 Motivation and Definition

Defining H_\phi

We define the Bayes error to be:

    R^* = \inf_f \mathbb{E}\big[\mathbb{I}[Y \neq f(X)]\big]        (10.18)

We define the optimal \phi-risk to be:

    R^*_\phi = \inf_f \mathbb{E}\big[\phi(Y f(X))\big]
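As a concrete illustration of how the dual (10.17) gets used in practice, here is a hedged sketch (mine, not from the notes) that maximizes the hinge-loss dual by projected gradient ascent on a toy data set. The RBF kernel, step size, iteration count, and the recovery f = \sum_i \alpha_i y^{(i)} K(\cdot, x^{(i)}) (the standard SVM convention) are illustrative assumptions, not prescriptions from the lecture.

```python
# Sketch: projected gradient ascent on the SVM dual (10.17):
#   max  sum_i alpha_i - 0.5 * alpha^T (yy^T ⊙ K) alpha
#   s.t. 0 <= alpha_i <= 1/lambda
import numpy as np

rng = np.random.default_rng(0)
n, lam = 40, 0.1
X = rng.normal(size=(n, 2))
y = np.sign(X[:, 0] + X[:, 1])            # labels in {-1, +1}

# Kernel Gram matrix K_ij = K(x^(i), x^(j)), here an RBF kernel (assumption).
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-0.5 * sq)
Q = np.outer(y, y) * K                    # yy^T ⊙ K (Hadamard product)

alpha = np.zeros(n)
step = 1.0 / np.linalg.norm(Q, 2)         # 1/L step for the quadratic part
for _ in range(2000):
    grad = 1.0 - Q @ alpha                # gradient of the dual objective
    alpha = np.clip(alpha + step * grad, 0.0, 1.0 / lam)  # project onto box

f = K @ (alpha * y)                       # decision values at training points
print("dual objective :", alpha.sum() - 0.5 * alpha @ Q @ alpha)
print("training error :", np.mean(np.sign(f) != y))
```

The projection step is where the "hard constraint" picture of Section 10.5.2 shows up: the values of \alpha_i at which (10.16) equals -\infty are exactly those excluded by the box constraint that np.clip enforces.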

