EECS 281B / STAT 241B: Advanced Topics in Statistical Learning        Spring 2009

Lecture 10 — February 23

Lecturer: Martin Wainwright    Scribe: Dave Golland

Note: These lecture notes are still rough, and have only been mildly proofread.

10.1 Announcements

• HW #2 due.
• HW #3: posted, due Monday, March 9th.

10.2 Outline

• Classification and surrogate losses
• Hinge loss and SVM
• Statistical consequences

10.3 Classification and Surrogate Losses

10.3.1 Classification

Classification Problem Statement

The standard classification problem is stated as follows: given a set of samples {(x^{(i)}, y^{(i)}) | i = 1, . . . , n}, a class of functions F, a loss function, and a regularization function Ω, find the function f ∈ F that minimizes the sum of the loss over the samples plus the regularization of f.

The regularizer Ω controls the size or complexity of f. It is necessary to avoid fitting noise in the data.

Binary Classification

For binary classification, each element x^{(i)} ∈ X of a pair should be thought of as a measurement, and y^{(i)} ∈ {−1, +1} as the corresponding label.

Recall that (one form of) binary classification involves the 0-1 loss:

\[
\mathbb{I}[y \neq \operatorname{sign}(f(x))] =
\begin{cases}
1 & \text{if } y \neq \operatorname{sign}(f(x)) \\
0 & \text{otherwise}
\end{cases}
\tag{10.1}
\]

Given some class of functions F and regularizer Ω, under this loss the classification problem becomes:

\[
\min_{f \in \mathcal{F}} \; \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}\big[y^{(i)} \neq \operatorname{sign}(f(x^{(i)}))\big] + \lambda_n \Omega(f)
\tag{10.2}
\]

Non-Convexity

This optimization with the 0-1 loss is unpleasant because the indicator function I is not convex.

To visualize the non-convexity, we write the 0-1 loss as a function of the margin t (associated with a specific pair (x, y)):

\[
t = y f(x)
\tag{10.3}
\]

Hence, the 0-1 loss becomes:

\[
\mathbb{I}[y \neq \operatorname{sign}(f(x))] = \mathbb{I}_{<0}[t] =
\begin{cases}
1 & \text{if } t < 0 \\
0 & \text{otherwise.}
\end{cases}
\tag{10.4}
\]

This form of the loss is illustrated in Figure 10.1.

Figure 10.1.
Plot of the 0-1 loss versus the margin.

In general, there are no guarantees about minimization with non-convex objective functions.

10.4 Surrogate Losses

Many practical methods replace the non-convex 0-1 loss, I_{<0}, with a convex surrogate loss function, φ.

10.4.1 Examples of Common Surrogate Loss Functions

The following are examples of common convex surrogate loss functions. As with I_{<0} above, these loss functions are defined in terms of the margin t (see (10.3)).

Hinge Loss

The hinge loss is defined as follows:

\[
\phi_{\mathrm{hinge}}(t) = \max(0, 1 - t) = (1 - t)_{+}
\tag{10.5}
\]

(Shown in Figure 10.2.)

Figure 10.2. Plot of hinge loss

Comments

• φ_hinge(t) is not differentiable at t = 1. However, φ_hinge can still be optimized despite this non-differentiable point because the function is still convex.
• The SVM solves the classification problem by using the hinge loss as a surrogate for the 0-1 loss.

Exponential Loss

The exponential loss function is defined as follows:

\[
\phi_{\mathrm{exp}}(t) = \exp(-t)
\tag{10.6}
\]

(Shown in Figure 10.3.)

Figure 10.3. Plot of exponential loss

Comments

Boosting algorithms solve the classification problem by using the exponential loss as a surrogate for the 0-1 loss.

Logistic Loss

The logistic loss function is defined as follows:

\[
\phi_{\mathrm{logi}}(t) = \log(1 + e^{-t})
\tag{10.7}
\]

(Shown in Figure 10.4.)

Figure 10.4.
Plot of logistic loss

10.4.2 Classification with Surrogate Loss

After picking some convex surrogate loss φ, the classification problem becomes:

\[
\min_{f \in \mathcal{F}} \left\{ \frac{1}{n} \sum_{i=1}^{n} \phi\big(y^{(i)} f(x^{(i)})\big) + \lambda_n \Omega(f) \right\}
\tag{10.8}
\]

Special Case: F ≡ RKHS

We can fix the class of functions F to be a reproducing kernel Hilbert space H, and the regularization function to be Ω(f) = (1/2) ‖f‖²_H. In this case, the representer theorem applies and we can solve the problem by duality.

10.5 Conjugate Duality

10.5.1 Useful Form of the Dual

Given a convex function φ : R → R ∪ {+∞}, we can define its conjugate function:

\[
\phi^{*}(s) = \sup_{t \in \mathbb{R}} \{ st - \phi(t) \}
\tag{10.9}
\]

Proposition: when φ is convex and Ω(f) = (1/2) ‖f‖²_H, problem (10.8) has the dual form:

\[
\max_{\alpha \in \mathbb{R}^{n}} \left\{ -\frac{1}{\lambda} \sum_{i=1}^{n} \phi^{*}(-\lambda \alpha_i) - \frac{1}{2} \alpha^{T} \big[ y y^{T} \odot K \big] \alpha \right\}
\tag{10.10}
\]

Note: (y y^T ⊙ K)_{ij} = y^{(i)} y^{(j)} K_{ij}. The operation ⊙ is known as the "Hadamard" product, y = (y^{(1)}, . . . , y^{(n)})^T is the label vector, and K_{ij} = K(x^{(i)}, x^{(j)}) is the kernel Gram matrix.

The optimal solution to (10.8) is:

\[
\hat{f}(\cdot) = \sum_{i=1}^{n} \hat{\alpha}_i K(\cdot, x^{(i)})
\tag{10.11}
\]

The proof is in Nguyen et al., as posted on the course webpage.

10.5.2 Geometry of Conjugacy

Given a convex function φ(t), consider −φ*(s), the y-intercept of the supporting hyperplane to φ(t) with normal (s, −1). (See Figure 10.5.)

Note: when s < 0, the supporting hyperplane has a negative slope. Since in this example φ is nondecreasing, the intercept −φ*(s) goes to −∞ when s < 0, meaning the dual objective takes the value −∞ when s < 0. When the dual takes the value −∞ at some points, it acts as a hard constraint.

Figure 10.5. Geometry of Conjugacy

10.5.3 Example: Hinge Loss

Recall that the hinge loss is defined as:

\[
\phi(t) = \max(0, 1 - t)
\tag{10.12}
\]

We compute the conjugate of the hinge loss:

\[
\phi^{*}_{\mathrm{hinge}}(s) = \sup_{t \in \mathbb{R}} \{ st - \max(0, 1 - t) \}
\tag{10.13}
\]

We do a change of variables, u ← 1 − t:

\[
\phi^{*}_{\mathrm{hinge}}(s) = s + \sup_{u \in \mathbb{R}} \{ -su - \max(0, u) \}
\tag{10.14}
\]

This allows us to easily plot φ and visually determine φ* for different values of s. (See
(SeeFigure 10.6)From the graph it is clear that the y-intercept of the supporting hyperplane is given by:φ∗hinge(s) =(+∞ if s > 0 or s < −1s otherwise(10.15)Note that the quantity inside the sup is differentiable everywhere except at the pointt = 1.10-6EECS 281B / STAT 241B Lecture 10 — February 23 Spring 2009Figure 10.6. Hinge Loss DualHence,−1λφ∗hinge(−λαi) =(αiif 0 ≤ αi≤1λ−∞ otherwise(10.16)Therefore, 10.10 has dual form:max0≤αi≤1λ|i=1,...,n(nXi=1αi−12αTyyT⊙ Kα)(10.17)The big picture is: keep this result in mind to skip the dirty, hands-on computation.10.6 Classification CalibrationWhat are t he statistical consequences of optimizing a φ-loss instead of 0-1 loss? In thissection, we pursue the question o f why classification still works with hinge loss.10.6.1 Motivation and DefinitionDefining HφWe define the Bayes’ errorto be:R∗= inffE[I[Y 6= f(X)]] (10.18)We define the optimal φ riskto be:R∗φ= inffE[φ(Y f(X))]
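As a concrete illustration of the hinge-loss dual (10.17), here is a small sketch (mine, not from the notes) that solves the box-constrained problem by projected gradient ascent on a toy one-dimensional dataset with the linear kernel K(x, x′) = x x′ and λ = 1. The step size and iteration count are ad hoc choices, and when recovering the classifier the labels are folded into the expansion coefficients of (10.11) (i.e. the representer coefficients are taken as α̂_i y^{(i)}, the standard SVM parameterization).

```python
import numpy as np

# Toy 1-D training set: two points per class, separable at the origin
x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([-1.0, -1.0, 1.0, 1.0])

K = np.outer(x, x)      # linear-kernel Gram matrix, K_ij = x_i * x_j
Q = np.outer(y, y) * K  # yy^T Hadamard K, the quadratic term in (10.17)

lam = 1.0               # regularization weight; box constraint is 0 <= alpha_i <= 1/lam
C = 1.0 / lam

# Projected gradient ascent on g(alpha) = sum(alpha) - 0.5 * alpha^T Q alpha:
# ascend along the gradient 1 - Q alpha, then project back onto the box [0, C]^n
alpha = np.zeros_like(x)
eta = 0.05              # step size below 1 / lambda_max(Q), so the ascent is stable
for _ in range(2000):
    alpha = np.clip(alpha + eta * (1.0 - Q @ alpha), 0.0, C)

# Decision values f_hat(x_j) = sum_i (alpha_i y_i) K(x_j, x_i), per (10.11)
f_hat = K @ (alpha * y)
assert np.all(np.sign(f_hat) == y)  # the dual solution separates the training set
```

For this symmetric dataset the iterates converge to α̂ = (0, 1/2, 1/2, 0), so f̂(x) = x: only the two points closest to the decision boundary end up as support vectors, while the outer points get zero dual weight.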