EECS 281B / STAT 241B: Advanced Topics in Statistical Learning (Spring 2009)
Lecture 3 — January 28
Lecturer: Pradeep Ravikumar
Scribe: Timothy J. Wheeler

Note: These lecture notes are still rough and have only been mildly proofread.

3.1 Recap of last lecture

As in the previous lecture, we assume that $(X, Y) \sim P$, where $X$ takes values in $\mathcal{X} = \mathbb{R}^d$ and $Y$ takes values in $\mathcal{Y} = \{-1, 1\}$. Let $D_n = \{(X^{(i)}, Y^{(i)})\}_{i=1}^n$ be a set of $n$ i.i.d. samples from $P$. Each $\theta \in \mathbb{R}^d$ defines a function $f_\theta(x) = \langle \theta, x \rangle$, and we consider the decision rule $g(X) = \operatorname{sgn}(f_\theta(X))$.

Recall that the margin is defined as
\[
\delta(\theta, D_n) = \min_{1 \le i \le n} \frac{Y^{(i)} \langle \theta, X^{(i)} \rangle}{\|\theta\|_2},
\]
and the radius of the data set is defined as
\[
R(D_n) = \max_{1 \le i \le n} \|X^{(i)}\|_2.
\]
In the previous lecture, we proved that if $D_n$ is linearly separable, then the Perceptron Algorithm converges in $T = R^2/\delta^2$ steps.

3.2 Motivation for maximizing the margin

The 0-1 loss function is given by
\[
\ell(f_\theta(X), Y) =
\begin{cases}
1 & \text{if } \operatorname{sgn}(f_\theta(X)) \ne Y, \\
0 & \text{otherwise}.
\end{cases}
\]
Let $k$ be chosen uniformly at random from the set $\{0, 1, \ldots, n-1\}$, and define the truncated data set $D_{n,k} = \{(X^{(i)}, Y^{(i)})\}_{i=1}^k$. Use the Perceptron Algorithm to compute a classifier $f_{n,k}$ for the set $D_{n,k}$, and let $f_n$ denote the resulting classifier, which depends on the random variables $k$ and $D_n$.

To compute the risk of the classifier $f_n$, we take the expectation of the loss over the random variables $k$, $X$, $Y$, and $D_n$:
\[
\begin{aligned}
\mathbb{E}_k \mathbb{E}_{X,Y,D_n} \ell(f_n(X), Y)
&= \frac{1}{n} \sum_{k=0}^{n-1} \mathbb{E}_{X,Y,D_n} \ell(f_{n,k}(X), Y) \\
&= \frac{1}{n} \sum_{k=0}^{n-1} \mathbb{E}_{D_n} \ell\big(f_{n,k}(X^{(k+1)}), Y^{(k+1)}\big) \\
&\le \frac{1}{n} (\text{number of mistakes}) \\
&\le \frac{R^2}{n\delta^2}.
\end{aligned}
\]
Here, the first equality follows from the fact that $k$ is uniformly distributed. Note that $(X^{(k+1)}, Y^{(k+1)}) \sim P$, and the classifier $f_{n,k}$ is independent of the data $\{(X^{(i)}, Y^{(i)})\}_{i=k+1}^n$, which were not used in training; this gives the second equality. The last inequality follows from the result we proved last lecture: the total number of mistakes made by the Perceptron Algorithm is bounded by $R^2/\delta^2$. Thus, we conclude that, in this case, maximizing the margin minimizes this bound on the risk.

3.3 Max-margin as an optimization problem

Next, we cast the problem of maximizing the margin as an optimization problem:
\[
\max_{\delta \ge 0,\ \theta \in \mathbb{R}^d} \ \delta
\quad \text{s.t.} \quad
\frac{Y^{(i)} \langle \theta, X^{(i)} \rangle}{\|\theta\|_2} \ge \delta, \quad i = 1, \ldots, n.
\]
Since the constraints depend only on the direction of $\theta$, we can normalize the vector so that $\|\theta\|_2 = 1/\delta$ and rewrite our problem as
\[
\min_{\theta \in \mathbb{R}^d} \ \frac{1}{2} \|\theta\|_2^2
\quad \text{s.t.} \quad
Y^{(i)} \langle \theta, X^{(i)} \rangle \ge 1, \quad i = 1, \ldots, n.
\tag{3.1}
\]
Because the objective function is quadratic and the constraints are affine, this optimization problem is called a quadratic program (QP). We refer to (3.1) as the primal problem. A point $\theta \in \mathbb{R}^d$ is feasible for the problem (3.1) if it satisfies all the constraints. Note that a feasible point exists if and only if the data are linearly separable. In the following derivation, we assume that the data are linearly separable.
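The notes do not include code, but problem (3.1) can be handed directly to a general-purpose QP solver. The following is a minimal sketch, assuming the cvxpy library and a small hypothetical linearly separable data set (neither appears in the original lecture); it simply transcribes the objective and constraints of (3.1).

```python
# Minimal sketch of the hard-margin primal QP (3.1), assuming cvxpy is available.
# The toy data set below is hypothetical and chosen only to be linearly separable.
import cvxpy as cp
import numpy as np

X = np.array([[2.0, 2.0], [1.5, 2.5], [2.5, 1.0],
              [-1.0, -2.0], [-2.0, -1.5], [-0.5, -2.5]])
y = np.array([1, 1, 1, -1, -1, -1])
n, d = X.shape

theta = cp.Variable(d)
# Constraints: Y^(i) <theta, X^(i)> >= 1 for i = 1, ..., n.
constraints = [cp.multiply(y, X @ theta) >= 1]
# Objective: (1/2) ||theta||_2^2.
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(theta)), constraints)
problem.solve()

theta_star = theta.value
margin = 1.0 / np.linalg.norm(theta_star)   # delta = 1 / ||theta*||_2 by the normalization above
print("theta* =", theta_star, "  margin =", margin)
```

Because the derivation normalizes $\|\theta\|_2 = 1/\delta$, the achieved margin can be read off as $1/\|\theta^*\|_2$.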
3.3.1 The dual formulation

Define the Lagrangian
\[
L(\theta, \alpha) = \frac{1}{2} \|\theta\|_2^2 + \sum_{i=1}^n \alpha_i \big(1 - Y^{(i)} \langle \theta, X^{(i)} \rangle\big),
\]
where $\alpha \ge 0$ (component-wise). The $\alpha_i$ are called the dual variables. If $\hat\theta$ is a feasible point for problem (3.1), then $1 - Y^{(i)} \langle \hat\theta, X^{(i)} \rangle \le 0$ for $i = 1, \ldots, n$. Hence,
\[
\sup_{\alpha \ge 0} L(\hat\theta, \alpha) =
\begin{cases}
\frac{1}{2} \|\hat\theta\|_2^2 & \text{if } \hat\theta \text{ is feasible}, \\
+\infty & \text{otherwise}.
\end{cases}
\tag{3.2}
\]
Therefore, computing
\[
p^* \triangleq \inf_{\theta \in \mathbb{R}^d} \sup_{\alpha \ge 0} L(\theta, \alpha)
\]
is the same as solving the primal problem (3.1).

By equation (3.2), the following relations hold for every feasible $\hat\theta \in \mathbb{R}^d$ and every $\alpha \ge 0$:
\[
\sup_{\alpha \ge 0} L(\hat\theta, \alpha) = \frac{1}{2} \|\hat\theta\|_2^2 \ \ge\ L(\hat\theta, \alpha) \ \ge\ \inf_{\theta \in \mathbb{R}^d} L(\theta, \alpha).
\tag{3.3}
\]
Since this holds for all feasible $\hat\theta$ and all $\alpha \ge 0$, we can take the infimum on the left and the supremum on the right to get
\[
p^* = \inf_{\theta \in \mathbb{R}^d} \sup_{\alpha \ge 0} L(\theta, \alpha) \ \ge\ \sup_{\alpha \ge 0} \inf_{\theta \in \mathbb{R}^d} L(\theta, \alpha) \triangleq q^*.
\tag{3.4}
\]
The optimization problem
\[
q^* = \sup_{\alpha \ge 0} \inf_{\theta \in \mathbb{R}^d} L(\theta, \alpha)
\tag{3.5}
\]
is known as the dual problem, and the inequality (3.4) is a result known as weak duality. Because the original problem (3.1) satisfies Slater's condition, we actually have strong duality (i.e., $p^* = q^*$).

The Lagrangian $L(\theta, \alpha)$ is a convex function of $\theta$, so the $\theta$ that minimizes $L(\theta, \alpha)$ is found by setting
\[
\frac{\partial L(\theta, \alpha)}{\partial \theta} = \theta + \sum_{i=1}^n \alpha_i \big(-Y^{(i)} X^{(i)}\big) = 0,
\tag{3.6}
\]
which yields $\theta = \sum_i \alpha_i Y^{(i)} X^{(i)}$. Substituting this value of $\theta$ into $L(\theta, \alpha)$ gives
\[
q(\alpha) \triangleq \inf_{\theta} L(\theta, \alpha) = -\frac{1}{2} \alpha^T (K \odot Y) \alpha + \langle \alpha, e \rangle,
\]
where $K, Y \in \mathbb{R}^{n \times n}$ are the Gram matrices given by $K_{ij} = \langle X^{(i)}, X^{(j)} \rangle$ and $Y_{ij} = Y^{(i)} Y^{(j)}$, and $e \in \mathbb{R}^n$ is the vector of ones. The symbol $\odot$ denotes the Hadamard (element-by-element) product.

3.3.2 Interpretation

Suppose that $(\theta^*, \alpha^*)$ is an optimum for the Lagrangian. From equations (3.2) and (3.6), an optimum point has the following properties:

1. $\theta^* = \sum_{i=1}^n \alpha_i^* Y^{(i)} X^{(i)}$.
2. If $\alpha_i^* > 0$, then $1 - Y^{(i)} \langle \theta^*, X^{(i)} \rangle = 0$.

The second condition is known as complementary slackness. Together, these two conditions tell us that an optimum $\theta^*$ depends only on the data points $(X^{(i)}, Y^{(i)})$ whose corresponding $\alpha_i^*$ are nonzero. The points $X^{(i)}$ with $\alpha_i^* > 0$ are called support vectors, and they are contained in the hyperplanes $\{X \in \mathbb{R}^d \mid \langle \theta^*, X \rangle = 1\}$ and $\{X \in \mathbb{R}^d \mid \langle \theta^*, X \rangle = -1\}$. This method of finding the maximum-margin classifier is called the support vector machine.

3.4 Extension

What if the data $D_n$ are not linearly separable? Observe that for an optimum $(\theta^*, \alpha^*)$,
\[
\langle \theta^*, X \rangle = \sum_{i=1}^n \alpha_i^* Y^{(i)} \langle X^{(i)}, X \rangle.
\]
Hence, we are mainly concerned with the inner product on $\mathcal{X}$. If we extend our notion of inner product, we can use the support vector machine ideas described above to classify data sets that are not linearly separable.

Consider the data set shown in Figure 3.1. Clearly, these data are not linearly separable. However, if we apply the map
\[
(X_1, X_2) \mapsto (1, X_1, X_2, X_1 X_2, X_1^2, X_2^2),
\]
and take our inner product in $\mathbb{R}^6$ rather than $\mathbb{R}^2$, we can use the techniques described above to determine a $\theta^* \in \mathbb{R}^6$ that correctly classifies the data (see Figure 3.2).

[Figure 3.1: A set of data in the $(X_1, X_2)$ plane that is not linearly separable (classes $Y = -1$ and $Y = +1$).]
[Figure 3.2: A classifier that separates the data.]

This intuition motivates the use of kernel methods, to which we turn in the next lectures.
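As a concrete illustration of Sections 3.3.1 and 3.4 together, here is a sketch that solves the dual $q(\alpha)$ after applying the quadratic feature map to an XOR-style data set. The data set, the tolerance used to flag support vectors, and the use of cvxpy are assumptions made for this example, not part of the lecture; the quadratic form $\frac{1}{2}\alpha^T (K \odot Y) \alpha$ is rewritten as $\frac{1}{2}\|\sum_i \alpha_i Y^{(i)} \phi(X^{(i)})\|_2^2$ so the solver accepts it directly.

```python
# Sketch: the dual of Section 3.3.1 applied after the feature map of Section 3.4.
# The data set and numerical tolerance are hypothetical, chosen for illustration.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 2))
y = np.sign(X[:, 0] * X[:, 1])          # XOR-style labels: not linearly separable in R^2

def phi(X):
    """Map (X1, X2) -> (1, X1, X2, X1*X2, X1^2, X2^2), as in Section 3.4."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones_like(x1), x1, x2, x1 * x2, x1**2, x2**2])

Phi = phi(X)                             # n x 6 design matrix in the lifted space
n = Phi.shape[0]

alpha = cp.Variable(n)
# theta(alpha) = sum_i alpha_i Y^(i) phi(X^(i)), so that
# (1/2) alpha^T (K ⊙ Y) alpha = (1/2) ||theta(alpha)||_2^2 with K_ij = <phi(X^(i)), phi(X^(j))>.
theta = Phi.T @ cp.multiply(alpha, y)
objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.sum_squares(theta))   # q(alpha)
problem = cp.Problem(objective, [alpha >= 0])
problem.solve()

alpha_star = alpha.value
theta_star = Phi.T @ (alpha_star * y)                # property 1 of Section 3.3.2
support = np.where(alpha_star > 1e-5)[0]             # alpha_i^* > 0: the support vectors
print("support vectors:", support)
print("training accuracy:", np.mean(np.sign(Phi @ theta_star) == y))
```

Note that both the dual objective and the recovered $\theta^*$ involve the data only through the inner products $\langle \phi(X^{(i)}), \phi(X^{(j)}) \rangle$, which is precisely the observation that leads to kernel methods in the next lectures.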

