EECS 281B / STAT 241B: Advanced Topics in Statistical Learning (Spring 2009)
Lecture 3 — January 28
Lecturer: Pradeep Ravikumar
Scribe: Timothy J. Wheeler

Note: These lecture notes are still rough and have only been mildly proofread.

3.1 Recap of last lecture

As in the previous lecture, we assume that $(X, Y) \sim P$, where $X$ takes values in $\mathcal{X} = \mathbb{R}^d$ and $Y$ takes values in $\mathcal{Y} = \{-1, 1\}$. Let $D_n = \{(X^{(i)}, Y^{(i)})\}_{i=1}^n$ be a set of $n$ i.i.d. samples from $P$. Each $\theta \in \mathbb{R}^d$ defines a function $f_\theta(x) = \langle \theta, x \rangle$, and we consider the decision rule $g(X) = \operatorname{sgn}(f_\theta(X))$.

Recall that the margin is defined as
\[
\delta(\theta, D_n) = \min_{1 \le i \le n} \frac{Y^{(i)} \langle \theta, X^{(i)} \rangle}{\|\theta\|_2},
\]
and the radius of the data set is defined as
\[
R(D_n) = \max_{1 \le i \le n} \|X^{(i)}\|_2.
\]
In the previous lecture, we proved that if $D_n$ is linearly separable, then the Perceptron Algorithm converges in $T = R^2/\delta^2$ steps.

3.2 Motivation for maximizing the margin

The 0-1 loss function is given by
\[
\ell(f_\theta(X), Y) =
\begin{cases}
1 & \text{if } \operatorname{sgn}(f_\theta(X)) \ne Y, \\
0 & \text{otherwise}.
\end{cases}
\]
Let $k$ be chosen uniformly at random from the set $\{0, 1, \ldots, n-1\}$, and define the truncated data set $D_{n,k} = \{(X^{(i)}, Y^{(i)})\}_{i=1}^k$. Use the Perceptron Algorithm to compute a classifier $f_{n,k}$ for the set $D_{n,k}$, and let $f_n$ denote the resulting classifier, which depends on the random variables $k$ and $D_n$.

To compute the risk of the classifier $f_n$, we take the expectation of the loss over the random variables $k$, $X$, $Y$, and $D_n$:
\[
\begin{aligned}
\mathbb{E}_k \mathbb{E}_{X,Y,D_n} \ell(f_n(X), Y)
&= \frac{1}{n} \sum_{k=0}^{n-1} \mathbb{E}_{X,Y,D_n} \ell(f_{n,k}(X), Y) \\
&= \frac{1}{n} \sum_{k=0}^{n-1} \mathbb{E}_{D_n} \ell\big(f_{n,k}(X^{(k+1)}), Y^{(k+1)}\big) \\
&\le \frac{1}{n} (\text{number of mistakes}) \\
&\le \frac{R^2}{n\delta^2}.
\end{aligned}
\]
Here, the first equality follows from the fact that $k$ is uniformly distributed. Note that $(X^{(k+1)}, Y^{(k+1)}) \sim P$, and the classifier $f_{n,k}$ is independent of the data $\{(X^{(i)}, Y^{(i)})\}_{i=k+1}^n$, which were not used in training; this gives the second equality. The last inequality follows from the result we proved last lecture: the total number of mistakes made by the Perceptron Algorithm is bounded by $R^2/\delta^2$. Thus, we conclude that, in this case, maximizing the margin minimizes this bound on the risk.

3.3 Max-margin as an optimization problem

Next, we cast the problem of maximizing the margin as an optimization problem:
\[
\max_{\delta \ge 0,\ \theta \in \mathbb{R}^d} \ \delta
\quad \text{s.t.} \quad
\frac{Y^{(i)} \langle \theta, X^{(i)} \rangle}{\|\theta\|_2} \ge \delta, \quad i = 1, \ldots, n.
\]
Since the constraints depend only on the direction of $\theta$, we can normalize the vector so that $\|\theta\|_2 = 1/\delta$ and rewrite our problem as
\[
\min_{\theta \in \mathbb{R}^d} \ \frac{1}{2} \|\theta\|_2^2
\quad \text{s.t.} \quad
Y^{(i)} \langle \theta, X^{(i)} \rangle \ge 1, \quad i = 1, \ldots, n.
\tag{3.1}
\]
Because the objective function is quadratic and the constraints are affine, this optimization problem is called a quadratic program (QP). We refer to (3.1) as the primal problem. A point $\theta \in \mathbb{R}^d$ is feasible for the problem (3.1) if it satisfies all the constraints. Note that a feasible point exists if and only if the data are linearly separable. In the following derivation, we assume that the data are linearly separable.
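The notes do not include code, but problem (3.1) can be handed directly to a general-purpose QP solver. The following is a minimal sketch, assuming the cvxpy library and a small hypothetical linearly separable data set (neither appears in the original lecture); it simply transcribes the objective and constraints of (3.1).

```python
# Minimal sketch of the hard-margin primal QP (3.1), assuming cvxpy is available.
# The toy data set below is hypothetical and chosen only to be linearly separable.
import cvxpy as cp
import numpy as np

X = np.array([[2.0, 2.0], [1.5, 2.5], [2.5, 1.0],
              [-1.0, -2.0], [-2.0, -1.5], [-0.5, -2.5]])
y = np.array([1, 1, 1, -1, -1, -1])
n, d = X.shape

theta = cp.Variable(d)
# Constraints: Y^(i) <theta, X^(i)> >= 1 for i = 1, ..., n.
constraints = [cp.multiply(y, X @ theta) >= 1]
# Objective: (1/2) ||theta||_2^2.
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(theta)), constraints)
problem.solve()

theta_star = theta.value
margin = 1.0 / np.linalg.norm(theta_star)   # delta = 1 / ||theta*||_2 by the normalization above
print("theta* =", theta_star, "  margin =", margin)
```

Because the derivation normalizes $\|\theta\|_2 = 1/\delta$, the achieved margin can be read off as $1/\|\theta^*\|_2$.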
3.3.1 The dual formulation

Define the Lagrangian
\[
L(\theta, \alpha) = \frac{1}{2} \|\theta\|_2^2 + \sum_{i=1}^n \alpha_i \big(1 - Y^{(i)} \langle \theta, X^{(i)} \rangle\big),
\]
where $\alpha \ge 0$ (component-wise). The $\alpha_i$ are called the dual variables. If $\hat\theta$ is a feasible point for problem (3.1), then $1 - Y^{(i)} \langle \hat\theta, X^{(i)} \rangle \le 0$ for $i = 1, \ldots, n$. Hence,
\[
\sup_{\alpha \ge 0} L(\hat\theta, \alpha) =
\begin{cases}
\frac{1}{2} \|\hat\theta\|_2^2 & \text{if } \hat\theta \text{ is feasible}, \\
+\infty & \text{otherwise}.
\end{cases}
\tag{3.2}
\]
Therefore, computing
\[
p^* \triangleq \inf_{\theta \in \mathbb{R}^d} \sup_{\alpha \ge 0} L(\theta, \alpha)
\]
is the same as solving the primal problem (3.1).

By equation (3.2), the following relations hold for every feasible $\hat\theta \in \mathbb{R}^d$ and every $\alpha \ge 0$:
\[
\sup_{\alpha \ge 0} L(\hat\theta, \alpha) = \frac{1}{2} \|\hat\theta\|_2^2 \ \ge\ L(\hat\theta, \alpha) \ \ge\ \inf_{\theta \in \mathbb{R}^d} L(\theta, \alpha).
\tag{3.3}
\]
Since this holds for all feasible $\hat\theta$ and all $\alpha \ge 0$, we can take the infimum on the left and the supremum on the right to get
\[
p^* = \inf_{\theta \in \mathbb{R}^d} \sup_{\alpha \ge 0} L(\theta, \alpha) \ \ge\ \sup_{\alpha \ge 0} \inf_{\theta \in \mathbb{R}^d} L(\theta, \alpha) \triangleq q^*.
\tag{3.4}
\]
The optimization problem
\[
q^* = \sup_{\alpha \ge 0} \inf_{\theta \in \mathbb{R}^d} L(\theta, \alpha)
\tag{3.5}
\]
is known as the dual problem, and the inequality (3.4) is a result known as weak duality. Because the original problem (3.1) satisfies Slater's condition, we actually have strong duality (i.e., $p^* = q^*$).

The Lagrangian $L(\theta, \alpha)$ is a convex function of $\theta$, so the $\theta$ that minimizes $L(\theta, \alpha)$ is found by setting
\[
\frac{\partial L(\theta, \alpha)}{\partial \theta} = \theta + \sum_{i=1}^n \alpha_i \big(-Y^{(i)} X^{(i)}\big) = 0,
\tag{3.6}
\]
which yields $\theta = \sum_i \alpha_i Y^{(i)} X^{(i)}$. Substituting this value of $\theta$ into $L(\theta, \alpha)$ gives
\[
q(\alpha) \triangleq \inf_{\theta} L(\theta, \alpha) = -\frac{1}{2} \alpha^T (K \odot Y) \alpha + \langle \alpha, e \rangle,
\]
where $K, Y \in \mathbb{R}^{n \times n}$ are the Gram matrices given by $K_{ij} = \langle X^{(i)}, X^{(j)} \rangle$ and $Y_{ij} = Y^{(i)} Y^{(j)}$, and $e \in \mathbb{R}^n$ is the vector of ones. The symbol $\odot$ denotes the Hadamard (element-by-element) product.

3.3.2 Interpretation

Suppose that $(\theta^*, \alpha^*)$ is an optimum for the Lagrangian. From equations (3.2) and (3.6), an optimum point has the following properties:

1. $\theta^* = \sum_{i=1}^n \alpha_i^* Y^{(i)} X^{(i)}$.
2. If $\alpha_i^* > 0$, then $1 - Y^{(i)} \langle \theta^*, X^{(i)} \rangle = 0$.

The second condition is known as complementary slackness. Together, these two conditions tell us that an optimum $\theta^*$ depends only on the data points $(X^{(i)}, Y^{(i)})$ whose corresponding $\alpha_i^*$ are nonzero. The points $X^{(i)}$ with $\alpha_i^* > 0$ are called support vectors, and they are contained in the hyperplanes $\{X \in \mathbb{R}^d \mid \langle \theta^*, X \rangle = 1\}$ and $\{X \in \mathbb{R}^d \mid \langle \theta^*, X \rangle = -1\}$. This method of finding the maximum-margin classifier is called the support vector machine.

3.4 Extension

What if the data $D_n$ are not linearly separable? Observe that for an optimum $(\theta^*, \alpha^*)$,
\[
\langle \theta^*, X \rangle = \sum_{i=1}^n \alpha_i^* Y^{(i)} \langle X^{(i)}, X \rangle.
\]
Hence, we are mainly concerned with the inner product on $\mathcal{X}$. If we extend our notion of inner product, we can use the support vector machine ideas described above to classify data sets that are not linearly separable.

Consider the data set shown in Figure 3.1. Clearly, these data are not linearly separable. However, if we apply the map
\[
(X_1, X_2) \mapsto (1, X_1, X_2, X_1 X_2, X_1^2, X_2^2),
\]
and take our inner product in $\mathbb{R}^6$ rather than $\mathbb{R}^2$, we can use the techniques described above to determine a $\theta^* \in \mathbb{R}^6$ that correctly classifies the data (see Figure 3.2).

[Figure 3.1: A set of data in the $(X_1, X_2)$ plane that is not linearly separable (classes $Y = -1$ and $Y = +1$).]
[Figure 3.2: A classifier that separates the data.]

This intuition motivates the use of kernel methods, to which we turn in the next lectures.
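As a concrete illustration of Sections 3.3.1 and 3.4 together, here is a sketch that solves the dual $q(\alpha)$ after applying the quadratic feature map to an XOR-style data set. The data set, the tolerance used to flag support vectors, and the use of cvxpy are assumptions made for this example, not part of the lecture; the quadratic form $\frac{1}{2}\alpha^T (K \odot Y) \alpha$ is rewritten as $\frac{1}{2}\|\sum_i \alpha_i Y^{(i)} \phi(X^{(i)})\|_2^2$ so the solver accepts it directly.

```python
# Sketch: the dual of Section 3.3.1 applied after the feature map of Section 3.4.
# The data set and numerical tolerance are hypothetical, chosen for illustration.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 2))
y = np.sign(X[:, 0] * X[:, 1])          # XOR-style labels: not linearly separable in R^2

def phi(X):
    """Map (X1, X2) -> (1, X1, X2, X1*X2, X1^2, X2^2), as in Section 3.4."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones_like(x1), x1, x2, x1 * x2, x1**2, x2**2])

Phi = phi(X)                             # n x 6 design matrix in the lifted space
n = Phi.shape[0]

alpha = cp.Variable(n)
# theta(alpha) = sum_i alpha_i Y^(i) phi(X^(i)), so that
# (1/2) alpha^T (K ⊙ Y) alpha = (1/2) ||theta(alpha)||_2^2 with K_ij = <phi(X^(i)), phi(X^(j))>.
theta = Phi.T @ cp.multiply(alpha, y)
objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.sum_squares(theta))   # q(alpha)
problem = cp.Problem(objective, [alpha >= 0])
problem.solve()

alpha_star = alpha.value
theta_star = Phi.T @ (alpha_star * y)                # property 1 of Section 3.3.2
support = np.where(alpha_star > 1e-5)[0]             # alpha_i^* > 0: the support vectors
print("support vectors:", support)
print("training accuracy:", np.mean(np.sign(Phi @ theta_star) == y))
```

Note that both the dual objective and the recovered $\theta^*$ involve the data only through the inner products $\langle \phi(X^{(i)}), \phi(X^{(j)}) \rangle$, which is precisely the observation that leads to kernel methods in the next lectures.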

