ICS 273A: Support Vector Machines                                      Winter 2008
Lecture 10 — February 11
Lecturer: Deva Ramanan                                           Scribe: Avi Mehta

Note: These lecture notes are rough and have only been mildly proofread.

10.1 Lagrangian duality

10.1.1 Recall

We are working with a dataset that is linearly separable. We want to maximize the minimum distance of the data points from the separator. Mathematically, this can be formulated as:

    \min \; \tfrac{1}{2}\|w\|^2                                                    (10.1)
    \text{s.t. } y^{(i)}(w^T x^{(i)} + b) \ge 1                                    (10.2)

These equations are of the form:

    \min \; f(w)                                                                   (10.3)
    \text{s.t. } g_i(w) \le 0                                                      (10.4)

In the last lecture, we claimed that equations 10.3 and 10.4 above can be rewritten in the form:

    \min_w \max_{\alpha \ge 0} L(w, \alpha), \quad \text{where}                    (10.5)
    L(w, \alpha) = f(w) + \sum_i \alpha_i g_i(w)                                   (10.6)

10.1.2 Weak versus strong duality

For all f(w) and g_i(w), the following holds:

    \max_{\alpha \ge 0} \min_w L(w, \alpha) \;\le\; \min_w \max_{\alpha \ge 0} L(w, \alpha)    (10.7)

The above property, called weak duality, was proven in the previous lecture notes. Under certain conditions on f(w) and g_i(w), equality holds and the problem is said to exhibit strong duality. If f(w) is convex and the g_i(w) are affine, then strong duality holds. Alternatively, if f(w) is convex, the g_i(w) are convex, and there exists at least one feasible w, then equality also holds.

10.2 Support Vector Machines

Applying the above equations to the case of support vector machines (where strong duality holds, so the min and max may be swapped), we have:

    \max_{\alpha \ge 0} \Bigl[ \min_{w,b} L(w, b, \alpha) \Bigr]                   (10.8)
    \text{where } L(w, b, \alpha) = \tfrac{1}{2}\|w\|^2 - \sum_i \alpha_i \bigl[ y^{(i)}(w^T x^{(i)} + b) - 1 \bigr]    (10.9)

In the above equation, the term inside the square brackets can be easily minimized over w and b for each value of α. Setting the derivatives to zero, we get:

    \frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_i \alpha_i y^{(i)} x^{(i)}      (10.10)
    \frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_i \alpha_i y^{(i)} = 0              (10.11)

We are now left with a maximization over all α ≥ 0. This maximization can be written as:

    \max_{\alpha \ge 0} \; \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y^{(i)} y^{(j)} {x^{(i)}}^T x^{(j)}    (10.12)
    \text{s.t. } \sum_i \alpha_i y^{(i)} = 0                                       (10.13)

10.2.1 KKT Conditions

For the i-th point in a given dataset, we have α_i g_i(w) = 0 at the optimum. This means either α_i = 0 or g_i(w) = 0, which implies the following conditions:

    \text{if } \alpha_i = 0 \text{ then } y^{(i)}(w^T x^{(i)} + b) \ge 1           (10.14)
    \text{if } \alpha_i > 0 \text{ then } y^{(i)}(w^T x^{(i)} + b) = 1             (10.15)

In the above formulation, the points with α_i > 0 are called support vectors because they correspond to the active constraints that determine the final decision boundary. Intuitively, these are points on the margin whose constraints are active at the final solution.

Consider a given dataset {x^{(i)}, y^{(i)}} and the w^* that maximizes the margin for that data. One could remove all the points strictly beyond the margin, y^{(i)}(w^T x^{(i)} + b) > 1, recompute the solution, and obtain the same w^*.

10.2.2 Observations

We can divide the set of all α into two parts corresponding to positive and negative training examples. Thus we have:

    w = \sum_i \alpha_i y^{(i)} x^{(i)}                                            (10.16)
      = \sum_{i \in \text{pos}} \alpha_i x^{(i)} - \sum_{i \in \text{neg}} \alpha_i x^{(i)}    (10.17)

The above implies that the final weight vector w lies in the span of the data points x^{(i)}. We can similarly split up the terms of the following constraint:

    \sum_i \alpha_i y^{(i)} = 0                                                    (10.18)
    \Rightarrow \sum_{i \in \text{pos}} \alpha_i = \sum_{i \in \text{neg}} \alpha_i             (10.19)

These equations imply that the influence of the points on the positive and the negative side of the decision boundary is equal.
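These dual properties can be checked numerically. The sketch below is not part of the lecture; it assumes NumPy, scikit-learn's SVC (a linear kernel with a large C as a stand-in for the hard-margin problem (10.1)-(10.2)), and a small synthetic linearly separable dataset of my own choosing. It checks that w is recovered from the support vectors alone (10.10), that Σ_i α_i y^(i) = 0 (10.11)/(10.19), and that the support vectors lie on the margin (10.15).

```python
# Numerical sketch (not from the notes): verify the dual-solution properties
# using scikit-learn's SVC as an approximate hard-margin solver (large C)
# on a synthetic, linearly separable dataset.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two well-separated spherical clusters, labels y in {-1, +1}.
X_pos = rng.normal(loc=[+2.0, +2.0], scale=0.5, size=(50, 2))
X_neg = rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(50, 2))
X = np.vstack([X_pos, X_neg])
y = np.hstack([np.ones(50), -np.ones(50)])

# A large C approximates the hard-margin problem (10.1)-(10.2).
clf = SVC(kernel="linear", C=1e6).fit(X, y)

# dual_coef_ stores alpha_i * y^(i) for the support vectors only.
alpha_y = clf.dual_coef_.ravel()   # shape: (n_support_vectors,)
sv = clf.support_vectors_          # the x^(i) with alpha_i > 0

# (10.10): w = sum_i alpha_i y^(i) x^(i), recovered from support vectors alone.
w_from_dual = alpha_y @ sv
print(np.allclose(w_from_dual, clf.coef_.ravel()))          # expect True

# (10.11)/(10.19): sum_i alpha_i y^(i) = 0, i.e. positive and negative
# support vectors carry equal total weight.
print(abs(alpha_y.sum()) < 1e-6)                            # expect True

# (10.15): support vectors lie on the margin, y^(i)(w^T x^(i) + b) ~ 1.
margins = y[clf.support_] * (sv @ clf.coef_.ravel() + clf.intercept_[0])
print(np.allclose(margins, 1.0, atol=1e-2))                 # expect True
```

Refitting on only the support vectors would, per Section 10.2.1, recover the same w.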
It is useful to relate an SVM to a class-conditional Gaussian model. Recall that, for spherically distributed classes, a class-conditional Gaussian model fits a decision boundary whose normal w is the line connecting the average class-0 point to the average class-1 point. An SVM does something similar, but computes a weighted average of points at the boundary. Specifically, the weights are given by the α_i and the boundary points are the support vectors.

10.2.3 Summarizing

We rewrite our optimization problem in dual form because:

1) We can examine useful properties of the final solution, such as sparsity and the balanced influence of positives and negatives.
2) We can exploit sparsity during the optimization. We will later discuss an algorithm (SMO) that does this.
3) We can apply kernel "tricks" to learn linear boundaries in very high dimensional feature spaces. We discuss this in the following section.

10.3 Kernels

Recall that we have:

    h(x) = w^T x + b                                                               (10.20)
         = \Bigl( \sum_i \alpha_i y^{(i)} x^{(i)} \Bigr)^T x + b                   (10.21)
         = \sum_i \alpha_i y^{(i)} \langle x^{(i)}, x \rangle + b, \quad \text{where } \langle x, z \rangle = x^T z.    (10.22)

Observation: To both train the model and use it, we don't need the actual data points. We just need to be able to compute inner products between vectors, \langle x^{(i)}, x^{(j)} \rangle.

Kernel trick: Replace \langle x, z \rangle with k(x, z), where k is a kernel function that corresponds to an inner product in some feature space, i.e., k(x, z) = \langle \phi(x), \phi(z) \rangle. For example, here is one valid kernel function:

    k(x, z) = (x^T z)^2                                                            (10.23)

What is the feature space that corresponds to this kernel? We can answer that by expanding the right-hand side (assuming x, z \in \mathbb{R}^2):

    (x^T z)^2 = (x_1 z_1 + x_2 z_2)^2                                              (10.24)
              = x_1^2 z_1^2 + 2 x_1 x_2 z_1 z_2 + x_2^2 z_2^2                      (10.25)
              = \begin{bmatrix} x_1^2 & \sqrt{2}\, x_1 x_2 & x_2^2 \end{bmatrix} \begin{bmatrix} z_1^2 \\ \sqrt{2}\, z_1 z_2 \\ z_2^2 \end{bmatrix}    (10.26)
              = \phi(x)^T \phi(z)                                                  (10.27)

A linear boundary in the feature space φ(x) can represent a quadratic boundary in the original space. For example, a circle can be represented as w^T φ(x) + b = 0. See figure (10.3).

We can also use kernels that map into a feature space φ(x) with a large, or even infinite, number of dimensions. Some common kernels are given below:

    k(x, z) = (1 + x^T z)^n                      \text{(polynomial kernel)}        (10.28)
    k(x, z) = e^{-\|x - z\|^2 / \sigma}          \text{(Gaussian kernel)}          (10.29)

The Gaussian kernel uses a feature space that is infinite dimensional. The best way to think about this is that, for any given dataset, the Gaussian kernel with an appropriate σ can find a linear separator. This is because it has an infinite choice of dimensions to split the data with.

A given function k(x, z) is a valid kernel if and only if it corresponds to an inner product in some space. This is equivalent to satisfying Mercer's conditions:

1. k is symmetric: k(x, z) = k(z, x).
2. For any set of points x^{(1)}, ..., x^{(m)}, the associated kernel matrix K, where K_{ij} = k(x^{(i)}, x^{(j)}), is positive semidefinite: v^T K v \ge 0 for any v.

One can show the second condition is necessary:

    v^T K v = \sum_i \sum_j v_i K_{ij} v_j                                         (10.30)
            = \sum_i \sum_j v_i \, \phi(x^{(i)})^T \phi(x^{(j)}) \, v_j            (10.31)
            = \Bigl( \sum_i v_i \phi(x^{(i)}) \Bigr)^T \Bigl( \sum_j v_j \phi(x^{(j)}) \Bigr)    (10.32)
            = \Bigl\| \sum_i v_i \phi(x^{(i)}) \Bigr\|^2 \ge 0                     (10.33)

We can define kernels on things that do not even live in a vector space. For example, consider strings:

    x_1 = "The boy ran home."
    x_2 = "Boys and girls."
    k(x_1, x_2) = number of substrings in common between x_1 and x_2.

We can show that this is a valid kernel, since it is a dot product between vectors of indicators for all possible strings.

Note: Many
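As a numerical aside (not from the notes), the sketch below checks the quadratic-kernel expansion (10.23)-(10.27) against the explicit feature map φ, and checks Mercer's second condition (positive semidefiniteness of the kernel matrix) for the Gaussian kernel (10.29). The random test points and the choice σ = 1 are illustrative assumptions.

```python
# Numerical sketch (not from the notes): the quadratic kernel k(x, z) = (x^T z)^2
# on R^2 matches the explicit inner product phi(x)^T phi(z) from (10.26), and a
# Gaussian kernel matrix is positive semidefinite, as Mercer's second condition requires.
import numpy as np

def k_quad(x, z):
    # Quadratic kernel, equation (10.23).
    return float(x @ z) ** 2

def phi(x):
    # Explicit feature map from (10.26): [x1^2, sqrt(2) x1 x2, x2^2].
    return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

rng = np.random.default_rng(0)
x, z = rng.normal(size=2), rng.normal(size=2)
print(np.isclose(k_quad(x, z), phi(x) @ phi(z)))    # expect True

def k_gauss(x, z, sigma=1.0):
    # Gaussian kernel, equation (10.29): exp(-||x - z||^2 / sigma).
    return np.exp(-np.sum((x - z) ** 2) / sigma)

# Mercer check: for any points x^(1), ..., x^(m), the matrix K_ij = k(x^(i), x^(j))
# should be positive semidefinite (all eigenvalues >= 0, up to round-off).
pts = rng.normal(size=(20, 2))
K = np.array([[k_gauss(a, b) for b in pts] for a in pts])
print(np.linalg.eigvalsh(K).min() >= -1e-10)        # expect True
```

The same eigenvalue check can be run on any candidate k(x, z), for example the substring-count string kernel above, to spot functions that violate Mercer's conditions on a particular set of points.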

