CS281B/Stat241B: Advanced Topics in Learning & Decision Making
Regression, the SVD, and PCA
Lecturer: Michael I. Jordan    Scribe: Brian Vogel

1 Support Vector Regression

Before discussing support vector (SV) regression, let us first consider a few possible loss functions for the linear regression problem

$$y_i = w^T \phi(x_i) + \epsilon_i.$$

Least squares regression minimizes the quadratic loss:

$$\min_w \sum_{i=1}^n \epsilon_i^2.$$

Ridge regression adds a penalization term for the $L_2$ norm of the parameter vector $w$:

$$\min_w \sum_{i=1}^n \epsilon_i^2 + \lambda w^T w.$$

Lasso regression adds a penalization term for the $L_1$ norm of the parameter vector:

$$\min_w \sum_{i=1}^n \epsilon_i^2 + \lambda \sum_j |w_j|.$$

For SV regression, we will make use of the $\epsilon$-insensitive loss, which does not penalize errors below some $\epsilon \ge 0$. Figure 1 shows the $\epsilon$-insensitive loss function. We introduce slack variables $\xi_i$ and $\tilde\xi_i$ and obtain the following optimization problem:

$$\min_{w,b,\xi,\tilde\xi} \; \frac{1}{2} w^T w + C \sum_{i=1}^n (\xi_i + \tilde\xi_i)$$

such that

$$y_i - w^T\phi(x_i) - b \le \epsilon + \xi_i, \qquad w^T\phi(x_i) + b - y_i \le \epsilon + \tilde\xi_i, \qquad \xi_i \ge 0, \; \tilde\xi_i \ge 0.$$

Here, $C$ controls the amount of penalization for points lying outside the $\epsilon$-tube.

[Figure 1: The $\epsilon$-insensitive loss function for SV regression.]

1.1 Dual Problem

Writing the Lagrangian and eliminating the primal variables, we get the following dual optimization problem:

$$\max_{\alpha, \tilde\alpha} \; \sum_i (\tilde\alpha_i - \alpha_i) y_i - \epsilon \sum_i (\tilde\alpha_i + \alpha_i) - \frac{1}{2} \sum_{i,j} (\tilde\alpha_i - \alpha_i)(\tilde\alpha_j - \alpha_j) \langle \phi(x_i), \phi(x_j) \rangle$$

such that

$$0 \le \alpha_i, \tilde\alpha_i \le C, \qquad \sum_i (\tilde\alpha_i - \alpha_i) = 0.$$

Note that this is a quadratic program. Note also that it is kernelized, so we can use $k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$ instead of explicitly computing the inner product in the feature space.

1.2 KKT conditions

The complementary slackness conditions are

$$\alpha_i \big( y_i - \langle w, \phi(x_i) \rangle - b - \epsilon - \xi_i \big) = 0,$$
$$\tilde\alpha_i \big( \langle w, \phi(x_i) \rangle + b - y_i - \epsilon - \tilde\xi_i \big) = 0.$$

From the KKT conditions, we also have that

$$\xi_i \tilde\xi_i = 0, \qquad \alpha_i \tilde\alpha_i = 0,$$
$$(\alpha_i - C)\,\xi_i = 0, \qquad (\tilde\alpha_i - C)\,\tilde\xi_i = 0.$$

Note that if $\xi_i$ is strictly positive, then $\alpha_i = C$.

1.3 Kernel PCA

Question: Is finding eigenvectors a convex problem? Answer: No, but eigenvector problems can nonetheless be solved efficiently. PCA is an eigenvector problem.

We would like to perform PCA in the feature space. Recall that for PCA, we need to find the eigenvectors of $X^T X$.

1.3.1 The Singular Value Decomposition (SVD)

A symmetric matrix has real eigenvalues and its eigenvectors can be chosen to be orthonormal. Any symmetric matrix $A$ can be factorized as $A = M \Lambda M^T$, where $M$ is an orthogonal matrix containing the eigenvectors of $A$ as its columns and $\Lambda$ is the diagonal matrix containing the eigenvalues of $A$. To see this, right-multiply by $M$ to obtain

$$AM = M\Lambda, \qquad Ax = \lambda x,$$

where $x$ is a column of $M$ and $\lambda$ is the corresponding diagonal entry of $\Lambda$.

Now if $A$ also happens to be positive semidefinite, then all of its eigenvalues are nonnegative (i.e., $\Lambda \ge 0$).

Any $m \times n$ matrix $A$ can be factored into the singular value decomposition

$$A = U \Sigma V^T,$$

where the columns of the $m \times m$ orthogonal matrix $U$ are the eigenvectors of $AA^T$, and the columns of the $n \times n$ orthogonal matrix $V$ are the eigenvectors of $A^T A$. The $m \times n$ diagonal matrix $\Sigma$ has the singular values along its diagonal. The singular values $\sigma_i$ are all nonnegative, and are the square roots of the eigenvalues of both $AA^T$ and $A^T A$.

Consider the design matrix $X$, whose rows are the feature vectors $\phi(x_i)^T$, and let $X^T = U \Sigma V^T$. Recall that our kernel matrix is given by

$$K = X X^T = V \Sigma U^T U \Sigma V^T = V \Sigma^2 V^T,$$

so the columns of $V$ are the eigenvectors of $K$ and the elements of $\Sigma^2$ are the eigenvalues of $K$. What is the relationship between $U$ and $X^T X$?

$$C = X^T X = U \Sigma V^T V \Sigma U^T = U \Sigma^2 U^T,$$

so the columns of $U$ are the eigenvectors of $C$ and the diagonal entries of $\Sigma^2$ are the eigenvalues of $C$. Note that $K$ and $C$ share the same nonzero eigenvalues. We also have that $U\Sigma = X^T V$, so the $j$th column of $U$ can be expressed as

$$u_j = \lambda_j^{-1/2} X^T v_j = \sum_{i=1}^n \alpha_i^j \phi(x_i), \qquad \alpha^j = \lambda_j^{-1/2} v_j,$$

where $\lambda_j = \sigma_j^2$. So we only need to calculate the eigenvectors of one of these matrices. Note also that the eigenvectors of the correlation matrix can be expressed as a linear combination of the feature vectors $\phi(x_i)$.
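As a quick numerical sanity check of the identities above (an illustration added here, not part of the original lecture; it assumes numpy and uses a small random design matrix in place of actual feature vectors), the sketch below takes the SVD of $X^T$ and verifies that the columns of $V$ are eigenvectors of $K = XX^T$, that the columns of $U$ are eigenvectors of $C = X^T X$, and that $U\Sigma = X^T V$, i.e. $u_j = \lambda_j^{-1/2} X^T v_j$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 3                      # n data points, m features
X = rng.standard_normal((n, m))  # design matrix: rows play the role of phi(x_i)^T

# SVD of X^T, following the notes' convention: X^T = U Sigma V^T
U, s, Vt = np.linalg.svd(X.T, full_matrices=False)
V = Vt.T
lam = s**2                       # shared nonzero eigenvalues of K and C

K = X @ X.T                      # kernel (Gram) matrix, n x n
C = X.T @ X                      # "correlation" matrix, m x m

# Columns of V are eigenvectors of K; columns of U are eigenvectors of C.
print(np.allclose(K @ V, V * lam))   # K v_j = lambda_j v_j
print(np.allclose(C @ U, U * lam))   # C u_j = lambda_j u_j

# Each u_j is a linear combination of the feature vectors:
# U Sigma = X^T V, i.e. u_j = lambda_j^{-1/2} X^T v_j.
print(np.allclose(U * s, X.T @ V))
print(np.allclose(U, X.T @ V / np.sqrt(lam)))
```

Because $K$ and $C$ share their nonzero eigenvalues, it suffices to diagonalize whichever of the two matrices is smaller.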
2 Principal Components Analysis (PCA)

The idea of PCA is to find the direction such that the variance of the data projected onto that direction is maximized. Suppose that we are given centered data $x_i$, that is, $\sum_i x_i = 0$. If we were not given centered data, then the first step would be to subtract the mean.

We maximize the variance of the projected values:

$$\max_{\|w\|=1} \mathrm{Var}(w^T x)$$
$$\Leftrightarrow \max_{\|w\|=1} E\big(w^T x - w^T \mu\big)^2$$
$$\Leftrightarrow \max_{\|w\|=1} E\big[(w^T x - w^T \mu)(x^T w - \mu^T w)\big]$$
$$\Leftrightarrow \max_{\|w\|=1} w^T E\big[(x - \mu)(x - \mu)^T\big] w$$
$$\Leftrightarrow \max_{\|w\|=1} w^T \Sigma w,$$

where $\Sigma = E[(x-\mu)(x-\mu)^T]$ is the covariance matrix. This is a quadratic objective with the constraint $\|w\| = 1$. The eigenvector with the largest eigenvalue is the solution. To see this, write the Lagrangian:

$$L(w, \lambda) = w^T \Sigma w + \lambda (1 - w^T w),$$
$$\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; 2\Sigma w - 2\lambda w = 0 \;\Rightarrow\; \Sigma w = \lambda w.$$

So $w$ must be an eigenvector of $\Sigma$, and since the objective value at such a $w$ is $w^T \Sigma w = \lambda$, the eigenvector with the largest eigenvalue is the solution.
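To make the recipe concrete, here is a minimal sketch (again an added illustration, not the scribe's; it assumes numpy and synthetic anisotropic data): center the data, form the sample covariance $\Sigma$, and take the eigenvector with the largest eigenvalue as the maximizing direction $w$. It also checks that $w$ agrees with the top right singular vector of the centered design matrix, tying PCA back to the SVD discussion above.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 3
X = rng.standard_normal((n, d)) @ np.diag([3.0, 1.0, 0.3])  # rows are x_i^T, anisotropic

# Center the data (subtract the mean), as the notes require.
Xc = X - X.mean(axis=0)

# Sample covariance Sigma = (1/n) sum_i (x_i - mu)(x_i - mu)^T
Sigma = Xc.T @ Xc / n

# Solve Sigma w = lambda w and take the eigenvector with the largest eigenvalue.
eigvals, eigvecs = np.linalg.eigh(Sigma)
w = eigvecs[:, np.argmax(eigvals)]        # first principal direction, ||w|| = 1

# The variance of the projected values equals the top eigenvalue ...
print(np.isclose((Xc @ w).var(), eigvals.max()))

# ... and w matches the top right singular vector of Xc (up to sign),
# connecting PCA to the SVD of the design matrix.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
print(np.isclose(abs(w @ Vt[0]), 1.0))
```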

