CS281B/Stat241B: Advanced Topics in Learning & Decision Making
Regression, the SVD, and PCA
Lecturer: Michael I. Jordan    Scribe: Brian Vogel

1 Support Vector Regression

Before discussing support vector (SV) regression, let us first consider a few possible loss functions for the linear regression problem

$$y_i = w^T \phi(x_i) + \epsilon_i.$$

Least squares regression minimizes the quadratic loss:

$$\min_w \sum_{i=1}^n \epsilon_i^2.$$

Ridge regression adds a penalization term for the $L_2$ norm of the parameter vector $w$:

$$\min_w \sum_{i=1}^n \epsilon_i^2 + \lambda w^T w.$$

Lasso regression adds a penalization term for the $L_1$ norm of the parameter vector:

$$\min_w \sum_{i=1}^n \epsilon_i^2 + \lambda \sum_j |w_j|.$$

For SV regression, we will make use of the $\epsilon$-insensitive loss, which does not penalize errors below some $\epsilon \ge 0$. Figure 1 shows the $\epsilon$-insensitive loss function. We introduce slack variables $\xi_i$ and $\tilde\xi_i$ and obtain the following optimization problem:

$$\min_{w,b,\xi,\tilde\xi} \; \frac{1}{2} w^T w + C \sum_{i=1}^n (\xi_i + \tilde\xi_i)$$

such that

$$y_i - w^T\phi(x_i) - b \le \epsilon + \xi_i, \qquad w^T\phi(x_i) + b - y_i \le \epsilon + \tilde\xi_i, \qquad \xi_i \ge 0, \; \tilde\xi_i \ge 0.$$

Here, $C$ controls the amount of penalization for points lying outside the $\epsilon$-tube.

[Figure 1: The $\epsilon$-insensitive loss function for SV regression.]

1.1 Dual Problem

Writing the Lagrangian and eliminating the primal variables, we get the following dual optimization problem:

$$\max_{\alpha, \tilde\alpha} \; \sum_i (\tilde\alpha_i - \alpha_i) y_i - \epsilon \sum_i (\tilde\alpha_i + \alpha_i) - \frac{1}{2} \sum_{i,j} (\tilde\alpha_i - \alpha_i)(\tilde\alpha_j - \alpha_j) \langle \phi(x_i), \phi(x_j) \rangle$$

such that

$$0 \le \alpha_i, \tilde\alpha_i \le C, \qquad \sum_i (\tilde\alpha_i - \alpha_i) = 0.$$

Note that this is a quadratic program. Note also that it is kernelized, so we can use $k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$ instead of explicitly computing the inner product in the feature space.

1.2 KKT conditions

The complementary slackness conditions are

$$\alpha_i \big( y_i - \langle w, \phi(x_i) \rangle - b - \epsilon - \xi_i \big) = 0,$$
$$\tilde\alpha_i \big( \langle w, \phi(x_i) \rangle + b - y_i - \epsilon - \tilde\xi_i \big) = 0.$$

From the KKT conditions, we also have that

$$\xi_i \tilde\xi_i = 0, \qquad \alpha_i \tilde\alpha_i = 0,$$
$$(\alpha_i - C)\,\xi_i = 0, \qquad (\tilde\alpha_i - C)\,\tilde\xi_i = 0.$$

Note that if $\xi_i$ is strictly positive, then $\alpha_i = C$.

1.3 Kernel PCA

Question: Is finding eigenvectors a convex problem? Answer: No, but eigenvector problems can nonetheless be solved efficiently. PCA is an eigenvector problem.

We would like to perform PCA in the feature space. Recall that for PCA, we need to find the eigenvectors of $X^T X$.

1.3.1 The Singular Value Decomposition (SVD)

A symmetric matrix has real eigenvalues and its eigenvectors can be chosen to be orthonormal. Any symmetric matrix $A$ can be factorized as $A = M \Lambda M^T$, where $M$ is an orthogonal matrix containing the eigenvectors of $A$ as its columns and $\Lambda$ is the diagonal matrix containing the eigenvalues of $A$. To see this, right-multiply by $M$ to obtain

$$AM = M\Lambda, \qquad Ax = \lambda x,$$

where $x$ is a column of $M$ and $\lambda$ is the corresponding diagonal entry of $\Lambda$.

Now if $A$ also happens to be positive semidefinite, then all of its eigenvalues are nonnegative (i.e., $\Lambda \ge 0$).

Any $m \times n$ matrix $A$ can be factored into the singular value decomposition

$$A = U \Sigma V^T,$$

where the columns of the $m \times m$ orthogonal matrix $U$ are the eigenvectors of $AA^T$, and the columns of the $n \times n$ orthogonal matrix $V$ are the eigenvectors of $A^T A$. The $m \times n$ diagonal matrix $\Sigma$ has the singular values along its diagonal. The singular values $\sigma_i$ are all nonnegative, and are the square roots of the eigenvalues of both $AA^T$ and $A^T A$.

Consider the design matrix $X$, whose rows are the feature vectors $\phi(x_i)^T$, and let $X^T = U \Sigma V^T$. Recall that our kernel matrix is given by

$$K = X X^T = V \Sigma U^T U \Sigma V^T = V \Sigma^2 V^T,$$

so the columns of $V$ are the eigenvectors of $K$ and the elements of $\Sigma^2$ are the eigenvalues of $K$. What is the relationship between $U$ and $X^T X$?

$$C = X^T X = U \Sigma V^T V \Sigma U^T = U \Sigma^2 U^T,$$

so the columns of $U$ are the eigenvectors of $C$ and the diagonal entries of $\Sigma^2$ are the eigenvalues of $C$. Note that $K$ and $C$ share the same nonzero eigenvalues. We also have that $U\Sigma = X^T V$, so the $j$th column of $U$ can be expressed as

$$u_j = \lambda_j^{-1/2} X^T v_j = \sum_{i=1}^n \alpha_i^j \phi(x_i), \qquad \alpha^j = \lambda_j^{-1/2} v_j,$$

where $\lambda_j = \sigma_j^2$. So we only need to calculate the eigenvectors of one of these matrices. Note also that the eigenvectors of the correlation matrix can be expressed as a linear combination of the feature vectors $\phi(x_i)$.
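As a quick numerical sanity check of the identities above (an illustration added here, not part of the original lecture; it assumes numpy and uses a small random design matrix in place of actual feature vectors), the sketch below takes the SVD of $X^T$ and verifies that the columns of $V$ are eigenvectors of $K = XX^T$, that the columns of $U$ are eigenvectors of $C = X^T X$, and that $U\Sigma = X^T V$, i.e. $u_j = \lambda_j^{-1/2} X^T v_j$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 3                      # n data points, m features
X = rng.standard_normal((n, m))  # design matrix: rows play the role of phi(x_i)^T

# SVD of X^T, following the notes' convention: X^T = U Sigma V^T
U, s, Vt = np.linalg.svd(X.T, full_matrices=False)
V = Vt.T
lam = s**2                       # shared nonzero eigenvalues of K and C

K = X @ X.T                      # kernel (Gram) matrix, n x n
C = X.T @ X                      # "correlation" matrix, m x m

# Columns of V are eigenvectors of K; columns of U are eigenvectors of C.
print(np.allclose(K @ V, V * lam))   # K v_j = lambda_j v_j
print(np.allclose(C @ U, U * lam))   # C u_j = lambda_j u_j

# Each u_j is a linear combination of the feature vectors:
# U Sigma = X^T V, i.e. u_j = lambda_j^{-1/2} X^T v_j.
print(np.allclose(U * s, X.T @ V))
print(np.allclose(U, X.T @ V / np.sqrt(lam)))
```

Because $K$ and $C$ share their nonzero eigenvalues, it suffices to diagonalize whichever of the two matrices is smaller.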
2 Principal Components Analysis (PCA)

The idea of PCA is to find the direction such that the variance of the data projected onto that direction is maximized. Suppose that we are given centered data $x_i$, that is, $\sum_i x_i = 0$. If we were not given centered data, then the first step would be to subtract the mean.

We maximize the variance of the projected values:

$$\max_{\|w\|=1} \mathrm{Var}(w^T x)$$
$$\Leftrightarrow \max_{\|w\|=1} E\big(w^T x - w^T \mu\big)^2$$
$$\Leftrightarrow \max_{\|w\|=1} E\big[(w^T x - w^T \mu)(x^T w - \mu^T w)\big]$$
$$\Leftrightarrow \max_{\|w\|=1} w^T E\big[(x - \mu)(x - \mu)^T\big] w$$
$$\Leftrightarrow \max_{\|w\|=1} w^T \Sigma w,$$

where $\Sigma = E[(x-\mu)(x-\mu)^T]$ is the covariance matrix. This is a quadratic objective with the constraint $\|w\| = 1$. The eigenvector with the largest eigenvalue is the solution. To see this, write the Lagrangian:

$$L(w, \lambda) = w^T \Sigma w + \lambda (1 - w^T w),$$
$$\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; 2\Sigma w - 2\lambda w = 0 \;\Rightarrow\; \Sigma w = \lambda w.$$

So $w$ must be an eigenvector of $\Sigma$, and since the objective value at such a $w$ is $w^T \Sigma w = \lambda$, the eigenvector with the largest eigenvalue is the solution.
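To make the recipe concrete, here is a minimal sketch (again an added illustration, not the scribe's; it assumes numpy and synthetic anisotropic data): center the data, form the sample covariance $\Sigma$, and take the eigenvector with the largest eigenvalue as the maximizing direction $w$. It also checks that $w$ agrees with the top right singular vector of the centered design matrix, tying PCA back to the SVD discussion above.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 3
X = rng.standard_normal((n, d)) @ np.diag([3.0, 1.0, 0.3])  # rows are x_i^T, anisotropic

# Center the data (subtract the mean), as the notes require.
Xc = X - X.mean(axis=0)

# Sample covariance Sigma = (1/n) sum_i (x_i - mu)(x_i - mu)^T
Sigma = Xc.T @ Xc / n

# Solve Sigma w = lambda w and take the eigenvector with the largest eigenvalue.
eigvals, eigvecs = np.linalg.eigh(Sigma)
w = eigvecs[:, np.argmax(eigvals)]        # first principal direction, ||w|| = 1

# The variance of the projected values equals the top eigenvalue ...
print(np.isclose((Xc @ w).var(), eigvals.max()))

# ... and w matches the top right singular vector of Xc (up to sign),
# connecting PCA to the SVD of the design matrix.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
print(np.isclose(abs(w @ Vt[0]), 1.0))
```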

