Statistical Properties of Kernel Principal Component Analysis

Gilles Blanchard (1) & Olivier Bousquet (2) & Laurent Zwald (3)

(1) Fraunhofer FIRST (IDA), Kékuléstr. 7, D-12489 Berlin, Germany.
(2) Pertinence, France.
(3) Département de Mathématiques, Université Paris-Sud, Bât. 425, F-91405, France.

Abstract

The main goal of this paper is to prove inequalities on the reconstruction error for Kernel Principal Component Analysis. With respect to previous work on this topic, our contribution is twofold: (1) we give bounds that explicitly take into account the empirical centering step in this algorithm, and (2) we show that a "localized" approach yields more accurate bounds. In particular, we show faster rates of convergence towards the minimum reconstruction error; more precisely, we prove that the convergence rate can typically be faster than n^{-1/2}. We also obtain a new relative bound on the error.

A secondary goal, for which we present similar contributions, is to obtain convergence bounds for the partial sums of the largest or smallest eigenvalues of the kernel Gram matrix towards the eigenvalues of the corresponding kernel operator. These quantities are naturally linked to the KPCA procedure; furthermore, these results have applications to the study of various other kernel algorithms.

The results are presented in a functional-analytic framework, which is suited to dealing rigorously with reproducing kernel Hilbert spaces of infinite dimension.

1 Introduction

1.1 Goals of this paper

The main focus of this work is Principal Component Analysis (PCA) and its "kernelized" variant, Kernel PCA (KPCA). PCA is a linear projection method whose output is a sequence of nested linear subspaces adapted to the data at hand. It is a widely used preprocessing method with diverse applications, ranging from dimensionality reduction to denoising. Various extensions of PCA have been explored; applying PCA to a space of functions rather than a space of vectors was first proposed by Besse (1979) (see also the survey of Ramsay and Dalzell, 1991). Kernel PCA (Schölkopf, Smola, and Müller, 1999) is an instance of such a method which has boosted interest in PCA, as it overcomes the limitations of linear PCA in a very elegant manner by mapping the data to a high-dimensional feature space.

For any fixed d, PCA finds the linear subspace of dimension d such that the data linearly projected onto it have maximum variance. This is obtained by performing an eigendecomposition of the empirical covariance matrix and taking the span of the eigenvectors corresponding to the leading eigenvalues (a code sketch is given below). This sets the eigendecomposition of the true covariance matrix as a natural "idealized" goal of PCA and raises the question of the relationship between this goal and what is obtained empirically. However, despite PCA being a relatively old and commonly used technique, little has been done on analyzing its statistical performance. Most of the previous work has focused on the asymptotic behavior of empirical covariance matrices of Gaussian vectors (e.g., Anderson, 1963). Asymptotic results for PCA have been obtained by Dauxois and Pousse (1976), and by Besse (1991) in the case of PCA in a Hilbert space.
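To make the procedure described above concrete, the following is a minimal NumPy sketch of empirical PCA: an eigendecomposition of the empirical covariance matrix, followed by projection onto the span of the leading eigenvectors. The function names are illustrative, and the reconstruction error computed here is the usual empirical one, written out for concreteness rather than as the exact quantity bounded in the paper.

    import numpy as np

    def pca_subspace(X, d):
        # X: (n, p) array of n observations in rows; d: target dimension.
        # Returns an orthonormal (p, d) basis spanning the leading d
        # eigenvectors of the empirical covariance matrix.
        Xc = X - X.mean(axis=0)             # empirical centering step
        cov = Xc.T @ Xc / X.shape[0]        # empirical covariance matrix
        evals, evecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
        return evecs[:, np.argsort(evals)[::-1][:d]]

    def empirical_reconstruction_error(X, V):
        # Mean squared distance of the centered data to the span of V.
        Xc = X - X.mean(axis=0)
        residual = Xc - Xc @ V @ V.T        # component orthogonal to span(V)
        return float(np.mean(np.sum(residual ** 2, axis=1)))

Dividing by n rather than n - 1 is a convention choice (plain empirical averages); either scaling leaves the eigenvectors, and hence the selected subspace, unchanged.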
There is, furthermore, an intimate connection between the covariance operator and the Gram matrix of the data, and in particular between their spectra. In the case of KPCA, this is a crucial point at two different levels. From a practical point of view, this connection reduces the eigendecomposition of the (infinite-dimensional) empirical kernel covariance operator to the eigendecomposition of the kernel Gram matrix, which makes the algorithm feasible. From a theoretical point of view, it provides a bridge between the spectral properties of the kernel covariance and those of the so-called kernel integral operator.

Theoretical insight into the properties of Kernel PCA therefore reaches beyond this particular algorithm alone: it has direct consequences for understanding the spectral properties of the kernel matrix and the kernel operator. This makes a theoretical study of Kernel PCA all the more interesting: the kernel Gram matrix is a central object in all kernel-based methods, and its spectrum often plays an important role in the study of various kernel algorithms; this has been shown in particular in the case of Support Vector Machines (Williamson, Smola, and Schölkopf, 2001). Understanding the behavior of the eigenvalues of kernel matrices, their stability, and how they relate to the eigenvalues of the corresponding kernel integral operator is thus crucial for understanding the statistical properties of kernel-based algorithms.

Asymptotic convergence and central limit theorems for the estimation of the eigenspectrum of an integral operator by the spectrum of its empirical counterpart have been obtained by Koltchinskii and Giné (2000). Recent work of Shawe-Taylor, Williams, Cristianini, and Kandola (2002, 2005) (see also the related work of Braun, 2005) has put forward a finite-sample analysis of the properties of the eigenvalues of kernel matrices and related it to the statistical performance of Kernel PCA. Our goal in the present work is mainly to extend the latter results in two different directions:

• In practice, for PCA or KPCA, an (empirical) recentering of the data is generally performed. This is because PCA is viewed as a technique for analyzing the variance of the data; it is often desirable to treat the mean independently as a preliminary step (although, arguably, it is also feasible to perform PCA on uncentered data). This centering step was not considered in the previous work cited above; we take it into account explicitly and show that it leads to comparable convergence properties (the empirical centering of the Gram matrix is sketched in code below).

• To control the estimation error, Shawe-Taylor et al. (2002, 2005) use what we would call a global approach, which typically leads to convergence rates of order n^{-1/2}. Numerous recent theoretical works on M-estimation have shown that improved rates can be obtained by using a so-called local approach, which, very coarsely speaking, consists in taking the variance of the estimation error precisely into account. We refer the reader to the works of Massart (2000), Bartlett, Bousquet, and Mendelson (2005), Bartlett, Jordan, and McAuliffe (2003), and Koltchinskii (2004), among others. Applying this principle to the analysis of PCA, we show that it leads to improved bounds.

Note that we consider these two types of extension separately, not simultaneously. While we believe it
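The Gram-matrix reduction and the empirical centering step discussed above can be made concrete in a short NumPy sketch. This is an illustration under standard conventions, not the paper's exact estimator: the function names, the Gaussian kernel and its bandwidth, and the small eigenvalue floor are assumptions made for the example; the matrix H = I - (1/n) 1 1^T is the usual empirical centering of the Gram matrix.

    import numpy as np

    def kpca(X, d, kernel):
        # Kernel PCA via the Gram matrix, including the empirical
        # centering step. Returns the top-d eigenvalues of the centered
        # Gram matrix divided by n (empirical analogues of the leading
        # eigenvalues of the kernel covariance operator) together with
        # coefficient vectors alpha: the i-th principal function is
        # f_i(x) = sum_j alpha[j, i] * kernel(X[j], x).
        n = X.shape[0]
        K = np.array([[kernel(a, b) for b in X] for a in X])  # Gram matrix
        H = np.eye(n) - np.ones((n, n)) / n  # centering: H = I - (1/n) 1 1^T
        Kc = H @ K @ H                       # doubly centered Gram matrix
        evals, evecs = np.linalg.eigh(Kc)
        top = np.argsort(evals)[::-1][:d]
        lam, U = evals[top], evecs[:, top]
        alpha = U / np.sqrt(np.maximum(lam, 1e-12))  # unit norm in feature space
        return lam / n, alpha

    # A Gaussian (RBF) kernel with an arbitrary, illustrative bandwidth:
    def rbf(a, b, sigma=1.0):
        return np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2))

Eigendecomposing the n x n centered Gram matrix sidesteps the infinite-dimensional covariance operator entirely, which is precisely the practical point made above.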

