STAT 241B / EECS 281B: Advanced Statistical Learning — Spring 2009
Lecture 7 — February 11
Lecturer: Martin Wainwright        Scribe: Vivek Ramamurthy

Note: These lecture notes are still rough and have only been mildly proofread.

7.1 Announcements

HW #2: due Monday, February 23.

7.2 Outline

• Mercer's characterization
• Kernel PCA (dimensionality reduction)

7.3 Mercer's characterization

Given a symmetric and positive semidefinite matrix $K \in \mathbb{R}^{d \times d}$, we know from standard linear algebra that there exist real scalars $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d \geq 0$ and vectors $\{\psi_i, i = 1, \ldots, d\}$ such that
\[
K = \sum_{i=1}^{d} \lambda_i \psi_i \psi_i^T.
\]
In this decomposition, the vectors $\{\psi_i\}$ are eigenvectors, obtained by solving the matrix-vector equation $K \psi = \lambda \psi$. Moreover, the $\{\psi_i\}$ can be chosen to form an orthonormal system of vectors.

We now discuss a generalization of this type of decomposition to the more general setting of linear operators on a Hilbert space. (A matrix is a special case of a linear operator on $\mathbb{R}^d$.)

Given a Hilbert space $\mathcal{H} = \{f : \mathcal{X} \to \mathbb{R}\}$ of functions, a linear operator $T : \mathcal{H} \to \mathcal{H}$ is a mapping such that
1. $\forall f \in \mathcal{H}$, $T(f) \in \mathcal{H}$;
2. $\forall f, g \in \mathcal{H}$, $T(f + g) = T(f) + T(g)$;
3. $\forall \alpha \in \mathbb{R}$, $T(\alpha f) = \alpha T(f)$.

7.3.1 Mercer's theorem (one variant)

Theorem 7.1. Suppose $\mathcal{X} \subseteq \mathbb{R}^d$ is compact, and $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is continuous and satisfies
\[
\int_y \int_x K^2(x, y)\, dx\, dy < +\infty,
\]
\[
\int_y \int_x f(x) K(x, y) f(y)\, dx\, dy \geq 0 \quad \forall f \in L^2(\mathcal{X}) \quad \text{(i.e., $K$ is a positive semidefinite kernel)},
\]
where $L^2(\mathcal{X}) = \{f : \int f^2(x)\, dx < +\infty\}$. Then there exist $\lambda_1 \geq \lambda_2 \geq \lambda_3 \geq \cdots$ (all non-negative) and functions $\{\psi_i(\cdot) \in L^2(\mathcal{X}), i = 1, 2, 3, \ldots\}$ such that
\[
K(x, y) = \sum_{i=1}^{\infty} \lambda_i \psi_i(x) \psi_i(y) \quad \forall x, y \in \mathcal{X}.
\]
Moreover, the $\{\psi_i\}$ form an orthonormal system in $L^2(\mathcal{X})$, meaning that
\[
\langle \psi_i, \psi_j \rangle_{L^2(\mathcal{X})} = \int \psi_i(x) \psi_j(x)\, dx =
\begin{cases}
1 & \text{if } i = j, \\
0 & \text{otherwise.}
\end{cases}
\]

Remarks: This can be seen as a generalization of the decomposition
\[
K(x, y) = x^T K y = \sum_{i=1}^{d} \lambda_i (\psi_i^T x)(\psi_i^T y)
\]
in the finite-dimensional setting. The orthogonality condition is a generalization of the fact that PSD matrices have an orthonormal set of eigenvectors.

Mercer's theorem is a special case of spectral decomposition theory for self-adjoint, positive operators on Hilbert spaces.

7.3.2 Use of Mercer's theorem

Eigenfunctions can be obtained by solving the integral equation
\[
T_K(f)(x) := \int K(x, y) f(y)\, dy = \lambda f(x).
\]
Here
\[
T_K(f)(\cdot) := \int K(\cdot, y) f(y)\, dy
\]
is a linear operator from $L^2(\mathcal{X})$ to $L^2(\mathcal{X})$. (Homework #2 has some instances of this procedure.)

We can then use the eigenfunctions thus obtained to generate a "feature map"
\[
\Phi : \mathcal{X} \to \ell^2(\mathbb{N}).
\]
Here, the feature map $\Phi$ maps data $x \in \mathcal{X}$ to a sequence $(a_1, a_2, \ldots) \in \ell^2(\mathbb{N})$, where
\[
\ell^2(\mathbb{N}) = \Big\{(a_1, a_2, \ldots) \;\Big|\; \sum_{i=1}^{\infty} a_i^2 < +\infty \Big\}.
\]
For example, consider the feature map defined by
\[
\Phi(x) = \big(\sqrt{\lambda_1}\, \psi_1(x), \sqrt{\lambda_2}\, \psi_2(x), \ldots, \sqrt{\lambda_i}\, \psi_i(x), \ldots\big).
\]
That is, we map each $x \in \mathcal{X}$ to a sequence $\Phi(x)$ in $\ell^2(\mathbb{N})$. Using Mercer's decomposition, if we take the inner product (in $\ell^2(\mathbb{N})$) between the two sequences $\Phi(x)$ and $\Phi(y)$, we recover the kernel function:
\[
\langle \Phi(x), \Phi(y) \rangle_{\ell^2(\mathbb{N})} = \sum_{i=1}^{\infty} \sqrt{\lambda_i}\, \psi_i(x)\, \sqrt{\lambda_i}\, \psi_i(y) = K(x, y).
\]
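As a concrete aside (not part of the original notes), the integral equation above can be approximated numerically by discretizing $T_K$ with a quadrature rule on a grid, a Nyström-type approximation. The following is a minimal sketch of that idea, assuming NumPy, a uniform grid on $\mathcal{X} = [0, 1]$, and the kernel $K(x, y) = \min(x, y)$ chosen purely as an example; the function and variable names (mercer_approx, grid, etc.) are illustrative.

    import numpy as np

    def mercer_approx(kernel, grid, num_components):
        # Discretize T_K(f)(x) = \int K(x, y) f(y) dy with a uniform-weight quadrature on `grid`.
        w = grid[1] - grid[0]                          # quadrature weight (uniform grid spacing)
        K = kernel(grid[:, None], grid[None, :])       # matrix of K(x_i, x_j)
        evals, evecs = np.linalg.eigh(w * K)           # eigenpairs of the discretized operator
        order = np.argsort(evals)[::-1][:num_components]
        lam = evals[order]                             # approximate Mercer eigenvalues lambda_i
        psi = evecs[:, order] / np.sqrt(w)             # rescale so sum_j psi_i(x_j)^2 * w = 1 (L2 normalization)
        return lam, psi

    # Example: the Brownian-motion kernel K(x, y) = min(x, y) on [0, 1].
    kernel = lambda x, y: np.minimum(x, y)
    grid = np.linspace(0.0, 1.0, 400)
    lam, psi = mercer_approx(kernel, grid, num_components=50)

    # Truncated feature map Phi(x) = (sqrt(lambda_1) psi_1(x), sqrt(lambda_2) psi_2(x), ...).
    Phi = psi * np.sqrt(lam)

    # <Phi(x), Phi(y)> should recover K(x, y) up to truncation/discretization error.
    err = np.max(np.abs(Phi @ Phi.T - kernel(grid[:, None], grid[None, :])))
    print(err)   # small, and shrinking as num_components grows and the grid is refined

Increasing num_components and refining the grid drives the reconstruction error toward zero, mirroring the infinite expansion in Theorem 7.1.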
7.4 Kernel PCA

7.4.1 Quick recap of classical PCA

Given data $X^{(1)}, \ldots, X^{(n)} \in \mathbb{R}^d$, we first compute the sample covariance (or correlation) matrix
\[
\widehat{\Sigma} = \frac{1}{n} \sum_{i=1}^{n} X^{(i)} [X^{(i)}]^T.
\]
Then we compute the eigenvectors corresponding to the largest $k \ll d$ eigenvalues. Using these eigenvectors, we project data $X \in \mathbb{R}^d$, a large space, into $\mathbb{R}^k$, a much smaller space. Thus, the primary motivation for PCA is achieving a large reduction in the dimensionality of the data.

To gain some intuition for PCA, consider an idealized "noisy subspace" generative model,
\[
x = V z + w,
\]
where $V \in \mathbb{R}^{d \times k}$ is fixed, while $z \in \mathbb{R}^k$ and $w \in \mathbb{R}^d$ are random. Furthermore, we assume that
\[
\mathbb{E}(z) = 0, \quad \mathrm{Cov}(z) = \alpha^2 I_{k \times k}, \qquad
\mathbb{E}(w) = 0, \quad \mathrm{Cov}(w) = \sigma^2 I_{d \times d}.
\]
Finally, we assume that $z$ and $w$ are independent. This gives us
\[
\mathrm{Cov}(x) = \Sigma = \alpha^2 V V^T + \sigma^2 I_{d \times d}.
\]
Now, we may think of $V$ as having $k$ orthonormal columns, i.e., $V = (V_1, \ldots, V_k)$. We then have
\[
\Sigma V_j = (\alpha^2 + \sigma^2) V_j,
\]
i.e., the eigenvectors corresponding to the top $k$ eigenvalues are $\{V_1, \ldots, V_k\}$. Moreover, for fixed $d$, we have
\[
\|\widehat{\Sigma}_n - \Sigma\|_2 = \max_{\|u\|_2 = 1} \|(\widehat{\Sigma}_n - \Sigma) u\|_2 \to 0 \quad \text{as } n \to +\infty,
\]
where $\|\cdot\|_2$ denotes the spectral radius (the maximum absolute value over all eigenvalues).

7.4.2 Kernel PCA (Schölkopf et al., 1997)

We once again consider an idealized model, this time in a feature space $\mathcal{F}$:
\[
\Phi(x) = \sum_{j=1}^{k} z_j \Phi_j + w, \qquad (7.1)
\]
where the $\Phi_j \in \mathcal{F}$, $j = 1, \ldots, k$, are fixed, while $z \in \mathbb{R}^k$ and $w \in \mathcal{F}$ are both random.

Example: Suppose we work with the feature map defined by the polynomial kernel $K(x, y) = (1 + \langle x, y \rangle)^m$ for $x \in \mathbb{R}^d$. In the special case $m = 2$ and $d = 2$, one feature map for this kernel is
\[
\Phi(x) = \big(1, \sqrt{2} x_1, \sqrt{2} x_2, \sqrt{2} x_1 x_2, x_1^2, x_2^2\big),
\]
so that
\[
\langle \Phi(x), \Phi(y) \rangle = 1 + 2 x_1 y_1 + 2 x_2 y_2 + 2 x_1 x_2 y_1 y_2 + x_1^2 y_1^2 + x_2^2 y_2^2 = (1 + x_1 y_1 + x_2 y_2)^2.
\]
One particular instance of the model (7.1) would be
\[
\big(1, \sqrt{2} x_1, \sqrt{2} x_2, \sqrt{2} x_1 x_2, x_1^2, x_2^2\big)^T = z_1 \Phi_1 + w.
\]
This models the data as lying near some quadratic surface, determined by the choice of $\Phi_1 \in \mathbb{R}^6$. For simplicity, let us assume that the generating vectors are orthonormal:
\[
\langle \Phi_i, \Phi_j \rangle_{\mathcal{F}} = 0 \quad \text{if } i \neq j.
\]
Now let us define the covariance operator associated with the random element $\Phi(x)$. For each $j$, we use $\Phi_j \otimes \Phi_j$ to denote the linear operator on $\mathcal{F}$ defined as follows: given some $f \in \mathcal{F}$, it outputs a new element $(\Phi_j \otimes \Phi_j)(f) \in \mathcal{F}$, given by
\[
(\Phi_j \otimes \Phi_j)(f) = \langle \Phi_j, f \rangle_{\mathcal{F}}\, \Phi_j.
\]
With this definition, the covariance operator is
\[
\mathrm{Cov}[\Phi(x)] = \sum_{j=1}^{k} \mathrm{Var}(z_j)\, (\Phi_j \otimes \Phi_j) + \mathbb{E}[w \otimes w].
\]
Since it is a linear combination of linear operators, it is also a linear operator on $\mathcal{F}$. In particular, for any $f \in \mathcal{F}$, this covariance operator outputs a new element of $\mathcal{F}$, given by
\[
\mathrm{Cov}[\Phi(x)](f) = \sum_{j=1}^{k} \mathrm{Var}(z_j)\, \langle \Phi_j, f \rangle_{\mathcal{F}}\, \Phi_j + \mathbb{E}[w \otimes w](f).
\]
At this point, the intuition underlying kernel PCA is the same as the intuition underlying PCA. That is, if we knew the functions $\Phi_j$, then given a new sample we could:
• map it to the feature space via $x \mapsto \Phi(x)$;
• compute its coordinates in the linear span of $\{\Phi_j\}$ by computing the projections $\langle \Phi(x), \Phi_j \rangle_{\mathcal{F}}$ for $j = 1, \ldots, k$.

In practice, we don't know the $\{\Phi_j\}$, but as with ordinary PCA, we can try to estimate them from the data.
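As a concrete aside (not part of the original notes), one standard sample-based recipe, following Schölkopf et al. (1997), estimates these directions via the eigendecomposition of the centered kernel Gram matrix. The sketch below is a minimal NumPy illustration of that recipe; the RBF kernel, the synthetic data, and all function and variable names are chosen for the example rather than taken from the notes.

    import numpy as np

    def rbf_kernel(A, B, gamma=1.0):
        # K(x, y) = exp(-gamma ||x - y||^2), a PSD kernel used here as an example.
        sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
        return np.exp(-gamma * sq)

    def kernel_pca_fit(X, k, kernel=rbf_kernel):
        n = X.shape[0]
        K = kernel(X, X)
        # Center the Gram matrix, i.e. work implicitly with Phi(x) minus its sample mean.
        one_n = np.ones((n, n)) / n
        Kc = K - one_n @ K - K @ one_n + one_n @ K @ one_n
        evals, evecs = np.linalg.eigh(Kc)
        order = np.argsort(evals)[::-1][:k]
        lam, alpha = evals[order], evecs[:, order]
        alpha = alpha / np.sqrt(lam)        # normalize so each estimated direction has unit norm in feature space
        return X, K, alpha, kernel

    def kernel_pca_project(model, x_new):
        X, K, alpha, kernel = model
        k_new = kernel(x_new[None, :], X).ravel()            # K(x_new, X^(i)) for each training point
        # Center k_new consistently with the training Gram matrix.
        k_cent = k_new - k_new.mean() - K.mean(axis=1) + K.mean()
        return k_cent @ alpha                                # coordinates along the top-k estimated directions

    # Usage: data lying near a circle (a nonlinear "subspace") in R^2.
    rng = np.random.default_rng(0)
    theta = rng.uniform(0.0, 2 * np.pi, 200)
    X = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * rng.standard_normal((200, 2))
    model = kernel_pca_fit(X, k=2)
    print(kernel_pca_project(model, X[0]))

The estimated directions are represented only implicitly, as combinations $\sum_i \alpha_{ij} \Phi(X^{(i)})$, so both fitting and projecting a new point require nothing beyond kernel evaluations.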

