STAT 241B / EECS 281B: Advanced Statistical Learning — Spring 2009
Lecture 7 — February 11
Lecturer: Martin Wainwright        Scribe: Vivek Ramamurthy

Note: These lecture notes are still rough and have only been mildly proofread.

7.1 Announcements

HW #2: due Monday, February 23.

7.2 Outline

• Mercer's characterization
• Kernel PCA (dimensionality reduction)

7.3 Mercer's characterization

Given a symmetric and positive semidefinite matrix $K \in \mathbb{R}^{d \times d}$, we know from standard linear algebra that there exist real scalars $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d \geq 0$ and vectors $\{\psi_i, i = 1, \ldots, d\}$ such that
\[
K = \sum_{i=1}^{d} \lambda_i \psi_i \psi_i^T.
\]
In this decomposition, the vectors $\{\psi_i\}$ are eigenvectors, obtained by solving the matrix-vector equation $K \psi = \lambda \psi$. Moreover, the $\{\psi_i\}$ can be chosen to form an orthonormal system of vectors.

We now discuss a generalization of this type of decomposition to the more general setting of linear operators on a Hilbert space. (A matrix is a special case of a linear operator on $\mathbb{R}^d$.)

Given a Hilbert space $\mathcal{H} = \{f : \mathcal{X} \to \mathbb{R}\}$ of functions, a linear operator $T : \mathcal{H} \to \mathcal{H}$ is a mapping such that
1. $\forall f \in \mathcal{H}$, $T(f) \in \mathcal{H}$;
2. $\forall f, g \in \mathcal{H}$, $T(f + g) = T(f) + T(g)$;
3. $\forall \alpha \in \mathbb{R}$, $T(\alpha f) = \alpha T(f)$.

7.3.1 Mercer's theorem (one variant)

Theorem 7.1. Suppose $\mathcal{X} \subseteq \mathbb{R}^d$ is compact, and $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is continuous and satisfies
\[
\int_y \int_x K^2(x, y)\, dx\, dy < +\infty,
\]
\[
\int_y \int_x f(x) K(x, y) f(y)\, dx\, dy \geq 0 \quad \forall f \in L^2(\mathcal{X}) \quad \text{(i.e., $K$ is a positive semidefinite kernel)},
\]
where $L^2(\mathcal{X}) = \{f : \int f^2(x)\, dx < +\infty\}$. Then there exist $\lambda_1 \geq \lambda_2 \geq \lambda_3 \geq \cdots$ (all non-negative) and functions $\{\psi_i(\cdot) \in L^2(\mathcal{X}), i = 1, 2, 3, \ldots\}$ such that
\[
K(x, y) = \sum_{i=1}^{\infty} \lambda_i \psi_i(x) \psi_i(y) \quad \forall x, y \in \mathcal{X}.
\]
Moreover, the $\{\psi_i\}$ form an orthonormal system in $L^2(\mathcal{X})$, meaning that
\[
\langle \psi_i, \psi_j \rangle_{L^2(\mathcal{X})} = \int \psi_i(x) \psi_j(x)\, dx =
\begin{cases}
1 & \text{if } i = j, \\
0 & \text{otherwise.}
\end{cases}
\]

Remarks: This can be seen as a generalization of the decomposition
\[
K(x, y) = x^T K y = \sum_{i=1}^{d} \lambda_i (\psi_i^T x)(\psi_i^T y)
\]
in the finite-dimensional setting. The orthogonality condition is a generalization of the fact that PSD matrices have an orthonormal set of eigenvectors.

Mercer's theorem is a special case of spectral decomposition theory for self-adjoint, positive operators on Hilbert spaces.

7.3.2 Use of Mercer's theorem

Eigenfunctions can be obtained by solving the integral equation
\[
T_K(f)(x) := \int K(x, y) f(y)\, dy = \lambda f(x).
\]
Here
\[
T_K(f)(\cdot) := \int K(\cdot, y) f(y)\, dy
\]
is a linear operator from $L^2(\mathcal{X})$ to $L^2(\mathcal{X})$. (Homework #2 has some instances of this procedure.)

We can then use the eigenfunctions thus obtained to generate a "feature map"
\[
\Phi : \mathcal{X} \to \ell^2(\mathbb{N}).
\]
Here, the feature map $\Phi$ maps data $x \in \mathcal{X}$ to a sequence $(a_1, a_2, \ldots) \in \ell^2(\mathbb{N})$, where
\[
\ell^2(\mathbb{N}) = \Big\{(a_1, a_2, \ldots) \;\Big|\; \sum_{i=1}^{\infty} a_i^2 < +\infty \Big\}.
\]
For example, consider the feature map defined by
\[
\Phi(x) = \big(\sqrt{\lambda_1}\, \psi_1(x), \sqrt{\lambda_2}\, \psi_2(x), \ldots, \sqrt{\lambda_i}\, \psi_i(x), \ldots\big).
\]
That is, we map each $x \in \mathcal{X}$ to a sequence $\Phi(x)$ in $\ell^2(\mathbb{N})$. Using Mercer's decomposition, if we take the inner product (in $\ell^2(\mathbb{N})$) between the two sequences $\Phi(x)$ and $\Phi(y)$, we recover the kernel function:
\[
\langle \Phi(x), \Phi(y) \rangle_{\ell^2(\mathbb{N})} = \sum_{i=1}^{\infty} \sqrt{\lambda_i}\, \psi_i(x)\, \sqrt{\lambda_i}\, \psi_i(y) = K(x, y).
\]
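As a concrete aside (not part of the original notes), the integral equation above can be approximated numerically by discretizing $T_K$ with a quadrature rule on a grid, a Nyström-type approximation. The following is a minimal sketch of that idea, assuming NumPy, a uniform grid on $\mathcal{X} = [0, 1]$, and the kernel $K(x, y) = \min(x, y)$ chosen purely as an example; the function and variable names (mercer_approx, grid, etc.) are illustrative.

    import numpy as np

    def mercer_approx(kernel, grid, num_components):
        # Discretize T_K(f)(x) = \int K(x, y) f(y) dy with a uniform-weight quadrature on `grid`.
        w = grid[1] - grid[0]                          # quadrature weight (uniform grid spacing)
        K = kernel(grid[:, None], grid[None, :])       # matrix of K(x_i, x_j)
        evals, evecs = np.linalg.eigh(w * K)           # eigenpairs of the discretized operator
        order = np.argsort(evals)[::-1][:num_components]
        lam = evals[order]                             # approximate Mercer eigenvalues lambda_i
        psi = evecs[:, order] / np.sqrt(w)             # rescale so sum_j psi_i(x_j)^2 * w = 1 (L2 normalization)
        return lam, psi

    # Example: the Brownian-motion kernel K(x, y) = min(x, y) on [0, 1].
    kernel = lambda x, y: np.minimum(x, y)
    grid = np.linspace(0.0, 1.0, 400)
    lam, psi = mercer_approx(kernel, grid, num_components=50)

    # Truncated feature map Phi(x) = (sqrt(lambda_1) psi_1(x), sqrt(lambda_2) psi_2(x), ...).
    Phi = psi * np.sqrt(lam)

    # <Phi(x), Phi(y)> should recover K(x, y) up to truncation/discretization error.
    err = np.max(np.abs(Phi @ Phi.T - kernel(grid[:, None], grid[None, :])))
    print(err)   # small, and shrinking as num_components grows and the grid is refined

Increasing num_components and refining the grid drives the reconstruction error toward zero, mirroring the infinite expansion in Theorem 7.1.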
7.4 Kernel PCA

7.4.1 Quick recap of classical PCA

Given data $X^{(1)}, \ldots, X^{(n)} \in \mathbb{R}^d$, we first compute the sample covariance (or correlation) matrix
\[
\widehat{\Sigma} = \frac{1}{n} \sum_{i=1}^{n} X^{(i)} [X^{(i)}]^T.
\]
Then we compute the eigenvectors corresponding to the largest $k \ll d$ eigenvalues. Using these eigenvectors, we project data $X \in \mathbb{R}^d$, a large space, into $\mathbb{R}^k$, a much smaller space. Thus, the primary motivation for PCA is achieving a large reduction in the dimensionality of the data.

To gain some intuition for PCA, consider an idealized "noisy subspace" generative model,
\[
x = V z + w,
\]
where $V \in \mathbb{R}^{d \times k}$ is fixed, while $z \in \mathbb{R}^k$ and $w \in \mathbb{R}^d$ are random. Furthermore, we assume that
\[
\mathbb{E}(z) = 0, \quad \mathrm{Cov}(z) = \alpha^2 I_{k \times k}, \qquad
\mathbb{E}(w) = 0, \quad \mathrm{Cov}(w) = \sigma^2 I_{d \times d}.
\]
Finally, we assume that $z$ and $w$ are independent. This gives us
\[
\mathrm{Cov}(x) = \Sigma = \alpha^2 V V^T + \sigma^2 I_{d \times d}.
\]
Now, we may think of $V$ as having $k$ orthonormal columns, i.e., $V = (V_1, \ldots, V_k)$. We then have
\[
\Sigma V_j = (\alpha^2 + \sigma^2) V_j,
\]
i.e., the eigenvectors corresponding to the top $k$ eigenvalues are $\{V_1, \ldots, V_k\}$. Moreover, for fixed $d$, we have
\[
\|\widehat{\Sigma}_n - \Sigma\|_2 = \max_{\|u\|_2 = 1} \|(\widehat{\Sigma}_n - \Sigma) u\|_2 \to 0 \quad \text{as } n \to +\infty,
\]
where $\|\cdot\|_2$ denotes the spectral radius (the maximum absolute value over all eigenvalues).

7.4.2 Kernel PCA (Schölkopf et al., 1997)

We once again consider an idealized model, this time in a feature space $\mathcal{F}$:
\[
\Phi(x) = \sum_{j=1}^{k} z_j \Phi_j + w, \qquad (7.1)
\]
where the $\Phi_j \in \mathcal{F}$, $j = 1, \ldots, k$, are fixed, while $z \in \mathbb{R}^k$ and $w \in \mathcal{F}$ are both random.

Example: Suppose we work with the feature map defined by the polynomial kernel $K(x, y) = (1 + \langle x, y \rangle)^m$ for $x \in \mathbb{R}^d$. In the special case $m = 2$ and $d = 2$, one feature map for this kernel is
\[
\Phi(x) = \big(1, \sqrt{2} x_1, \sqrt{2} x_2, \sqrt{2} x_1 x_2, x_1^2, x_2^2\big),
\]
so that
\[
\langle \Phi(x), \Phi(y) \rangle = 1 + 2 x_1 y_1 + 2 x_2 y_2 + 2 x_1 x_2 y_1 y_2 + x_1^2 y_1^2 + x_2^2 y_2^2 = (1 + x_1 y_1 + x_2 y_2)^2.
\]
One particular instance of the model (7.1) would be
\[
\big(1, \sqrt{2} x_1, \sqrt{2} x_2, \sqrt{2} x_1 x_2, x_1^2, x_2^2\big)^T = z_1 \Phi_1 + w.
\]
This models the data as lying near some quadratic surface, determined by the choice of $\Phi_1 \in \mathbb{R}^6$. For simplicity, let us assume that the generating vectors are orthonormal:
\[
\langle \Phi_i, \Phi_j \rangle_{\mathcal{F}} = 0 \quad \text{if } i \neq j.
\]
Now let us define the covariance operator associated with the random element $\Phi(x)$. For each $j$, we use $\Phi_j \otimes \Phi_j$ to denote the linear operator on $\mathcal{F}$ defined as follows: given some $f \in \mathcal{F}$, it outputs a new element $(\Phi_j \otimes \Phi_j)(f) \in \mathcal{F}$, given by
\[
(\Phi_j \otimes \Phi_j)(f) = \langle \Phi_j, f \rangle_{\mathcal{F}}\, \Phi_j.
\]
With this definition, the covariance operator is
\[
\mathrm{Cov}[\Phi(x)] = \sum_{j=1}^{k} \mathrm{Var}(z_j)\, (\Phi_j \otimes \Phi_j) + \mathbb{E}[w \otimes w].
\]
Since it is a linear combination of linear operators, it is also a linear operator on $\mathcal{F}$. In particular, for any $f \in \mathcal{F}$, this covariance operator outputs a new element of $\mathcal{F}$, given by
\[
\mathrm{Cov}[\Phi(x)](f) = \sum_{j=1}^{k} \mathrm{Var}(z_j)\, \langle \Phi_j, f \rangle_{\mathcal{F}}\, \Phi_j + \mathbb{E}[w \otimes w](f).
\]
At this point, the intuition underlying kernel PCA is the same as the intuition underlying PCA. That is, if we knew the functions $\Phi_j$, then given a new sample we could:
• map it to the feature space via $x \mapsto \Phi(x)$;
• compute its coordinates in the linear span of $\{\Phi_j\}$ by computing the projections $\langle \Phi(x), \Phi_j \rangle_{\mathcal{F}}$ for $j = 1, \ldots, k$.

In practice, we don't know the $\{\Phi_j\}$, but as with ordinary PCA, we can try to estimate them from the data.
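As a concrete aside (not part of the original notes), one standard sample-based recipe, following Schölkopf et al. (1997), estimates these directions via the eigendecomposition of the centered kernel Gram matrix. The sketch below is a minimal NumPy illustration of that recipe; the RBF kernel, the synthetic data, and all function and variable names are chosen for the example rather than taken from the notes.

    import numpy as np

    def rbf_kernel(A, B, gamma=1.0):
        # K(x, y) = exp(-gamma ||x - y||^2), a PSD kernel used here as an example.
        sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
        return np.exp(-gamma * sq)

    def kernel_pca_fit(X, k, kernel=rbf_kernel):
        n = X.shape[0]
        K = kernel(X, X)
        # Center the Gram matrix, i.e. work implicitly with Phi(x) minus its sample mean.
        one_n = np.ones((n, n)) / n
        Kc = K - one_n @ K - K @ one_n + one_n @ K @ one_n
        evals, evecs = np.linalg.eigh(Kc)
        order = np.argsort(evals)[::-1][:k]
        lam, alpha = evals[order], evecs[:, order]
        alpha = alpha / np.sqrt(lam)        # normalize so each estimated direction has unit norm in feature space
        return X, K, alpha, kernel

    def kernel_pca_project(model, x_new):
        X, K, alpha, kernel = model
        k_new = kernel(x_new[None, :], X).ravel()            # K(x_new, X^(i)) for each training point
        # Center k_new consistently with the training Gram matrix.
        k_cent = k_new - k_new.mean() - K.mean(axis=1) + K.mean()
        return k_cent @ alpha                                # coordinates along the top-k estimated directions

    # Usage: data lying near a circle (a nonlinear "subspace") in R^2.
    rng = np.random.default_rng(0)
    theta = rng.uniform(0.0, 2 * np.pi, 200)
    X = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * rng.standard_normal((200, 2))
    model = kernel_pca_fit(X, k=2)
    print(kernel_pca_project(model, X[0]))

The estimated directions are represented only implicitly, as combinations $\sum_i \alpha_{ij} \Phi(X^{(i)})$, so both fitting and projecting a new point require nothing beyond kernel evaluations.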

