## Introduction to Machine Learning
### Lecture Slides for Chapter 6: Dimensionality Reduction

Ethem Alpaydın, © 2004 The MIT Press (V1.0)
[email protected] · http://www.cmpe.boun.edu.tr/~ethem/i2ml

### Why Reduce Dimensionality?

1. Reduces time complexity: less computation
2. Reduces space complexity: fewer parameters
3. Saves the cost of observing the feature
4. Simpler models are more robust on small datasets
5. More interpretable; simpler explanation
6. Data visualization (structure, groups, outliers, etc.) if plotted in 2 or 3 dimensions

### Feature Selection vs Extraction

- **Feature selection:** choose the $k < d$ most important features and ignore the remaining $d - k$. Subset selection algorithms work this way.
- **Feature extraction:** project the original dimensions $x_i$, $i = 1, \ldots, d$, to new $k < d$ dimensions $z_j$, $j = 1, \ldots, k$. Examples: principal components analysis (PCA), linear discriminant analysis (LDA), factor analysis (FA).

### Subset Selection

There are $2^d$ possible subsets of $d$ features, so exhaustive search is infeasible and greedy searches are used instead:

- **Forward search:** add the best feature at each step. The feature set $F$ is initially $\emptyset$. At each iteration, find the best new feature
  $$j = \arg\min_i E(F \cup x_i)$$
  and add $x_j$ to $F$ if $E(F \cup x_j) < E(F)$. This is a hill-climbing algorithm with $O(d^2)$ evaluations of the error $E$ in the worst case; a sketch follows this list.
- **Backward search:** start with all features and remove one at a time, if possible.
- **Floating search:** add $k$, remove $l$.
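The slides name the procedure without code; below is a minimal Python sketch of the forward-search loop. The error function `E` is a hypothetical caller-supplied black box (for instance, the validation error of a model trained on the candidate subset); it is not specified in the slides.

```python
def forward_search(d, E):
    """Greedy forward search over d features.

    E(F) is the error of a model that uses feature subset F (lower is better).
    Start from F = {} and repeatedly add the argmin-error feature while it
    improves E(F): hill climbing, O(d^2) evaluations of E in the worst case.
    """
    F = []                                   # selected features, initially empty
    best = float("inf")                      # treat E(empty set) as +infinity
    while len(F) < d:
        # j = argmin_i E(F ∪ {x_i}) over the features not yet selected
        err, j = min((E(F + [i]), i) for i in range(d) if i not in F)
        if err >= best:                      # keep x_j only if E(F ∪ {x_j}) < E(F)
            break                            # no improvement: local optimum, stop
        F, best = F + [j], err
    return F, best

# toy check: the "useful" subset is {0, 3}; E counts how far F is from it
E = lambda F: len({0, 3}.symmetric_difference(F))
print(forward_search(5, E))                  # -> ([0, 3], 0)
```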

### Principal Components Analysis (PCA)

Find a low-dimensional space such that when $\mathbf{x}$ is projected there, information loss is minimized. The projection of $\mathbf{x}$ on the direction of $\mathbf{w}$ is $z = \mathbf{w}^T\mathbf{x}$. Find $\mathbf{w}$ such that $\mathrm{Var}(z)$ is maximized:

$$\mathrm{Var}(z) = \mathrm{Var}(\mathbf{w}^T\mathbf{x}) = E\big[(\mathbf{w}^T\mathbf{x} - \mathbf{w}^T\boldsymbol{\mu})^2\big] = E\big[\mathbf{w}^T(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^T\mathbf{w}\big] = \mathbf{w}^T E\big[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^T\big]\,\mathbf{w} = \mathbf{w}^T \boldsymbol{\Sigma}\,\mathbf{w}$$

where $\mathrm{Cov}(\mathbf{x}) = E[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^T] = \boldsymbol{\Sigma}$.

Maximize $\mathrm{Var}(z)$ subject to $\lVert\mathbf{w}\rVert = 1$; setting the gradient of the Lagrangian $\mathbf{w}_1^T\boldsymbol{\Sigma}\mathbf{w}_1 - \alpha(\mathbf{w}_1^T\mathbf{w}_1 - 1)$ to zero gives

$$\boldsymbol{\Sigma}\mathbf{w}_1 = \alpha\mathbf{w}_1,$$

that is, $\mathbf{w}_1$ is an eigenvector of $\boldsymbol{\Sigma}$. Choose the eigenvector with the largest eigenvalue for $\mathrm{Var}(z)$ to be maximum. For the second principal component, maximize $\mathrm{Var}(z_2)$ subject to $\lVert\mathbf{w}_2\rVert = 1$ and $\mathbf{w}_2$ orthogonal to $\mathbf{w}_1$; then $\boldsymbol{\Sigma}\mathbf{w}_2 = \alpha\mathbf{w}_2$, that is, $\mathbf{w}_2$ is another eigenvector of $\boldsymbol{\Sigma}$ (the one with the second largest eigenvalue), and so on.

### What PCA Does

$$\mathbf{z} = \mathbf{W}^T(\mathbf{x} - \mathbf{m})$$

where the columns of $\mathbf{W}$ are the eigenvectors of $\boldsymbol{\Sigma}$ and $\mathbf{m}$ is the sample mean: PCA centers the data at the origin and rotates the axes.

### How to Choose k?

Sort the eigenvalues $\lambda_i$ in descending order and compute the proportion of variance (PoV) explained by the first $k$ components:

$$\mathrm{PoV}(k) = \frac{\lambda_1 + \lambda_2 + \cdots + \lambda_k}{\lambda_1 + \lambda_2 + \cdots + \lambda_d}$$

Typically, stop at $\mathrm{PoV} > 0.9$. Alternatively, plot the scree graph of PoV versus $k$ and stop at the "elbow". A numerical sketch follows.
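A minimal NumPy sketch of the PCA pipeline just described: center the data, eigendecompose the sample covariance, sort the eigenvalues in descending order, and keep the smallest $k$ whose PoV passes a threshold. The function and argument names are illustrative, not from the slides.

```python
import numpy as np

def pca(X, pov_threshold=0.9):
    """PCA via eigendecomposition of the sample covariance. X is N x d."""
    m = X.mean(axis=0)                       # sample mean m
    Xc = X - m                               # center the data at the origin
    lam, W = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(lam)[::-1]            # eigh returns ascending eigenvalues
    lam, W = lam[order], W[:, order]         # columns of W = eigenvectors of Σ
    pov = np.cumsum(lam) / lam.sum()         # PoV(k) for k = 1, ..., d
    k = int(np.searchsorted(pov, pov_threshold)) + 1  # smallest k with PoV ≥ threshold
    return Xc @ W[:, :k], lam, k             # z = W^T (x - m): rotated axes

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # correlated toy data
Z, lam, k = pca(X)                           # Z is the N x k projected data
```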
### Factor Analysis (FA)

Find a small number of factors $\mathbf{z}$ which, when combined, generate $\mathbf{x}$:

$$x_i - \mu_i = v_{i1} z_1 + v_{i2} z_2 + \cdots + v_{ik} z_k + \varepsilon_i$$

where the $z_j$, $j = 1, \ldots, k$, are the latent factors with $E[z_j] = 0$, $\mathrm{Var}(z_j) = 1$, and $\mathrm{Cov}(z_i, z_j) = 0$ for $i \ne j$; the $\varepsilon_i$ are the noise sources with $\mathrm{Var}(\varepsilon_i) = \psi_i$, $\mathrm{Cov}(\varepsilon_i, \varepsilon_j) = 0$ for $i \ne j$, and $\mathrm{Cov}(\varepsilon_i, z_j) = 0$; and the $v_{ij}$ are the factor loadings.

### PCA vs FA

- PCA goes from $\mathbf{x}$ to $\mathbf{z}$: $\mathbf{z} = \mathbf{W}^T(\mathbf{x} - \boldsymbol{\mu})$
- FA goes from $\mathbf{z}$ to $\mathbf{x}$: $\mathbf{x} - \boldsymbol{\mu} = \mathbf{V}\mathbf{z} + \boldsymbol{\varepsilon}$

In FA, the factors $z_j$ are stretched, rotated and translated to generate $\mathbf{x}$.

### Multidimensional Scaling (MDS)

Given the pairwise distances $d_{ij}$, $i, j = 1, \ldots, N$, between $N$ points, place the points on a low-dimensional map such that the distances are preserved. With a mapping $\mathbf{z} = g(\mathbf{x} \mid \theta)$, find $\theta$ that minimizes the Sammon stress

$$E(\theta \mid \mathcal{X}) = \sum_{r,s} \frac{\big(\lVert \mathbf{z}^r - \mathbf{z}^s \rVert - \lVert \mathbf{x}^r - \mathbf{x}^s \rVert\big)^2}{\lVert \mathbf{x}^r - \mathbf{x}^s \rVert^2}$$

(a gradient-descent sketch appears at the end of these notes).

### Map of Europe by MDS

*(Figure omitted. Map from CIA – The World Factbook: http://www.cia.gov/)*

### Linear Discriminant Analysis (LDA)

Find a low-dimensional space such that when $\mathbf{x}$ is projected onto it, the classes are well separated. For two classes, find the $\mathbf{w}$ that maximizes

$$J(\mathbf{w}) = \frac{(m_1 - m_2)^2}{s_1^2 + s_2^2}$$

where $m_i = \mathbf{w}^T\mathbf{m}_i$ is the projected mean of class $i$ and $s_i^2$ its projected scatter.

**Between-class scatter:**

$$(m_1 - m_2)^2 = (\mathbf{w}^T\mathbf{m}_1 - \mathbf{w}^T\mathbf{m}_2)^2 = \mathbf{w}^T(\mathbf{m}_1 - \mathbf{m}_2)(\mathbf{m}_1 - \mathbf{m}_2)^T\mathbf{w} = \mathbf{w}^T\mathbf{S}_B\,\mathbf{w}, \qquad \mathbf{S}_B = (\mathbf{m}_1 - \mathbf{m}_2)(\mathbf{m}_1 - \mathbf{m}_2)^T$$

**Within-class scatter:**

$$s_1^2 + s_2^2 = \mathbf{w}^T\mathbf{S}_W\,\mathbf{w}, \qquad \mathbf{S}_W = \mathbf{S}_1 + \mathbf{S}_2, \qquad \mathbf{S}_i = \sum_t r_i^t\,(\mathbf{x}^t - \mathbf{m}_i)(\mathbf{x}^t - \mathbf{m}_i)^T$$

where $r_i^t = 1$ if $\mathbf{x}^t \in C_i$ and 0 otherwise.

### Fisher's Linear Discriminant

Find $\mathbf{w}$ that maximizes

$$J(\mathbf{w}) = \frac{\mathbf{w}^T\mathbf{S}_B\,\mathbf{w}}{\mathbf{w}^T\mathbf{S}_W\,\mathbf{w}}$$

- LDA solution: $\mathbf{w} = c\,\mathbf{S}_W^{-1}(\mathbf{m}_1 - \mathbf{m}_2)$ for a constant $c$.
- Parametric solution, when $p(\mathbf{x} \mid C_i) \sim \mathcal{N}(\boldsymbol{\mu}_i, \boldsymbol{\Sigma})$: $\mathbf{w} = \boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$.

### K > 2 Classes

- Within-class scatter: $\mathbf{S}_W = \sum_{i=1}^{K} \mathbf{S}_i$, with $\mathbf{S}_i = \sum_t r_i^t\,(\mathbf{x}^t - \mathbf{m}_i)(\mathbf{x}^t - \mathbf{m}_i)^T$.
- Between-class scatter: $\mathbf{S}_B = \sum_{i=1}^{K} N_i\,(\mathbf{m}_i - \mathbf{m})(\mathbf{m}_i - \mathbf{m})^T$, where $\mathbf{m}$ is the overall mean and $N_i$ the number of instances of class $C_i$.
- Find $\mathbf{W}$ that maximizes
  $$J(\mathbf{W}) = \frac{\lvert \mathbf{W}^T\mathbf{S}_B\,\mathbf{W} \rvert}{\lvert \mathbf{W}^T\mathbf{S}_W\,\mathbf{W} \rvert};$$
  the columns of the optimal $\mathbf{W}$ are the largest eigenvectors of $\mathbf{S}_W^{-1}\mathbf{S}_B$ (at most $K - 1$ are useful, since $\mathbf{S}_B$ has rank at most $K - 1$).
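Returning to the two-class case, a minimal NumPy sketch of Fisher's solution $\mathbf{w} \propto \mathbf{S}_W^{-1}(\mathbf{m}_1 - \mathbf{m}_2)$. The interface (two separate per-class data matrices) is an illustrative assumption, not something the slides prescribe.

```python
import numpy as np

def fisher_lda(X1, X2):
    """Two-class Fisher discriminant: w proportional to S_W^{-1}(m1 - m2)."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - m1).T @ (X1 - m1)             # scatter of class 1 around m1
    S2 = (X2 - m2).T @ (X2 - m2)             # scatter of class 2 around m2
    w = np.linalg.solve(S1 + S2, m1 - m2)    # solve S_W w = m1 - m2
    return w / np.linalg.norm(w)             # only the direction of w matters

# projecting onto w gives the 1-D values z = w^T x for each class:
# z1, z2 = X1 @ w, X2 @ w
```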

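As promised in the MDS section, a sketch that minimizes the Sammon stress by gradient descent directly over the map coordinates $\mathbf{z}^r$ (a nonparametric stand-in for fitting $\theta$ in $\mathbf{z} = g(\mathbf{x} \mid \theta)$). The initialization scale, step size, and iteration count are arbitrary assumptions and may need tuning for the scale of `D`.

```python
import numpy as np

def sammon_map(D, k=2, iters=500, lr=0.05, seed=0):
    """Gradient descent on the Sammon stress
        E = sum_{r<s} (||z^r - z^s|| - d_rs)^2 / d_rs^2
    where D is the N x N matrix of given pairwise distances d_rs
    (assumed > 0 off the diagonal)."""
    N = D.shape[0]
    Z = np.random.default_rng(seed).normal(scale=1e-2, size=(N, k))
    Dsafe = D + np.eye(N)                    # keep d_rr = 0 out of denominators
    for _ in range(iters):
        diff = Z[:, None, :] - Z[None, :, :]           # z^r - z^s, N x N x k
        dz = np.maximum(np.linalg.norm(diff, axis=2), 1e-12)
        # dE/dz^r = sum_s 2 (||z^r - z^s|| - d_rs) / (d_rs^2 ||z^r - z^s||) (z^r - z^s)
        coef = 2.0 * (dz - D) / (Dsafe ** 2 * dz)
        np.fill_diagonal(coef, 0.0)                    # drop the r = s terms
        Z -= lr * (coef[:, :, None] * diff).sum(axis=1)
    r, s = np.triu_indices(N, 1)                       # each pair r < s once
    dz = np.linalg.norm(Z[r] - Z[s], axis=1)
    return Z, (((dz - D[r, s]) ** 2) / D[r, s] ** 2).sum()

X = np.random.default_rng(1).normal(size=(20, 5))
D = np.linalg.norm(X[:, None] - X[None, :], axis=2)    # distances to preserve
Z, stress = sammon_map(D)                              # 2-D map of the 20 points
```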