CS 559: Machine Learning Fundamentals and Applications
7th Set of Notes
Instructor: Philippos Mordohai
Webpage: www.cs.stevens.edu/~mordohai
E-mail: Philippos.Mordohai@stevens.edu
Office: Lieb 215

Project Proposal
• Project title
• Data set
• Project idea: What is the objective, what method(s) will be tested?
• Relevant papers
  – Optional, but a good strategy
• Software you plan to write
• Experiments you plan to do
• Has to be approved by me before October 19

Overview
• Week 6 Notes
  – Hidden Markov Models (DHS Sec. 3.10-3.10.5)
  – Principal Component Analysis (DHS Sec. 3.8.1 – essentially notes only)
    • Intro
    • Singular Value Decomposition
• Dimensionality Reduction – PCA in practice (this set of notes)

Dimensionality Reduction
(slides by Carlos Guestrin)

Motivation: Dimensionality Reduction
• Input data may have thousands or millions of dimensions!
  – text data have thousands of words
  – image data have millions of pixels
• Dimensionality reduction: represent data with fewer dimensions
  – Easier learning – fewer parameters
  – Visualization – hard to visualize more than 3D or 4D
  – Discover the "intrinsic dimensionality" of data: high-dimensional data that is truly lower dimensional (e.g., the identity of objects in an image << the number of pixels)

Feature Selection
• Given a set of features X = <X1, …, Xn>
• Some features are more important than others
• Approach: select a subset of features to be used by the learning algorithm
  – Score each feature (or sets of features)
  – Select the set of features with the best score
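As a concrete illustration of score-based feature selection, here is a minimal NumPy sketch. The helper names (score_features, select_top_k) and the choice of absolute correlation with the labels as the score are my own illustrative assumptions, not from the notes:

```python
import numpy as np

def score_features(X, y):
    """Score each feature column by its absolute correlation with the
    labels (one simple choice of score; others work too)."""
    scores = []
    for j in range(X.shape[1]):
        # np.corrcoef returns a 2x2 correlation matrix; take the off-diagonal
        scores.append(abs(np.corrcoef(X[:, j], y)[0, 1]))
    return np.array(scores)

def select_top_k(X, y, k):
    """Keep the k highest-scoring features."""
    scores = score_features(X, y)
    keep = np.argsort(scores)[::-1][:k]
    return X[:, keep], keep

# Toy data: feature 0 is informative, feature 1 is pure noise
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=100).astype(float)
X = np.column_stack([y + 0.1 * rng.standard_normal(100),
                     rng.standard_normal(100)])
X_sel, keep = select_top_k(X, y, k=1)
print(keep)  # feature 0 should score highest
```

Scoring features one at a time ignores interactions between features, which is exactly the weakness the greedy wrapper methods below try to address.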
Greedy Forward Feature Selection
• Greedy heuristic:
  – Start from an empty (or simple) set of features F0 = Ø
  – Run the learning algorithm for the current set of features Ft
  – Select the next best feature Xi
    • e.g., the one that results in the lowest error when learning with Ft ∪ {Xi}
  – Ft+1 ← Ft ∪ {Xi}
  – Recurse

Greedy Backward Feature Selection
• Greedy heuristic:
  – Start from the set of all features F0 = F
  – Run the learning algorithm for the current set of features Ft
  – Select the next worst feature Xi
    • e.g., the one that results in the lowest error when learning with Ft − {Xi}
  – Ft+1 ← Ft − {Xi}
  – Recurse

Lower Dimensional Projections
• How would this work for the ball-spring example?
• Rather than picking a subset of the features, we can derive new features that are combinations of the existing features

Projection
• Given m data points: xi = (x1i, …, xni), i = 1…m
• Represent each point as a projection onto a basis (u1, …, uk): in the standard formulation, x̂i = x̄ + Σj=1..k zji uj, with coordinates zji = ujT(xi − x̄)
• If k = n, then the projected data are equivalent to the original data

PCA
• PCA finds the projection that minimizes the reconstruction error
  – Reconstruction error: norm of the distance between the original and projected data, Σi ||xi − x̂i||2
• Given k ≤ n, find (u1, …, uk) minimizing the reconstruction error
• The error depends on the k+1, …, n unused basis vectors
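The reconstruction-error view of PCA can be checked numerically with a short sketch. This assumes the standard formulation above (center the data, project onto the top-k directions, measure the squared residual); the function name reconstruction_error is illustrative:

```python
import numpy as np

def reconstruction_error(X, k):
    """Project centered data onto the top-k principal directions and
    return the squared norm of the residual (the reconstruction error)."""
    Xc = X - X.mean(axis=0)            # re-center the data
    # principal directions via SVD of the centered data matrix
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    U_k = Vt[:k].T                     # n x k basis (u1, ..., uk)
    Z = Xc @ U_k                       # coordinates in the eigen-basis
    X_hat = Z @ U_k.T                  # reconstruction (still centered)
    return np.sum((Xc - X_hat) ** 2)

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 5))
errs = [reconstruction_error(X, k) for k in range(1, 6)]
# The error shrinks as k grows, and vanishes when k = n
print(errs)
```

This makes the last bullet concrete: the error that remains for a given k is exactly the variance carried by the n − k unused basis vectors.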
Basic PCA Algorithm
• Start from an m × n data matrix X
  – m data points (samples over time)
  – n measurement types
• Re-center: subtract the mean from each row of X, giving Xc
• Compute the covariance matrix: Σ = XcT Xc
• Compute the eigenvectors and eigenvalues of Σ
  – Note: the covariance matrix is n × n (measurement types), but there may be exceptions
• Principal components: the k eigenvectors with the highest eigenvalues

SVD
• Efficiently finds the top k eigenvectors
  – Much faster than a full eigen-decomposition
• Write X = U S VT
  – X: data matrix, one row per data point
  – U: weight matrix, one row per data point – the coordinates of xi in eigen-space
  – S: singular value matrix, a diagonal matrix
    • in our setting, each entry is the eigenvalue λj of Σ
  – VT: singular vector matrix
    • in our setting, each row is an eigenvector vj of Σ

Using PCA for Dimensionality Reduction
• Given a set of features X = <X1, …, Xn>
• Some features are more important than others
  – Reduce noise and redundancy
• Also consider:
  – Rotation
• Approach: use PCA on X to select a few important features
• Then, apply a classification technique in the reduced space
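The PCA-via-SVD pipeline above can be sketched in a few lines of NumPy. The function name pca_svd is my own; note also that, numerically, it is the squared singular values of Xc that equal the eigenvalues of Σ = XcT Xc, which the sketch verifies:

```python
import numpy as np

def pca_svd(X, k):
    """PCA via SVD, following the slides: re-center X, write Xc = U S V^T,
    and take the top-k rows of V^T as the principal components."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]        # principal directions (rows of V^T)
    Z = Xc @ components.T      # low-dimensional representation of the data
    return components, Z, S

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 4))
comps, Z, S = pca_svd(X, k=2)

# Relation to the eigen-decomposition of Sigma = Xc^T Xc:
# the squared singular values equal the eigenvalues of Sigma
Xc = X - X.mean(axis=0)
eigvals = np.sort(np.linalg.eigvalsh(Xc.T @ Xc))[::-1]
print(np.allclose(S ** 2, eigvals))  # True
```

The rows of Z are then the reduced-dimension features one would feed to a classifier, as the last slide suggests.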