UCI ICS 278 - Low-Dimensional Representations of High-Dimensional Data

ICS 278: Data Mining
Lecture 5: Low-Dimensional Representations of High-Dimensional Data
Padhraic Smyth, UC Irvine

Today's lecture
- Project proposal deadline extended to Monday, 8am; questions
- Outline of today's lecture:
  - Orphan slides from earlier lectures
  - Dimension reduction methods: motivation, variable selection methods, linear projection techniques, non-linear embedding methods

Notation Reminder
- n objects, each with p measurements
- Data vector for the ith object: x(i) = (x_1(i), x_2(i), ..., x_p(i))
- Data matrix: x_j(i) is the ith row, jth column; columns = variables, rows = data points
- Can define distances/similarities between rows (data vectors i) or between columns (variables j)

Distance
- n objects with p measurements: x = (x_1, x_2, ..., x_p), y = (y_1, y_2, ..., y_p)
- The most common distance metric is Euclidean distance:

    d_E(x, y) = \left( \sum_{k=1}^{p} (x_k - y_k)^2 \right)^{1/2}

- This makes sense in the case where the different measurements are commensurate, i.e., each variable is measured in the same units
- If the measurements are different, say length and weight, Euclidean distance is not necessarily meaningful

Dependence among Variables
- Covariance and correlation measure linear dependence ("distance" between variables, not objects)
- Assume we have two variables (attributes) X and Y, and n objects taking on values x(1), ..., x(n) and y(1), ..., y(n). The sample covariance of X and Y is

    Cov(X, Y) = \frac{1}{n} \sum_{i=1}^{n} (x(i) - \bar{x})(y(i) - \bar{y})

- The covariance is a measure of how X and Y vary together: it will be large and positive if large values of X are associated with large values of Y, and small X with small Y

Correlation coefficient
- Covariance depends on the ranges of X and Y, so standardize by dividing by the standard deviations. The linear correlation coefficient is

    \rho(X, Y) = \frac{ \sum_{i=1}^{n} (x(i) - \bar{x})(y(i) - \bar{y}) }{ \left[ \sum_{i=1}^{n} (x(i) - \bar{x})^2 \, \sum_{i=1}^{n} (y(i) - \bar{y})^2 \right]^{1/2} }

  (a code sketch for both quantities follows the Mahalanobis examples below)

Sample Correlation Matrix
[Figure: sample correlation matrix, values from -1 to 1, for data on characteristics of Boston suburbs; variables include business acreage, nitrous oxide, average rooms, median house value, and percentage of large residential lots]

Mahalanobis distance
- Between objects x and y:

    d_MH(x, y) = \left( (x - y)^T \Sigma^{-1} (x - y) \right)^{1/2}

  where (x - y) is the vector difference in p-dimensional space, \Sigma^{-1} is the inverse covariance matrix, and the expression evaluates to a scalar distance
- Benefits:
  1. It automatically accounts for the scaling of the coordinate axes
  2. It corrects for correlation between the different features
- Costs:
  1. The covariance matrices can be hard to determine accurately
  2. The memory and time requirements grow quadratically, O(p^2), rather than linearly with the number of features

Example 1 of Mahalanobis distance
- Covariance matrix is diagonal and isotropic (all dimensions have equal variance)
- MH distance reduces to Euclidean distance

Example 2 of Mahalanobis distance
- Covariance matrix is diagonal but non-isotropic (dimensions do not have equal variance)
- MH distance reduces to weighted Euclidean distance, with weights equal to the inverse variances
- [Figure: the two outer blue points have the same MH distance to the center blue point]
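To make the covariance and correlation slides above concrete, here is a minimal NumPy sketch (the function names and toy data are mine, not from the lecture); it follows the slide's 1/n convention for the covariance rather than NumPy's default 1/(n-1):

```python
import numpy as np

def sample_cov(x, y):
    """Sample covariance with the 1/n convention used on the slide."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.mean((x - x.mean()) * (y - y.mean()))

def correlation(x, y):
    """Linear correlation coefficient: covariance standardized by the std devs."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    num = np.sum((x - x.mean()) * (y - y.mean()))
    den = np.sqrt(np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2))
    return num / den

# Toy data: y roughly increases with x, so both statistics are positive.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.1, 1.9, 3.2, 3.8, 5.1]
print(sample_cov(x, y))   # large and positive: big x goes with big y
print(correlation(x, y))  # close to +1; always in [-1, 1]
```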
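Likewise, a small sketch of the Mahalanobis distance and the two special cases just worked through: an isotropic covariance recovers plain Euclidean distance, and a diagonal covariance recovers inverse-variance-weighted Euclidean distance. The toy vectors and covariance matrices are assumptions for illustration:

```python
import numpy as np

def mahalanobis(x, y, cov):
    """d_MH(x, y) = sqrt((x - y)^T cov^{-1} (x - y))."""
    d = np.asarray(x, float) - np.asarray(y, float)
    return float(np.sqrt(d @ np.linalg.inv(cov) @ d))

x = np.array([1.0, 2.0])
y = np.array([4.0, 6.0])

# Example 1: diagonal, isotropic covariance -> plain Euclidean distance.
iso = np.eye(2)
print(mahalanobis(x, y, iso), np.linalg.norm(x - y))  # both 5.0

# Example 2: diagonal, non-isotropic covariance -> weighted Euclidean
# distance with weights equal to the inverse variances.
variances = np.array([1.0, 4.0])
diag = np.diag(variances)
weighted = np.sqrt(np.sum((x - y) ** 2 / variances))
print(mahalanobis(x, y, diag), weighted)  # both sqrt(9 + 16/4) = sqrt(13)
```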
Distances between Binary Vectors
- Contingency counts over the p variables for two binary items i and j:

              j = 1   j = 0
    i = 1      n11     n10
    i = 0      n01     n00

  e.g., n01 is the number of variables where item j = 1 and item i = 0
- Matching coefficient: (n11 + n00) / (n11 + n10 + n01 + n00)
- Jaccard coefficient (e.g., for sparse vectors; nonsymmetric, since 0-0 matches are ignored): n11 / (n11 + n10 + n01)
  (a code sketch for both coefficients appears at the end of these notes)

Other distance metrics
- Categorical variables: number of matches divided by number of dimensions
- Distances between strings of different lengths, e.g., "Patrick J. Smyth" and "Padhraic Smyth": edit distance (sketched at the end of these notes)
- Distances between images and waveforms: shift-invariant, scale-invariant, i.e., d(x, y) = \min_{a,b} d(ax + b, y)
- More generally: kernel methods

Transforming Data
- Duality between the form of the data and the model
- Useful to bring data onto a natural scale; some variables are very skewed (e.g., income)
- Common transforms: square root, reciprocal, logarithm, raising to a power; often very useful when dealing with skewed real-world data
- Logit: transforms p from (0, 1) to the real line: logit(p) = \log( p / (1 - p) )

Data Quality: Individual measurements
- Random noise in individual measurements
  - Variance (precision), bias
  - Random data entry errors
  - Noise in label assignment (e.g., class labels in medical data sets)
- Systematic errors
  - E.g., all ages over 99 recorded as "99"
  - More individuals aged 20, 30, 40, etc. than expected
- Missing information
  - Missing at random: questions on a questionnaire that people randomly forget to fill in
  - Missing systematically: questions that people don't want to answer; patients who are too ill for a certain test

Data Quality: Collections of measurements
- Ideal case: a random sample from the population of interest
- Real case: often a biased sample of some sort
- Key point: patterns or models built on the training data may only be valid on future data that comes from the same distribution
- Examples of non-randomly sampled data:
  - Medical study where the subjects are all students
  - Geographic dependencies
  - Temporal dependencies
  - Stratified samples (e.g., 50% healthy, 50% ill)
  - Hidden systematic effects (e.g., market basket data from the weekend of a large sale in the store, or web log data during finals week)

"Classifier technology and the illusion of progress" (abstract for the workshop on the State of the Art in Supervised Classification, May 2006), Professor David J. Hand, Imperial College London:

  "Supervised classification methods are widely used in data mining. Highly sophisticated methods have been developed, using the full power of recent advances in computation. However, these advances have largely taken place within the context of a classical paradigm, in which construction of the classification rule is based on a 'design sample' of data randomly sampled from unknown but well-defined distributions of ..."
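Appendix-style sketches for the earlier slides follow. First, a minimal implementation of the matching and Jaccard coefficients from the "Distances between Binary Vectors" slide (function names and the toy vectors are mine):

```python
def binary_counts(u, v):
    """Counts n11, n10, n01, n00 over the dimensions of two 0/1 vectors."""
    n11 = sum(1 for a, b in zip(u, v) if a == 1 and b == 1)
    n10 = sum(1 for a, b in zip(u, v) if a == 1 and b == 0)
    n01 = sum(1 for a, b in zip(u, v) if a == 0 and b == 1)
    n00 = sum(1 for a, b in zip(u, v) if a == 0 and b == 0)
    return n11, n10, n01, n00

def matching_coefficient(u, v):
    n11, n10, n01, n00 = binary_counts(u, v)
    return (n11 + n00) / (n11 + n10 + n01 + n00)

def jaccard_coefficient(u, v):
    """Ignores 0-0 matches: appropriate for sparse, nonsymmetric binary data."""
    n11, n10, n01, _ = binary_counts(u, v)
    return n11 / (n11 + n10 + n01)

u = [1, 0, 1, 1, 0, 0, 0, 0]
v = [1, 0, 0, 1, 0, 0, 0, 1]
print(matching_coefficient(u, v))  # 0.75: the many shared zeros inflate this
print(jaccard_coefficient(u, v))   # 0.5: only dimensions where either is 1 count
```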

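Next, the edit distance mentioned under "Other distance metrics", using the standard dynamic-programming recurrence for Levenshtein distance (this particular implementation is mine, not from the lecture):

```python
def edit_distance(s, t):
    """Minimum number of insertions, deletions, and substitutions turning s into t."""
    m, n = len(s), len(t)
    # dp[i][j] = edit distance between the prefixes s[:i] and t[:j].
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # delete all of s[:i]
    for j in range(n + 1):
        dp[0][j] = j          # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # match / substitution
    return dp[m][n]

print(edit_distance("Patrick J. Smyth", "Padhraic Smyth"))
```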

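Finally, a sketch of the skew-reducing and logit transforms from the "Transforming Data" slide; the toy income values are invented for illustration:

```python
import numpy as np

# A skewed variable such as income: a few very large values dominate.
income = np.array([18_000, 25_000, 31_000, 42_000, 55_000, 900_000], dtype=float)

# Common transforms to bring the data onto a more natural scale.
log_income = np.log(income)     # logarithm
sqrt_income = np.sqrt(income)   # square root
recip_income = 1.0 / income     # reciprocal

def logit(p):
    """Map a proportion p in (0, 1) to the real line: log(p / (1 - p))."""
    p = np.asarray(p, float)
    return np.log(p / (1.0 - p))

print(log_income.round(2))      # the outlier no longer dominates the scale
print(logit([0.1, 0.5, 0.9]))   # symmetric about 0: [-2.197, 0.0, 2.197]
```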