# ILLINOIS CS 446 - 083117.1_revised (5 pages)

- Pages: 5
- School: University of Illinois Urbana-Champaign
- Course: CS 446 - Machine Learning

**CS 446: Machine Learning, Fall 2017**

**Lecture 2: Generalization; Cross-validation**

*Lecturer: Sanmi Koyejo. Scribe: Hsiao-Ching Chang. Aug 31st, 2017.*

## Recap from Last Lecture

### Notation

We defined an instance space $\mathcal{X}$ and an output (label) space $\mathcal{Y}$. These are the spaces that examples $(x_i, y_i)$ live in; that is, $x_i \in \mathcal{X}$ and $y_i \in \mathcal{Y}$. The output label space can differ from task to task. Below is a list of common tasks and their label spaces:

- Binary classification: $\mathcal{Y} = \{-1, 1\}$, or sometimes $\mathcal{Y} = \{0, 1\}$
- Multi-class classification: $\mathcal{Y} = \{1, \dots, p\}$, where $p$ denotes the number of classes
- Regression: $\mathcal{Y} = \mathbb{R}$
- Probability regression: $\mathcal{Y} = [0, 1]$
- Multi-output regression: $\mathcal{Y} = \mathbb{R}^p$, where we try to predict a $p$-dimensional vector

The dataset is defined as $D = \{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}$. We also defined a learner (or hypothesis) $h: \mathcal{X} \to \mathcal{Y}$, which takes inputs from the input space $\mathcal{X}$ and makes predictions in the output space $\mathcal{Y}$.

### Nearest Neighbor Classifier

For the nearest neighbor algorithm we need to define two more things:

- A distance function, which takes a pair of inputs and maps it to a distance, i.e. a function $\mathcal{X} \times \mathcal{X} \to \mathbb{R}$. For example, for two instances $x_1, x_2$ of $n$ dimensions we can use the Minkowski distance, defined as

  $$\left( \sum_{i=1}^{n} \left| x_1^i - x_2^i \right|^p \right)^{1/p}$$

  where $p$ denotes the order of the metric and $x_1^i$ denotes the $i$-th element of instance $x_1$. The metric is called the Manhattan distance when $p = 1$ and the Euclidean distance when $p = 2$.
- The number of nearest neighbors, $k$.

With this setting, the nearest neighbor classifier $h$ can be defined as

$$h(x) = \mathrm{sign}\left( \frac{1}{k} \sum_{i \in N_k(x, D)} y_i \right)$$

where $N_k(x, D)$ denotes the $k$ nearest neighbors of $x$ in the dataset $D$.

The main inductive bias we see in nearest neighbor classifiers is spatial smoothness: nearby points have similar labels, so when we see an unknown instance we guess that it belongs to the majority class of its neighbors.

We also discussed the pros and cons of nearest neighbor classifiers.

Pros:

- Learning is free, since there is no learning or training step.
- It works well in practice.

Cons:

- Testing cost is high. Consider the nearest neighbor classifier $h$ we defined earlier, with the Euclidean distance and $\mathcal{X} = \mathbb{R}^d$. The computational complexity is $O(nkd)$ and the memory consumption is $O(nd)$. Note that we are estimating these complexities for a naive implementation; there exist tricks that make this classifier more efficient, such as k-d trees or hashing.
- It behaves badly in high dimensions. Suppose we have instance space $\mathcal{X} = \mathbb{R}^d$. When $d$ reaches 10,000 or more, problems arise. Such high-dimensional problems are actually common in applications: if we use bag-of-words vectors for spam classification, $d$ is the vocabulary size. Other cases include brain imaging data and computer vision applications.

## Evaluation: Performance Metrics

Consider a learner $h: \mathcal{X} \to \mathcal{Y}$. We want to discuss the performance of $h$. Some commonly used metrics are listed below:

- Error rate, which can be computed as $\frac{1}{n} \sum_{i=1}^{n} \mathbb{1}[y_i \neq h(x_i)]$
- Accuracy, which is $1 - \text{error rate}$, or $\frac{1}{n} \sum_{i=1}^{n} \mathbb{1}[y_i = h(x_i)]$
- Precision and recall
- Specificity, which tells how specific the classifier is in predicting positive instances

If we are designing an evaluation metric, there are some things to consider: incorporation of domain knowledge, trade-offs, and robustness. There are many performance metrics, but we will focus on error rates in this class.

The types of data on which we are interested in computing error rates include training data, test data, and perhaps bootstrap data. Of all of these, we care most about the unseen test data. We could even say the most important concept in machine learning is maximizing the performance metric on new data. New data can be either typical or adversarial; we will focus on typical data and will not spend much time on adversarial data in this class.

## Data Generation

When we say data is typical, we assume there is a fixed data generation engine, and that we can draw samples $(x_i, y_i)$ i.i.d. from a population $P$. That is,

$$x \sim P_X, \qquad y \mid x \sim P_{Y \mid X}.$$

For example, if we have a learner $h$ and a dataset $D = \{(x_i, y_i)\}_{i=1}^{n}$ generated from the population $P$, we can compute the error rate on this dataset as

$$\mathrm{Error}(h, D) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}[y_i \neq h(x_i)]$$

and the error rate on the whole population $P$ as

$$\mathrm{Error}(h, P) = \mathbb{E}_{(x, y) \sim P}\big[\mathbb{1}[y \neq h(x)]\big] = P(y \neq h(x)).$$

If we generate infinitely many samples, the error rate on the dataset converges to the error rate on the population. That is, as $n \to \infty$,

$$\frac{1}{n} \sum_{i=1}^{n} \mathbb{1}[y_i \neq h(x_i)] \to \mathrm{Error}(h, P).$$

The same applies to any performance metric. Generally, risk functions $R$ and utility functions $U$ are the two kinds of functions we consider, and the same idea applies to both, e.g. measuring $R(h, D_n)$ and $R(h, P)$. Since the two functions are the inverse of each other, we can simply use $U(h, D_n) = -R(h, D_n)$ in most cases. Take accuracy and error rate for example: accuracy is a utility function and error rate is a risk function, and we have $\text{accuracy} = 1 - \text{error rate}$.

## Generalization

### Generalization Error

The generalization error $G_n$ can be computed as

$$G_n = \mathrm{Error}(h_n, P) - \mathrm{Error}(h_n, D_n),$$

where $n$ denotes the number of samples and $h_n$ is the learner trained on the $n$-sample dataset $D_n$ drawn from the population $P$. We say that a learner $h$ generalizes when, as $n \to \infty$, $G_n \to 0$. Similarly, we can apply this to any risk or utility function: for instance, as $n \to \infty$, $R(h_n, D_n) \to R(h_n, P)$.

### Approximation

Since we never see $P$ in practice, we need some tricks to approximate the performance. Some approaches include:

- Train/test split: the data is split into a training set and a test set. We train the learner on the training set and use the test set to evaluate out-of-sample performance.
- Cross-validation: leave-one-out cross-validation, k-fold cross-validation, and many others.
- Bootstrap.

We will briefly go through how k-fold cross-validation works. Suppose we have data $D$. We first separate $D$ into $k$ partitions $D_1, \dots, D_k$. For each subset $D_i$, we train on $D \setminus D_i$ and then evaluate the trained learner on $D_i$. After iterating over all $k$ subsets, we approximate the final performance by averaging the $k$ per-fold performances.

When we have a lot of data or limited computation, we typically choose the train/test split. On the other hand, when we have limited data or more computation available, we often choose cross-validation.
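The nearest-neighbor classifier and Minkowski distance recapped above can be sketched in a few lines of Python. This is a minimal illustration using NumPy, not the course's code: the function names are mine, labels are assumed to be in {-1, +1}, and an odd `k` avoids ties.

```python
import numpy as np

def minkowski(x1, x2, p=2):
    """Minkowski distance of order p: p=1 is Manhattan, p=2 is Euclidean."""
    return float(np.sum(np.abs(x1 - x2) ** p) ** (1.0 / p))

def knn_predict(x, X_train, y_train, k=3, p=2):
    """Predict the sign of the mean label of the k nearest training points
    (binary labels in {-1, +1}, matching the formula in the notes)."""
    dists = np.array([minkowski(x, xi, p) for xi in X_train])
    nearest = np.argsort(dists)[:k]   # indices of N_k(x, D)
    return int(np.sign(y_train[nearest].mean()))
```

This naive loop computes all $n$ distances for every query, which is exactly the high test-time cost discussed above; k-d trees or hashing would reduce it.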
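The convergence of the empirical error rate to the population error can also be checked numerically. The snippet below is a sketch under made-up assumptions: the "population" is a synthetic threshold rule on $x \sim \mathrm{Uniform}(0,1)$ with 10% label noise, so the best achievable population error is 0.10, and all names are hypothetical.

```python
import numpy as np

def error_rate(h, X, y):
    """Empirical error rate: fraction of examples where h disagrees with the label."""
    preds = np.array([h(x) for x in X])
    return float(np.mean(preds != y))

rng = np.random.default_rng(0)

def sample(n):
    """Draw n i.i.d. samples from the synthetic population described above."""
    x = rng.uniform(0, 1, n)
    y = np.where(x > 0.5, 1, -1)        # noiseless rule
    flip = rng.uniform(0, 1, n) < 0.10  # flip each label with probability 0.10
    return x, np.where(flip, -y, y)

h = lambda x: 1 if x > 0.5 else -1  # the best possible rule for this population

x_big, y_big = sample(100_000)
print(error_rate(h, x_big, y_big))  # close to the population error of 0.10
```

As $n$ grows, the printed empirical error concentrates around $\mathrm{Error}(h, P) = 0.10$, illustrating the limit stated in the notes.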
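The k-fold cross-validation procedure just described can be sketched from scratch. This is a hypothetical helper, not the course's implementation: `train_fn` fits a learner on a training split and `error_fn` evaluates it on the held-out fold.

```python
import numpy as np

def k_fold_cv(X, y, k, train_fn, error_fn):
    """Split the data into k folds; for each fold D_i, train on the
    remaining k-1 folds, evaluate on D_i, and average the k errors."""
    n = len(X)
    folds = np.array_split(np.arange(n), k)   # index sets for D_1, ..., D_k
    errors = []
    for fold in folds:
        mask = np.ones(n, dtype=bool)
        mask[fold] = False                    # True on the training portion
        h = train_fn(X[mask], y[mask])        # fit on everything except D_i
        errors.append(error_fn(h, X[fold], y[fold]))
    return float(np.mean(errors))
```

Leave-one-out cross-validation is the special case $k = n$, trading more computation for a lower-variance estimate on small datasets.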
