# ILLINOIS CS 446 - 083117.2 (7 pages)

Previewing pages*1, 2*of 7 page document

**View the full content.**## 083117.2

Previewing pages
*1, 2*
of
actual document.

**View the full content.**View Full Document

## 083117.2

0 0 50 views

- Pages:
- 7
- School:
- University of Illinois - urbana
- Course:
- Cs 446 - Machine Learning

**Unformatted text preview:**

CS446 Machine Learning Fall 2017 Lecture 2 Generalization Cross validation Lecturer Sanmi Koyejo 1 Scribe Taoyu Lu Aug 31st 2017 Announcements 1 1 No waitlist policy The CS academic office advised not to maintain a waiting list for CS 446 Machine Learning For students who are not registered and expect to be enrolled they should self register for the course when there are open spots available However non registered students can get access to most of the course materials including Piazza and RELATE The instructor is working on making course recordings available for all as well For more details regarding course registration please contact CS academic office via phone 217 333 4428 or email academic cs illinois edu 1 2 Schedule updated The course schedule is now updated under Schedule section on RELATE course website The current schedule is not the final version Topics might change 1 3 Reading materials Pre lecture reading recommendations are noted on the schedule Students are expected to have sufficient mathematical knowledge to keep up with the concepts 1 2 2 Generalization Cross validation 2 Previous review 2 1 Notations Assuming every example x is in an instance space X This can be written as xi X Define an output space label spaceY in which yi Y Output spaces can be Binary classification Y 1 1 or Y 0 1 Multiclass classification a generalization form of binary classification Y 1 p Regression Y R Probability regression y 0 1 Multi output i e multi output regression y Rp where the goal is to predict a p dimension vector Data set D x1 y1 x2 y2 xn yn Learning algorithm learner hypothesish X Y 2 2 Nearest neighbors A distance function maps a pair of inputs to their distance X X R Euclidean distance xi xj 22 P Minkowski distance ni 1 xi yi p 1 p Define K the number of nearest neighbors With the previous being defined a nearest neighbors classifier can be interpreted as follows X 1 h x sign yi K i Nk x D where NK x D are the indices of all examples that are K nearest neighbors of the example x using the data set D This general algorithm works for classification regression multiclass classification etc The main inductive bias of a nearest neighbors classifier is that nearby points have similar labels which is also called spatial smmothness This depends on the distance function and K because the distance function defines what nearby is and K defines how far you are averaging Pros of NN 2 Generalization Cross validation 3 Learning is free There is no training needed Works well in practice Cons of NN Test cost is high Consdering a nearest neighbors classifier h x sign 1 K P i NK x D yi with euclidean distance K instance space X Rd computation complexity O nkd and memory complexity O nd Comparing to other learning algorithms in performance assumptions they are making and computation and memory overhead the nearest neighbors classifier is not fast enough for real applications However there are tricks to make it faster and more memory efficient such as kd trees hashing etc It behaves badly with high dimension inputs For an instance space X Rd nearest neighbors classifier will have some trouble if d becomes too large i e 10 000 or larger In cases of real application d often gets really large in dimension For example for a spam classification using text d would be the same size as a dictionary In the textbook the author explains that this is due to the curse of dimensionality the mothod is no longer very local The trouble with looking at neighbors that are so far away is that they may not be good preditors about the behavior of the input output function at a given point 1 3 Evaluation 3 1 Performance metrics To evaluate the performance of a learnerh X Y following metrics can be used Error rate n Error rate 1X I yi 6 h xi n i 1 I e is the indicator function defined as follows I e 1 If e is true 0 If e is f alse In this course error rates could be of a training data b unknown new data i e validation test bootstrap or test data c unknown bootstrap 4 2 Generalization Cross validation Accuracy n Accuracy 1 Error rate 1X I yi h xi n i 1 Specificity Precision Recall 3 2 Design of evaluation metric It s useful to get familiar with common metrics that people use to evaluate the performance of a particular set of problems Both in theory and practice there are infinite combination of metrics for a particular problem When designing evaluation metrics we should consider Domain knowledge Tradeoffs Robustness In this course we will use existing evaluation metrics instead of spending time to design our own metrics But in practice machine learning algorithms are tightly coupled with knowledge of how to measure performance 4 Generalization 4 1 Typical data New data either validation test or bootstrap can be typical or adversarial Typical data comes from a fixed data generation engine Assume that samples xi and yi are drawn from distribution P This can be interpreted as xi yi iid P where iid stands for independent and identically distributed One good assumption of how the data gets generated is x P x Y X P Y X The data we observe are samples from the distribution Inspite of the fact that we are not able to see the distribution we can make valuable predictions for the entire population by training our algorithm with existing samples This is one of the key core ideas of machine learning 2 Generalization Cross validation 4 2 5 Example error rate computation Error of classifier h for data set D xi yi ni 1 n Error h D 1X I yi 6 h xi n i 1 Error for the population P Error h P E x y P I y 6 h x P y 6 h x When there are infinite samples in the data set the error rate of the data set converges to that of the population n Error h D 1X n I yi 6 h xi Error h P n i 1 Same applies to any perfermance metric i e Assuming R for risk function and U for utililty function We can perform the same deduction on the risk of a particular classifier h for a particular data set D R h Dn and the risk of h for a particular population P R h P In most cases to convert a risk to a utility we can simply use U h Dn R n Dn 4 3 Generalization error generalization Generalization error For a dataset of size n with a classifier hn if we calculate both the error rate for the population and the population generalization error is the difference between the two error rates That is Gn Error hn P Error hn Dn n In particular We say that a learner or hypothesis generalizes whenGn 0 This is also true for any risk or utility n R …

View Full Document