CS446: Machine Learning, Fall 2017
Lecture 1: Intro to ML, Nearest Neighbor Classifiers
Lecturer: Sanmi Koyejo        Scribe: Davis Freimanis, Aug. 29th, 2017

Machine Learning Introduction, Basics, Key Ideas

Machine learning is everywhere. At the beginning of the lecture, the audience suggested a number of machine learning algorithms, shown in Figure 1. (Gradient descent was also mentioned, but it is not an algorithm by itself; it is a way to optimize algorithms and will be covered later in this class.)

Figure 1: Algorithms

Sanmi organized the suggested algorithms by placing unsupervised (generative) algorithms on the bottom, supervised algorithms on the top, probabilistic algorithms on the left, and nonprobabilistic algorithms on the right. The suggested algorithms do not cover all of machine learning, but they cover quite a bit, and this is only a broad categorization: there are algorithms that fall between these categories.

Probabilistic algorithms are built around learning some kind of probability distribution as a way to do machine learning, while nonprobabilistic algorithms do not involve learning a distribution directly. We will cover both probabilistic and nonprobabilistic algorithms in this class.

In supervised learning, the data is presented as D = {(x_i, y_i)}, where x_i is the input and y_i is its label. In unsupervised learning, the data does not include labels and is presented as D = {x_i}. The goal in supervised learning is to learn a mapping from inputs x to outputs y given the data set; the goal in unsupervised learning is to find some kind of structure in the x's. There are also semi-supervised algorithms, which use a mix of labeled and unlabeled data. Reinforcement learning was mentioned as well, but it will not be covered in this class unless we have time at the end; other classes cover it, e.g. the AI course (CS440).

Learning

In this section we cover the following learners: the constant function, memorization, and k-nearest neighbors.

We want to build a learner for supervised learning and are given the data D = (x_1, y_1), ..., (x_n, y_n), where each item is an input/output pair. The goal is to learn some function h such that h(x_{n+1}) gives an accurate prediction \hat{y}_{n+1}.

Constant function

One of the simplest learners for this problem is a constant function h(x) = c, where c is a constant and each y_i is binary, taking the values ±1. The constant could be chosen as the average or the most popular label in the data set. One good property of this learner is its low variance: trained on new data sets, it gives a fairly consistent answer. For the same reason, however, it does not adapt to the problem we are trying to solve.

Memorization

A smarter learner is memorization, which can be written as

h(x) = \begin{cases} y(x) & \text{if } x \text{ has been seen before} \\ ? & \text{if } x \text{ is new} \end{cases}

Given an input point, the algorithm looks it up in a database and checks whether it has been seen before; the assumption is that if the point has been seen before, it makes sense for it to have the same label.

Memorization has a few pros and cons. The main pro is that it requires no training. The big con is that it cannot predict a label for new data (no generalization). Testing is expensive and gets slower as the data set grows, and it requires a lot of storage to hold and look up the data.
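To make these two baseline learners concrete, here is a minimal sketch in Python (the language of the scikit-learn demo later in the lecture). The class names, the choice of the most popular label as the constant c, and the tiny data set are illustrative assumptions, not code from the lecture.

import numpy as np


class ConstantClassifier:
    # h(x) = c for every x; here c is chosen as the most popular label in the training set.
    def fit(self, X, y):
        labels, counts = np.unique(y, return_counts=True)
        self.c = labels[np.argmax(counts)]
        return self

    def predict(self, X):
        return np.full(len(X), self.c)


class MemorizingClassifier:
    # h(x) = y(x) if x was seen during training, '?' otherwise (no generalization).
    def fit(self, X, y):
        self.table = {tuple(x): label for x, label in zip(X, y)}
        return self

    def predict(self, X):
        return [self.table.get(tuple(x), "?") for x in X]


# Tiny usage example with binary labels in {-1, +1}.
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
y_train = np.array([-1, +1, +1])

print(ConstantClassifier().fit(X_train, y_train).predict([[5.0, 5.0]]))  # the most popular label, +1
print(MemorizingClassifier().fit(X_train, y_train).predict([[1.0, 1.0], [3.0, 3.0]]))  # seen point's label, then '?' for a new point

Note that MemorizingClassifier stores the entire training set and answers '?' for anything it has not seen, which is exactly the storage and generalization problem described above.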
Is seeing x before enough to be 100% confident that the prediction is accurate? We will talk about this in more detail over the next couple of lectures, but the short answer is that it is not enough, because the same x can appear with different labels.

Nearest Neighbor Classifier

For the nearest neighbor classifier (NN), a demo was presented using scikit-learn (Pedregosa et al., 2011) and a Jupyter notebook. The demo was a modification of the Nearest Neighbors Classification example from scikit-learn (a rough sketch of the demo appears at the end of these notes).

The data set used in the example was the Iris data set (Wikipedia, 2017), which consists of 3 different species of Iris described by 4 features (the lengths and widths of the sepal and petal); for the sake of simplicity and visualization we looked at only 2 of the features.

In the first example, shown in Figure 2, we have a 3-class classification with k = 1, where k is the number of nearest neighbors checked. For every prediction point, we give it the label of the nearest point in the training set and make a prediction based on that single point. The contours in Figure 2 come from making a prediction at every point across the 2D grid.

Figure 2: 3 classes, k = 1

For the second example, 2 of the classes were merged to create a 2-class (binary) classification with k = 1, k = 10, and k = 100; this is why the algorithm is sometimes called k-nearest neighbors (kNN). In Figure 3a we can see the classification boundary. When we increase to k = 100, as seen in Figure 3b, we get an interesting result because k is large relative to the number of samples in the training set: there are more red points than blue points, so red points form the majority of the nearest neighbors across the 2D grid.

Figure 3: 2 separated classes. (a) 2 classes, k = 10; (b) 2 classes, k = 100; (c) accuracy graph for k = 1 to 99

In Figure 3c we see what happens as we go from 1 to 99 nearest neighbors in the previous example. The accuracy drops as k grows because we end up predicting the majority class every time, which results in bad predictions. If we instead scramble the two classes together, as shown in Figure 4a, we get the graph shown in Figure 4b: in the beginning the labeling is very noisy and we make a lot of mistakes, but as we increase the number of neighbors we make better predictions. This shows that a clear separation of the points is not required for kNN classification, although kNN is biased toward well-separated data sets.

Figure 4: 2 scrambled classes. (a) 2 classes, k = 10; (b) accuracy graph for k = 1 to 99

The performance of kNN depends on parameters that we set rather than learn; these are called hyperparameters. For every algorithm we look at in this class, we will see that performance depends on the hyperparameters. For kNN, one of the hyperparameters is the distance formula (the number of neighbors k is another), since performance depends heavily on which distance we choose.

More formally, we can define kNN as

h(x) = \mathrm{sign}\left( \frac{1}{k} \sum_{z \in N_k(x, D)} y(z) \right)

where N_k(x, D) is the set of the k nearest points to x in D and h(x) is the prediction. The idea is to take a majority vote over the labels of the k nearest neighbors.
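To connect this definition to code, here is a minimal from-scratch sketch of the kNN rule for binary labels in {-1, +1}. The Euclidean distance and the tie-breaking toward +1 are assumptions; the lecture only notes that the distance formula is a hyperparameter.

import numpy as np


def knn_predict(x, X_train, y_train, k=1):
    # h(x) = sign( (1/k) * sum of the labels of the k nearest training points ).
    dists = np.linalg.norm(X_train - x, axis=1)  # Euclidean distance to every training point (assumed metric)
    nearest = np.argsort(dists)[:k]              # indices of N_k(x, D), the k nearest neighbors
    avg = y_train[nearest].mean()                # (1/k) * sum over z in N_k(x, D) of y(z)
    return 1 if avg >= 0 else -1                 # sign(.), breaking ties toward +1 (assumption)


# Tiny usage example: two well-separated clusters labeled -1 and +1.
X_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                    [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_train = np.array([-1, -1, -1, +1, +1, +1])

print(knn_predict(np.array([0.5, 0.5]), X_train, y_train, k=3))  # -1, near the first cluster
print(knn_predict(np.array([5.5, 5.5]), X_train, y_train, k=3))  # +1, near the second cluster

The scikit-learn demo itself could be reproduced roughly as follows. The lecture notebook is not part of these notes, so the train/test split, the way two of the Iris species are merged, and the particular values of k printed below are assumptions.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load Iris and keep only 2 of the 4 features, as in the lecture visualizations.
iris = load_iris()
X = iris.data[:, :2]
y = iris.target

# Merge two of the three species to get a binary problem (assumed merge: species 1 and 2).
y_binary = np.where(y == 0, -1, +1)

X_train, X_test, y_train, y_test = train_test_split(X, y_binary, test_size=0.3, random_state=0)

# Test accuracy for a few values of k; sweeping k from 1 to 99 and plotting the scores
# would give a curve like the accuracy graphs in Figures 3c and 4b.
for k in (1, 10, 50, 99):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(k, clf.score(X_test, y_test))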

