# ILLINOIS CS 446 - 083117.1_revised (5 pages)

- Pages: 5
- School: University of Illinois Urbana-Champaign
- Course: CS 446 - Machine Learning

**CS 446: Machine Learning, Fall 2017**

**Lecture 2: Generalization; Cross-validation**

*Lecturer: Sanmi Koyejo. Scribe: Hsiao-Ching Chang. Aug 31st, 2017.*

## Recap from Last Lecture

### Notation

We defined an instance space $\mathcal{X}$ and an output (label) space $\mathcal{Y}$. These are the spaces that examples $(x_i, y_i)$ live in; that is, $x_i \in \mathcal{X}$ and $y_i \in \mathcal{Y}$. The output label space can differ from task to task. Below is a list of common tasks and their label spaces:

- Binary classification: $\mathcal{Y} = \{-1, 1\}$, or sometimes $\mathcal{Y} = \{0, 1\}$
- Multi-class classification: $\mathcal{Y} = \{1, \dots, p\}$, where $p$ denotes the number of classes
- Regression: $\mathcal{Y} = \mathbb{R}$
- Probability regression: $\mathcal{Y} = [0, 1]$
- Multi-output regression: $\mathcal{Y} = \mathbb{R}^p$, where we try to predict a $p$-dimensional vector

The dataset is defined as $D = \{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}$. We also defined a learner (or hypothesis) $h: \mathcal{X} \to \mathcal{Y}$, which takes inputs from the input space $\mathcal{X}$ and makes predictions in the output space $\mathcal{Y}$.

### Nearest Neighbor Classifier

For the nearest neighbor algorithm we need to define two more things:

- A distance function, which takes a pair of inputs and maps it to a distance, i.e. a function $\mathcal{X} \times \mathcal{X} \to \mathbb{R}$. For example, for two instances $x_1, x_2$ of $n$ dimensions we can use the Minkowski distance, defined as

  $$\left( \sum_{i=1}^{n} \left| x_1^i - x_2^i \right|^p \right)^{1/p}$$

  where $p$ denotes the order of the metric and $x_1^i$ denotes the $i$-th element of instance $x_1$. The metric is called the Manhattan distance when $p = 1$ and the Euclidean distance when $p = 2$.
- The number of nearest neighbors, $k$.

With this setting, the nearest neighbor classifier $h$ can be defined as

$$h(x) = \mathrm{sign}\left( \frac{1}{k} \sum_{i \in N_k(x, D)} y_i \right)$$

where $N_k(x, D)$ denotes the $k$ nearest neighbors of $x$ in the dataset $D$.

The main inductive bias we see in nearest neighbor classifiers is spatial smoothness: nearby points have similar labels, so when we see an unknown instance we guess that it belongs to the majority class of its neighbors.

We also discussed the pros and cons of nearest neighbor classifiers.

Pros:

- Learning is free, since there is no learning or training step.
- It works well in practice.

Cons:

- Testing cost is high. Consider the nearest neighbor classifier $h$ we defined earlier, with the Euclidean distance and $\mathcal{X} = \mathbb{R}^d$. The computational complexity is $O(nkd)$ and the memory consumption is $O(nd)$. Note that we are estimating these complexities for a naive implementation; there exist tricks that make this classifier more efficient, such as k-d trees or hashing.
- It behaves badly in high dimensions. Suppose we have instance space $\mathcal{X} = \mathbb{R}^d$. When $d$ reaches 10,000 or more, problems arise. Such high-dimensional problems are actually common in applications: if we use bag-of-words vectors for spam classification, $d$ is the vocabulary size. Other cases include brain imaging data and computer vision applications.

## Evaluation: Performance Metrics

Consider a learner $h: \mathcal{X} \to \mathcal{Y}$. We want to discuss the performance of $h$. Some commonly used metrics are listed below:

- Error rate, which can be computed as $\frac{1}{n} \sum_{i=1}^{n} \mathbb{1}[y_i \neq h(x_i)]$
- Accuracy, which is $1 - \text{error rate}$, or $\frac{1}{n} \sum_{i=1}^{n} \mathbb{1}[y_i = h(x_i)]$
- Precision and recall
- Specificity, which tells how specific the classifier is in predicting positive instances

If we are designing an evaluation metric, there are some things to consider: incorporation of domain knowledge, trade-offs, and robustness. There are many performance metrics, but we will focus on error rates in this class.

The types of data on which we are interested in computing error rates include training data, test data, and perhaps bootstrap data. Of all of these, we care most about the unseen test data. We could even say the most important concept in machine learning is maximizing the performance metric on new data. New data can be either typical or adversarial; we will focus on typical data and will not spend much time on adversarial data in this class.

## Data Generation

When we say data is typical, we assume there is a fixed data generation engine, and that we can draw samples $(x_i, y_i)$ i.i.d. from a population $P$. That is,

$$x \sim P_X, \qquad y \mid x \sim P_{Y \mid X}.$$

For example, if we have a learner $h$ and a dataset $D = \{(x_i, y_i)\}_{i=1}^{n}$ generated from the population $P$, we can compute the error rate on this dataset as

$$\mathrm{Error}(h, D) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}[y_i \neq h(x_i)]$$

and the error rate on the whole population $P$ as

$$\mathrm{Error}(h, P) = \mathbb{E}_{(x, y) \sim P}\big[\mathbb{1}[y \neq h(x)]\big] = P(y \neq h(x)).$$

If we generate infinitely many samples, the error rate on the dataset converges to the error rate on the population. That is, as $n \to \infty$,

$$\frac{1}{n} \sum_{i=1}^{n} \mathbb{1}[y_i \neq h(x_i)] \to \mathrm{Error}(h, P).$$

The same applies to any performance metric. Generally, risk functions $R$ and utility functions $U$ are the two kinds of functions we consider, and the same idea applies to both, e.g. measuring $R(h, D_n)$ and $R(h, P)$. Since the two functions are the inverse of each other, we can simply use $U(h, D_n) = -R(h, D_n)$ in most cases. Take accuracy and error rate for example: accuracy is a utility function and error rate is a risk function, and we have $\text{accuracy} = 1 - \text{error rate}$.

## Generalization

### Generalization Error

The generalization error $G_n$ can be computed as

$$G_n = \mathrm{Error}(h_n, P) - \mathrm{Error}(h_n, D_n),$$

where $n$ denotes the number of samples and $h_n$ is the learner trained on the $n$-sample dataset $D_n$ drawn from the population $P$. We say that a learner $h$ generalizes when, as $n \to \infty$, $G_n \to 0$. Similarly, we can apply this to any risk or utility function: for instance, as $n \to \infty$, $R(h_n, D_n) \to R(h_n, P)$.

### Approximation

Since we never see $P$ in practice, we need some tricks to approximate the performance. Some approaches include:

- Train/test split: the data is split into a training set and a test set. We train the learner on the training set and use the test set to evaluate out-of-sample performance.
- Cross-validation: leave-one-out cross-validation, k-fold cross-validation, and many others.
- Bootstrap.

We will briefly go through how k-fold cross-validation works. Suppose we have data $D$. We first separate $D$ into $k$ partitions $D_1, \dots, D_k$. For each subset $D_i$, we train on $D \setminus D_i$ and then evaluate the trained learner on $D_i$. After iterating over all $k$ subsets, we approximate the final performance by averaging the $k$ per-fold performances.

When we have a lot of data or limited computation, we typically choose the train/test split. On the other hand, when we have limited data or more computation available, we often choose cross-validation.
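The nearest-neighbor classifier and Minkowski distance recapped above can be sketched in a few lines of Python. This is a minimal illustration using NumPy, not the course's code: the function names are mine, labels are assumed to be in {-1, +1}, and an odd `k` avoids ties.

```python
import numpy as np

def minkowski(x1, x2, p=2):
    """Minkowski distance of order p: p=1 is Manhattan, p=2 is Euclidean."""
    return float(np.sum(np.abs(x1 - x2) ** p) ** (1.0 / p))

def knn_predict(x, X_train, y_train, k=3, p=2):
    """Predict the sign of the mean label of the k nearest training points
    (binary labels in {-1, +1}, matching the formula in the notes)."""
    dists = np.array([minkowski(x, xi, p) for xi in X_train])
    nearest = np.argsort(dists)[:k]   # indices of N_k(x, D)
    return int(np.sign(y_train[nearest].mean()))
```

This naive loop computes all $n$ distances for every query, which is exactly the high test-time cost discussed above; k-d trees or hashing would reduce it.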
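The convergence of the empirical error rate to the population error can also be checked numerically. The snippet below is a sketch under made-up assumptions: the "population" is a synthetic threshold rule on $x \sim \mathrm{Uniform}(0,1)$ with 10% label noise, so the best achievable population error is 0.10, and all names are hypothetical.

```python
import numpy as np

def error_rate(h, X, y):
    """Empirical error rate: fraction of examples where h disagrees with the label."""
    preds = np.array([h(x) for x in X])
    return float(np.mean(preds != y))

rng = np.random.default_rng(0)

def sample(n):
    """Draw n i.i.d. samples from the synthetic population described above."""
    x = rng.uniform(0, 1, n)
    y = np.where(x > 0.5, 1, -1)        # noiseless rule
    flip = rng.uniform(0, 1, n) < 0.10  # flip each label with probability 0.10
    return x, np.where(flip, -y, y)

h = lambda x: 1 if x > 0.5 else -1  # the best possible rule for this population

x_big, y_big = sample(100_000)
print(error_rate(h, x_big, y_big))  # close to the population error of 0.10
```

As $n$ grows, the printed empirical error concentrates around $\mathrm{Error}(h, P) = 0.10$, illustrating the limit stated in the notes.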
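The k-fold cross-validation procedure just described can be sketched from scratch. This is a hypothetical helper, not the course's implementation: `train_fn` fits a learner on a training split and `error_fn` evaluates it on the held-out fold.

```python
import numpy as np

def k_fold_cv(X, y, k, train_fn, error_fn):
    """Split the data into k folds; for each fold D_i, train on the
    remaining k-1 folds, evaluate on D_i, and average the k errors."""
    n = len(X)
    folds = np.array_split(np.arange(n), k)   # index sets for D_1, ..., D_k
    errors = []
    for fold in folds:
        mask = np.ones(n, dtype=bool)
        mask[fold] = False                    # True on the training portion
        h = train_fn(X[mask], y[mask])        # fit on everything except D_i
        errors.append(error_fn(h, X[fold], y[fold]))
    return float(np.mean(errors))
```

Leave-one-out cross-validation is the special case $k = n$, trading more computation for a lower-variance estimate on small datasets.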
