ICS273A: HW4
Due: March 5

Problem I: Learning theory.

One of the basic PAC assumptions is that the training and test data pairs $(x, y)$ are generated IID from some distribution $D = p(x, y)$. In this problem, we'll consider a case where the training and test distributions are different. Specifically, we'll look at the case where the training labels $y$ are noisy, but the test labels are not.

Let $y \in \{0, 1\}$ and let $D$ be the clean, uncorrupted distribution. Define $D_\tau$ to be a corrupted distribution over $(x, y)$, which is the same as $D$ except that the labels $y$ have some probability $0 \le \tau \le 0.5$ of being flipped. Thus, to sample from $D_\tau$, we would first sample $(x, y)$ from $D$, and then with probability $\tau$ (independent of the observed $(x, y)$) replace $y$ with $1 - y$. Thus $D_0 = D$.

The distribution $D_\tau$ captures the setting in which a human is labeling the training data, but has a probability $\tau$ of mislabeling a particular example. Ultimately, we are interested in evaluating a hypothesis $h$ with respect to the original, uncorrupted distribution $D$.

We define the generalization error with respect to $D_\tau$ to be
$$\varepsilon_\tau(h) = P_{(x, y) \sim D_\tau}[h(x) \neq y].$$
Note that $\varepsilon_0(h)$ is the generalization error with respect to the original, uncorrupted distribution; this is the error we would like to minimize.

A. For any hypothesis $h$, $\varepsilon_0(h)$ can be calculated as a function of $\varepsilon_\tau(h)$ and $\tau$. Derive a formula for $\varepsilon_0(h)$ in terms of $\varepsilon_\tau(h)$ and $\tau$.

B. Let $|H|$ be finite, and suppose our training set $S = \{(x^{(i)}, y^{(i)})\}$ is obtained by drawing $m$ IID samples from the corrupted distribution $D_\tau$. Suppose we pick the $h \in H$ that minimizes the empirical risk,
$$\hat{h} = \arg\min_{h \in H} \hat{\varepsilon}_S(h),$$
where $\hat{\varepsilon}_S(h)$ denotes the empirical risk measured on the dataset $S$. Also let $h^* = \arg\min_{h \in H} \varepsilon_0(h)$. Let any $\delta, \gamma$ be given. Prove that for
$$\varepsilon_0(\hat{h}) \le \varepsilon_0(h^*) + 2\gamma$$
to hold with probability at least $1 - \delta$, it suffices that
$$m \ge \frac{1}{2(1 - 2\tau)^2 \gamma^2} \log \frac{2|H|}{\delta}.$$
Note: this suggests that, roughly, $m$ examples that have been corrupted at a noise level of $\tau$ are worth about as much as $(1 - 2\tau)^2 m$ uncorrupted training examples.

C. What happens as $\tau$ approaches $0.5$?

Problem II: Kernels.

Consider a set of vectors $S = \{x^{(1)}, \ldots, x^{(m)}\}$. Let $k(x, y) = \phi(x)^T \phi(y)$, the inner product in some feature space. Let $K$ be the matrix whose $(i, j)$ element is $k(x^{(i)}, x^{(j)})$. Let $C$ denote the center of mass of the vectors in feature space: $C = \frac{1}{m} \sum_{i=1}^{m} \phi(x^{(i)})$.

A. Show that one can compute the distance $\|\phi(x^{(i)}) - \phi(x^{(j)})\|^2$ using $K$.

B. Show how to compute the squared norm of the center vector, $\|C\|^2$, using $K$.

C. Suppose we define a new feature space $\phi_c(x) = \phi(x) - C$. Show that the center of mass of the points in this space is at the origin.

D. Let $K_c$ be the kernel matrix corresponding to the mapping $\phi_c(x)$. Write $K_c$ in terms of $K$.

Problem III: [MATLAB] SVMs.

We will use a quadratic programming package to train a hard-margin kernel SVM on the classification data in hw4Train.dat and hw4Test.dat, using MATLAB's quadprog function to optimize the dual formulation.

A. To use the optimization package, we need to write the dual hard-margin SVM formulation as minimizing a quadratic function $\frac{1}{2} x^T H x + f^T x$ subject to a single linear inequality $A x \le b$ and a single linear equality $A_{eq} x = b_{eq}$. Explicitly define $H, f, A, b, A_{eq}, b_{eq}, x$ in terms of our variables $\{\alpha_i, x_i, y_i\}$ (see the MATLAB sketch following this problem).

B. Attempt to train a linear hard-margin SVM on hw4Train.dat using the above formulation. Explain why this cannot be done.

C. Use a Gaussian kernel $k_\lambda(x, y) = \exp\left(-\frac{1}{\lambda}\|x - y\|^2\right)$ with $\lambda = 0.1$. Visualize the decision boundary by plotting each test point with an 'x' if it is predicted to be 0, and an 'o' if it is predicted to be 1. Compute the test error rate.
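To make the quadprog interface concrete, here is a minimal MATLAB sketch of part C under one standard way of writing the dual. The data-file format, the remapping of the {0, 1} labels to {-1, +1}, and the variable names (Xtr, ytr, alpha, and so on) are assumptions rather than part of the assignment; check the H, f, A, b, Aeq, beq definitions against your own answer to part A.

    % Hard-margin kernel SVM via quadprog (sketch).
    % Assumes each row of hw4Train.dat is [features, label] with labels in {0, 1}.
    train = load('hw4Train.dat');
    Xtr = train(:, 1:end-1);
    ytr = 2 * train(:, end) - 1;            % remap {0, 1} -> {-1, +1}
    m   = size(Xtr, 1);

    lambda = 0.1;
    sq = sum(Xtr.^2, 2);
    D2 = sq + sq' - 2 * (Xtr * Xtr');       % pairwise squared distances
    K  = exp(-D2 / lambda);                 % Gaussian kernel Gram matrix

    % Dual hard-margin SVM in quadprog form:
    %   min_a 1/2 a'*H*a + f'*a   s.t.  A*a <= b,  Aeq*a = beq
    H   = (ytr * ytr') .* K;
    f   = -ones(m, 1);
    A   = -eye(m);   b   = zeros(m, 1);     % enforces alpha_i >= 0
    Aeq = ytr';      beq = 0;               % enforces sum_i alpha_i * y_i = 0
    alpha = quadprog(H, f, A, b, Aeq, beq);

    % Prediction for a row vector xnew; the bias is recovered from a support vector.
    sv   = find(alpha > 1e-6);
    b0   = ytr(sv(1)) - sum(alpha(sv) .* ytr(sv) .* K(sv, sv(1)));
    kfun = @(xnew) exp(-sum((Xtr - xnew).^2, 2) / lambda);
    predict = @(xnew) sign(sum(alpha .* ytr .* kfun(xnew)) + b0);   % -1/+1; map back to 0/1 for the 'x'/'o' plot

For part B, the same setup can be tried with the linear Gram matrix K = Xtr * Xtr' in place of the Gaussian one.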
Problem IV: [MATLAB] Face recognition and dimensionality reduction.

We will build a face recognizer from faceTrain.mat and faceTest.mat using a nearest neighbor classifier, and will use principal component analysis (PCA) to reduce the dimensionality of the raw image features. This face recognition algorithm is known as eigenfaces, and is competitive with state-of-the-art systems. We will also experiment with performing nonlinear component analysis with kernel PCA.

A. You can visualize the faces using the same code used for visualizing the digit images. There are 40 different people in this dataset, so this is a multiclass classification task with 40 classes. Construct a k-nearest neighbor classifier using Euclidean distance and score the misclassification rate on the test data. Select the optimal k by cross-validation. When k is even, you may need to break ties randomly at test time. This will be our baseline algorithm (see the first MATLAB sketch following this problem).

B. Reduce the dimensionality of the training data by performing PCA. Because the dimensionality of the data is quite high, use the MATLAB function svds to compute the eigenvectors iteratively. At first, project the data down to 10 dimensions. Visualize the first, second, and third principal components as images (see the second MATLAB sketch following this problem).

C. Pick a particular face from the training set and visualize its reconstruction using 1, 2, and 10 principal components.

D. Perform classification on the test data by also projecting the test data into the 10-dimensional space and then performing nearest neighbor classification. Experiment with the number of dimensions and select the optimal one through cross-validation.

E. Use kernel PCA to perform the dimensionality reduction. Recall that PCA, in one dimension, computes the direction $v$ that maximizes the variance of the linear projection $v^T x^{(i)}$. We derived in class that this can be solved with the eigenvector problem $\frac{1}{m} X^T X v = \lambda v$, where $X$ is the design matrix of the centered data points $x^{(i)}$. Assume that the vector $v$ lies in the span of the data, i.e., $v = \sum_i \alpha_i x^{(i)}$. We can compute the same direction in terms of the vector of coefficients $\alpha$. Show that this yields an eigenvector problem of the form $X X^T \alpha = \tilde{\lambda} \alpha$. What is $\tilde{\lambda}$ in terms of $\lambda$?

F. (i) Show how the above "α-based" version of PCA can be kernelized, i.e., turned into a procedure that computes the spanning coefficients of a vector $v$ in the feature space $\phi(x)$. (ii) Show how to compute the projection of a data point $x_{new}$ onto the principal component vector in the feature space.

Extra credit: Implement kernel PCA for your eigenface classifier. Note that the data needs to be centered in the feature space; you derived the expression for centering a kernel matrix in Problem II. Project down to 1, 2, and 5 dimensions in the feature space and build a nearest neighbor classifier. Does the nonlinear dimensionality reduction reduce the test error?
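Below is a minimal MATLAB sketch of the part A baseline. The variable names assumed to live inside faceTrain.mat and faceTest.mat (trainX, trainY, testX, testY), the number of folds, and the candidate grid of k values are assumptions, not part of the assignment.

    % k-NN baseline with Euclidean distance, with k chosen by 5-fold cross-validation.
    load faceTrain.mat;                        % assumed to define trainX (n x d), trainY (n x 1)
    load faceTest.mat;                         % assumed to define testX  (t x d), testY  (t x 1)

    kGrid  = 1:2:15;                           % candidate values of k (assumption)
    nFolds = 5;
    n      = size(trainX, 1);
    fold   = mod(randperm(n), nFolds) + 1;     % random fold assignment
    cvErr  = zeros(numel(kGrid), 1);
    for a = 1:numel(kGrid)
        for f = 1:nFolds
            tr = fold ~= f;  va = fold == f;
            pred = knnPredict(trainX(tr, :), trainY(tr), trainX(va, :), kGrid(a));
            cvErr(a) = cvErr(a) + mean(pred ~= trainY(va)) / nFolds;
        end
    end
    [~, best] = min(cvErr);

    pred    = knnPredict(trainX, trainY, testX, kGrid(best));
    errRate = mean(pred ~= testY);             % misclassification rate on the test set

    function pred = knnPredict(X, y, Q, k)
    % Majority-vote k-NN with Euclidean distance. Note that mode() breaks ties
    % deterministically (smallest label wins); randomize tied votes if k is even.
    pred = zeros(size(Q, 1), 1);
    sqX  = sum(X.^2, 2)';                      % 1 x n row of squared norms
    for i = 1:size(Q, 1)
        d2 = sqX + sum(Q(i, :).^2) - 2 * Q(i, :) * X';   % squared distances to training set
        [~, idx] = sort(d2);
        pred(i) = mode(y(idx(1:k)));
    end
    end

For part D, the same knnPredict helper can be reused on the PCA-projected features.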

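And a sketch of the PCA steps for parts B-D, reusing the trainX, testX, knnPredict, kGrid, and best names assumed above. The image height and width needed for the visualizations (imH, imW) are not given here, so those lines are left as comments.

    % PCA via svds on the centered training data (parts B-D).
    mu = mean(trainX, 1);
    Xc = trainX - mu;                          % center the data
    d  = 10;                                   % number of principal components
    [~, ~, V] = svds(Xc, d);                   % columns of V are the top-d principal directions

    % Part B: visualize a principal component as an image, e.g. the first one:
    % imagesc(reshape(V(:, 1), imH, imW)); colormap gray; axis image;

    % Part C: reconstruct one training face from its first r principal components.
    i = 1;  r = 2;
    z     = (trainX(i, :) - mu) * V(:, 1:r);   % r-dimensional code
    recon = z * V(:, 1:r)' + mu;               % back to pixel space
    % imagesc(reshape(recon, imH, imW)); colormap gray; axis image;

    % Part D: project both sets into the d-dimensional space and reuse knnPredict.
    Ztr  = (trainX - mu) * V;
    Zte  = (testX  - mu) * V;
    pred = knnPredict(Ztr, trainY, Zte, kGrid(best));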
