Final, ICS-273A, Intro ML, Tu. Jun. 09, 4 - 6 p.m.
Instructor: Max Welling
CLOSED BOOK
Total 100 points

1. (20 pts) Miscellaneous

a. (5 pts) List at least 4 different classification algorithms we have discussed in class.
A: Logistic regression, k-nearest neighbors, decision trees, boosting, neural networks, support vector machines. If you include Fisher-LDA you have to argue how you would do classification with it.

b. (5 pts) Complete the following sentence: "Random forests are decision trees combined with ....".
A: "bagging"

c. (5 pts) True or False: The K-means algorithm almost always converges, but in exceptional cases it might oscillate between different cluster assignments and never converge.
A: False; K-means must necessarily converge.

d. (5 pts) True or False: Because we have to choose the number of neighbors, K, in a K-nearest neighbor method, it is a parametric procedure.
A: False: nonparametric methods may still have parameters. KNN uses all the data to make predictions, and the complexity of its decision boundary grows with the size of the dataset. Hence we call it a nonparametric method.

2. (15 pts) Kernel SVMs

Consider the prediction rule for an SVM: $y^* = \mathrm{sign}\big[\sum_{m \in SV} \alpha_m y_m K(x_m, x^*)\big]$, where $SV$ is the set of support vectors, $y_m = \pm 1$ are the classification labels, $K$ is the kernel, and $\alpha_m$ are the weights returned by the SVM dual optimization algorithm.

a. (5 pts) Consider a training case that was on the correct side of the decision boundary (the SVM would predict its class correctly). Is this training case part of the set $SV$?
A: No, its value of $\alpha$ would be zero and it would therefore be excluded from the set $SV$.

b. (5 pts) Suppose we have a small finite set of input features $\phi_a(x)$, $a = 1..A$. If we were looking for linear classification boundaries, what is the expression for the kernel in terms of $\phi$?
A: $K(x_n, x_m) = \sum_a \phi_a(x_n)\,\phi_a(x_m) = \phi(x_n)^T \phi(x_m)$

c. (5 pts) Suppose we are using a radial basis function kernel (a Gaussian kernel): $K(x_n, x_m) = c \exp\big[-\tfrac{1}{2}\|x_n - x_m\|^2/\sigma^2\big]$. The RBF kernel has a parameter $\sigma$ for the width of the kernel (similar to the standard deviation of a Gaussian distribution). What would happen if you optimized this parameter on the training set? Explain what the training set performance would be and what the possible effect on the test set performance could be. Briefly describe an alternative method to estimate a good value for $\sigma$.
A: $\sigma$ would go to 0. Each training case would then be dissimilar to every other training case (the kernel between distinct points goes to zero), so prediction of the training labels would be perfect. However, this is overfitting and could deteriorate the test error. Use cross-validation instead to estimate $\sigma$.
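The cross-validation answer in 2c can be made concrete. Below is a minimal sketch, assuming scikit-learn's RBF-kernel SVM; there $\gamma = 1/(2\sigma^2)$, so searching over gamma is equivalent to searching over the kernel width $\sigma$. The synthetic dataset and the candidate grid of widths are illustrative assumptions, not part of the exam.

```python
# Sketch: choose the RBF width sigma by cross-validation instead of fitting it
# on the training set (which would drive sigma -> 0 and overfit).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

sigmas = np.array([0.1, 0.3, 1.0, 3.0, 10.0])        # candidate kernel widths
param_grid = {"gamma": 1.0 / (2.0 * sigmas ** 2)}    # sklearn's gamma = 1/(2 sigma^2)

# 5-fold cross-validation scores each width on held-out folds and keeps the one
# that generalizes best, rather than the one that memorizes the training set.
search = GridSearchCV(SVC(kernel="rbf", C=1.0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

The selected gamma maps back to a width via $\sigma = 1/\sqrt{2\gamma}$.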
3. (25 pts) Principal Component Analysis (PCA)

Suppose we have a dataset $X_{in}$ with $A$ attributes, $i = 1:A$, and $N$ data-cases, $n = 1:N$. Assume the data is centered (zero mean): $\sum_{n=1}^N X_{in} = 0\ \forall i$.

a. (5 pts) Provide the expression for the sample covariance matrix, $S$, as a function of $X$.
A: $S_{ij} = \frac{1}{N}\sum_{n=1}^N X_{in} X_{jn}$, or $S = XX^T/N$.

b. (10 pts) Given is an eigenvalue decomposition of $S$, namely $S = U \Lambda U^T$. What is the expression for the linear projection $L$ that maps the data $X$ onto a $K$-dimensional subspace, $Y = LX$, such that the projected data $Y$ has zero mean and sample covariance $I$ (identity)? Prove that your answer is correct.
A: $L = \Lambda_k^{-1/2} U_k^T$. To prove the claim, compute the mean, $\sum_{n=1}^N Y_{in} = L \sum_{n=1}^N X_{in} = 0$, and the covariance, $S_Y = YY^T/N = \Lambda_k^{-1/2} U_k^T X X^T U_k \Lambda_k^{-1/2}/N = \Lambda_k^{-1/2} U_k^T U \Lambda U^T U_k \Lambda_k^{-1/2} = I_k$.

c. (5 pts) Approximate $X$ with the projection $Z = U_k U_k^T X$. Express the L2-error $e = \|X - Z\|^2$ in terms of the eigenvalues $\Lambda$. You don't have to prove your answer; a geometrical argument is also fine.
A: $Z$ is the projection of $X$ onto the subspace spanned by the eigenvectors with the $K$ largest eigenvalues. This means that the variance in the orthogonal directions is removed. That variance is the sum of the eigenvalues beyond the $K$-th, hence the answer is $e = \sum_{l>K} \Lambda_{ll}$.

d. (5 pts) Is PCA a supervised or unsupervised method?
A: Unsupervised: there are no labels.

4. (20 pts) Fisher Linear Discriminant Analysis (FLDA)

Suppose we have a dataset $\{X_{in}, Y_n\}$ with $A$ attributes, $i = 1:A$, and $N$ data-cases, $n = 1:N$. $Y$ represents the labels, given as $\pm 1$. Assume the data is centered (zero mean): $\sum_{n=1}^N X_{in} = 0\ \forall i$.

a. (5 pts) Provide the objective function, $J(w)$, for FLDA, where $w$ is the projection vector. Express every term (such as scatter matrices) as a function of the data $X, Y$. In other words, given some $w$, explain how you can compute $J(w)$ from the data.
A: $J = w^T S_B w / w^T S_W w$, where $S_B$ is the between-groups scatter matrix and $S_W$ the within-groups scatter matrix. See the class notes for their definitions.

b. (5 pts) Reformulate this objective as a constrained quadratic optimization problem and derive the KKT conditions.
A: Note that you can scale $w \to \alpha w$ and thus set $w^T S_W w = 1$. You can then maximize the numerator of $J$ only, subject to $w^T S_W w = 1$. The KKT conditions are $S_B w = \lambda S_W w$ and $w^T S_W w = 1$.

c. (5 pts) Now solve for $w$ by transforming the KKT conditions into an ordinary eigenvalue equation. Express the solution for $w$ in terms of the solution to that eigenvalue equation.
A: $S_B^{1/2} S_W^{-1} S_B^{1/2} [S_B^{1/2} w] = \lambda [S_B^{1/2} w]$; solve this ordinary eigenvalue problem for $v = S_B^{1/2} w$ and then invert to get $w$.

d. (5 pts) Give an example of data in 2 dimensions where the first principal component and the Fisher-LDA projection would be orthogonal.
A: Two long and skinny distributions lying parallel to each other. PCA will choose the direction along the cigar shapes, while FLDA will pick the direction orthogonal to it.

5. (20 pts) Reporting Classification Results

In the following we will be estimating the test errors of two learning algorithms, $A_1$ and $A_2$. To do so, we have created $M$ datasets of $N$ training data-cases each, randomly sub-sampled from a much bigger dataset. Our procedure is to apply both algorithms to all $M$ datasets and compute their classification errors on an independent set of test data-cases. Call $e_{1k}, e_{2k}$ the errors made by algorithms $A_1, A_2$ in experiment $k$.

a. (5 pts) Explain what could go wrong if, instead of an independent test set, we used the same training data to compute the classification errors $e_{1k}, e_{2k}$.
A: You could produce highly overconfident results. In fact, you could drive the errors to zero for any sufficiently complex learning machine and think you had invented the best learning algorithm to date.

b. (5 pts) Call $\delta$ the difference between the two mean errors: $\delta = m_1 - m_2$, with $m_j = \sum_k e_{jk}/M$. Now suppose that the variance of $\delta$, $v_\delta$, is much larger than $\delta^2$, e.g. $v_\delta = 10\,\delta^2$. Does this imply that we must conclude that the learning ...
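The answer for 5b is cut off above, so no conclusion is drawn here; the sketch below only computes the quantities that 5b defines, assuming the $M$ experiments are paired (both algorithms run on the same sub-sampled training sets, as the problem describes) and using the paired differences to estimate $v_\delta$ (one common choice; the exam's exact definition is not shown). The error values are made up for illustration.

```python
# Sketch: mean errors m_1 and m_2, their difference delta, and a paired
# estimate of delta's variance over the M experiments. The error arrays
# are made-up numbers, not results from the exam.
import numpy as np

rng = np.random.default_rng(0)
M = 20
e1 = 0.20 + 0.05 * rng.standard_normal(M)   # errors of algorithm A_1 in the M experiments
e2 = 0.21 + 0.05 * rng.standard_normal(M)   # errors of algorithm A_2 in the M experiments

m1, m2 = e1.mean(), e2.mean()               # m_j = sum_k e_jk / M
delta = m1 - m2                             # difference of the mean errors

# Because the experiments are paired, estimate the variance of delta from the
# per-experiment differences d_k = e_1k - e_2k.
d = e1 - e2
v_delta = d.var(ddof=1) / M

print(f"delta = {delta:.4f}, delta^2 = {delta**2:.6f}, v_delta = {v_delta:.6f}")
```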

