Homework Set Three
ECE 271A
Department of Electrical and Computer Engineering
University of California, San Diego
Nuno Vasconcelos
Fall 2007

Due November 15, 2007

1. In this problem we will consider the issue of linear regression and the connections between maximum likelihood and least squares solutions. Consider a problem where we have two random variables Z and X, such that

$$z = f(x, \theta) + \epsilon, \qquad (1)$$

where f is a polynomial with parameter vector $\theta$,

$$f(x, \theta) = \sum_{k=0}^{K} \theta_k x^k, \qquad (2)$$

and $\epsilon$ is a Gaussian random variable of zero mean and variance $\sigma^2$. Our goal is to compute the best estimate of the function given an iid sample $D = \{(D_x, D_z)\} = \{(x_1, z_1), \ldots, (x_n, z_n)\}$.

a) Formulate the problem as one of least squares, i.e. define $z = (z_1, \ldots, z_n)^T$,

$$\Phi = \begin{pmatrix} 1 & x_1 & \cdots & x_1^K \\ \vdots & & & \vdots \\ 1 & x_n & \cdots & x_n^K \end{pmatrix},$$

and find the value of $\theta$ that minimizes $\|z - \Phi\theta\|^2$.

b) Formulate the problem as one of ML estimation, i.e. write down the likelihood function $P_{Z|X}(z|x; \theta)$, and compute the ML estimate, i.e. the value of $\theta$ that maximizes $P_{Z|X}(D_z|D_x; \theta)$. Show that this is equivalent to a).

c) (The advantage of the statistical formulation is that it makes the assumptions explicit. We will now challenge some of these.) Assume that instead of a fixed variance $\sigma^2$ we now have a variance that depends on the sample point, i.e.

$$z_i = f(x_i, \theta) + \epsilon_i, \qquad (3)$$

where $\epsilon_i \sim N(0, \sigma_i^2)$. This means that our sample is independent but no longer identically distributed. It also means that we have different degrees of confidence in the different measurements $(z_i, x_i)$. Redo b) under these conditions.

d) Consider the weighted least squares problem where the goal is to minimize

$$(z - \Phi\theta)^T W (z - \Phi\theta),$$

where W is a symmetric matrix. Compute the optimal $\theta$ for this situation. What is the equivalent maximum likelihood problem? Rewrite the model (1), making explicit all the assumptions that lead to the new problem. What is the statistical interpretation of W?

e) The L2 norm is known to be prone to large estimation error if there are outliers in the training sample. These are training examples $(z_i, x_i)$ for which, due to measurement errors or other extraneous causes, $|z_i - \sum_k \theta_k x_i^k|$ is much larger than for the remaining examples (the inliers). In fact, it is known that a single outlier can completely derail the least squares solution, a highly undesirable behavior. It is also well known that other norms lead to much more robust estimators. One such distance metric is the L1 norm

$$L_1 = \sum_i \Big| z_i - \sum_k \theta_k x_i^k \Big|.$$

In the maximum likelihood framework, what is the statistical assumption that leads to the L1 norm? Once again, rewrite the model (1), making explicit all the assumptions that lead to the new problem. Can you justify why this alternative formulation is more robust? In particular, provide a justification for i) why the L1 norm is more robust to outliers, and ii) why the associated statistical model (1) copes better with them.
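For intuition, the least squares solution of a) is easy to check numerically once derived. Below is a minimal sketch in Python with NumPy (all data and names are illustrative, and it assumes $\Phi$ has full column rank so the normal equations have a unique solution):

    import numpy as np

    # Synthetic data from the model z = f(x, theta) + eps (illustrative values)
    rng = np.random.default_rng(0)
    K = 3                                   # polynomial degree
    theta_true = rng.normal(size=K + 1)     # theta_true[k] multiplies x^k
    x = rng.uniform(-1.0, 1.0, size=100)
    z = np.polyval(theta_true[::-1], x) + 0.1 * rng.normal(size=100)

    # Design matrix Phi with rows (1, x_i, ..., x_i^K)
    Phi = np.vander(x, K + 1, increasing=True)

    # Closed-form normal-equations solution theta = (Phi^T Phi)^{-1} Phi^T z
    theta_ne = np.linalg.solve(Phi.T @ Phi, Phi.T @ z)

    # Library least squares solver for comparison
    theta_ls, *_ = np.linalg.lstsq(Phi, z, rcond=None)
    assert np.allclose(theta_ne, theta_ls)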
2.

a) Problem 3.5.17 in DHS.

b) What is the ML estimate for $\theta$ in this problem? What is the MAP estimate for $\theta$ in this problem? Do you see any advantage in favoring one of the estimates over the other? How does that relate to the uniform prior that was assumed for $\theta$?

3. Consider problem 3 of the previous assignment, i.e. a random variable X such that $P_X(k) = \pi_k$, $k \in \{1, \ldots, N\}$, n independent observations from X, and a random vector $C = (C_1, \ldots, C_N)^T$ where $C_k$ is the number of times that the observed value is k (i.e. C is the histogram of the sample of observations). We have seen that C has multinomial distribution

$$P_{C_1,\ldots,C_N}(c_1, \ldots, c_N) = \frac{n!}{\prod_{k=1}^{N} c_k!} \prod_{j=1}^{N} \pi_j^{c_j}.$$

In this problem we are going to compute MAP estimates for this model. Notice that the parameters are probabilities and, therefore, not every prior will be acceptable here (we need $\pi_j > 0$ and $\sum_j \pi_j = 1$ for the prior to be valid). One distribution over vectors $\pi = (\pi_1, \ldots, \pi_N)^T$ that satisfies this constraint is the Dirichlet distribution

$$P_{\Pi_1,\ldots,\Pi_N}(\pi_1, \ldots, \pi_N) = \frac{\Gamma\big(\sum_{j=1}^{N} u_j\big)}{\prod_{j=1}^{N} \Gamma(u_j)} \prod_{j=1}^{N} \pi_j^{u_j - 1},$$

where the $u_j$ are a set of hyperparameters (parameters of the prior) and

$$\Gamma(x) = \int_0^{\infty} e^{-t} t^{x-1} \, dt$$

is the Gamma function.

a) Derive the MAP estimator for the parameters $\pi_i$, $i = 1, \ldots, N$, using the Dirichlet prior.

b) Compare this estimator with the ML estimator derived in the previous assignment. What is the use of this prior equivalent to, in terms of the ML solution?

c) What is the effect of the prior as the number of samples n increases? Does this make intuitive sense?

d) In this problem and problem 2 we have seen two ways of avoiding the computational complexity of computing a fully Bayesian solution: i) to rely on a non-informative prior, and ii) to rely on an informative prior and compute the MAP solution. Qualitatively, what do you have to say about the results obtained with the two solutions? What does this tell you about the robustness of the Bayesian framework?
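After deriving the estimator in a), a numerical check can be reassuring. The sketch below, in Python with NumPy and SciPy, maximizes the log-posterior $\sum_j (c_j + u_j - 1) \log \pi_j$ (likelihood exponent plus prior exponent, up to a constant) over the probability simplex for a toy histogram, so the result can be compared against the closed form; all values are illustrative:

    import numpy as np
    from scipy.optimize import minimize

    # Toy histogram c_1..c_N and Dirichlet hyperparameters u_j (illustrative)
    c = np.array([4.0, 1.0, 7.0, 2.0])
    u = np.array([2.0, 2.0, 2.0, 2.0])

    # Log-posterior up to an additive constant: sum_j (c_j + u_j - 1) log pi_j.
    # A softmax reparameterization pi = softmax(a) keeps pi on the simplex
    # while letting the optimizer work unconstrained.
    def neg_log_posterior(a):
        pi = np.exp(a - a.max())
        pi /= pi.sum()
        return -np.sum((c + u - 1.0) * np.log(pi))

    res = minimize(neg_log_posterior, np.zeros(len(c)), method="Nelder-Mead")
    pi_map = np.exp(res.x - res.x.max())
    pi_map /= pi_map.sum()
    print(pi_map)   # compare with the MAP estimator derived in a)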
4. (computer) (Note: this week's computer problem requires more computation time than the previous ones - about 10 hours by our estimates. For that reason, I will give you more time to finish it, i.e. it will be repeated in homework 4. Do not, however, take this as a reason not to start working on it right away. You may simply not have time to finish if you leave it to the days prior to the deadline.)

This week we will continue trying to classify our cheetah example. Once again we use the decomposition into 8 x 8 image blocks, compute the DCT of each block, and zig-zag scan. We also continue to assume that the class-conditional densities are multivariate Gaussians of 64 dimensions. The goal is to understand the benefits of a Bayesian solution. For this, using the training data in TrainingSamplesDCT_new_8.mat we created 4 datasets of the sizes given in the table below. They are available in the file TrainingSamplesDCT_subsets_8.mat.

Dataset   cheetah examples   grass examples
D1        75                 300
D2        125                500
D3        175                700
D4        225                900

We start by setting up the Bayesian model. To simplify things a bit we are going to cheat a little. With respect to the class-conditional,

$$P_{x|\mu,\Sigma} = G(x, \mu, \Sigma),$$

we assume that we know the covariance matrix (like Bayes might) but simply replace it by the sample covariance of the training set D that we are working with (and hope he doesn't notice). That is, we use

$$\Sigma = \frac{1}{N} \sum_{i=1}^{N} \left( x_i - \frac{1}{N}\sum_{j=1}^{N} x_j \right) \left( x_i - \frac{1}{N}\sum_{j=1}^{N} x_j \right)^T.$$

We are, however, going to assume unknown mean and a Gaussian prior of mean $\mu_0$ and covariance $\Sigma_0$,

$$P_\mu(\mu) = G(\mu, \mu_0, \Sigma_0).$$

Regarding the mean $\mu_0$, we assume that it is zero for all coefficients other than the first (DC) while
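The sample covariance above maps directly to a few lines of code. A minimal sketch in Python with NumPy and SciPy for one of the subsets follows; note that the variable name "D1_FG" for the cheetah examples of dataset D1 is a guess, since this preview does not say how the arrays are named inside the .mat file:

    import numpy as np
    from scipy.io import loadmat

    # Load one of the four training subsets; "D1_FG" is a hypothetical
    # variable name for the cheetah examples of dataset D1.
    data = loadmat("TrainingSamplesDCT_subsets_8.mat")
    X = data["D1_FG"]              # assumed shape (N, 64): one DCT vector per row

    # Sample covariance, exactly as in the formula above:
    # Sigma = (1/N) sum_i (x_i - x_bar)(x_i - x_bar)^T
    N = X.shape[0]
    x_bar = X.mean(axis=0)
    Xc = X - x_bar                 # centered data
    Sigma = (Xc.T @ Xc) / N        # 64 x 64 sample covariance

    # Gaussian prior on the mean: mu_0 is zero except for the first (DC)
    # coefficient, whose value is cut off in this preview.
    mu0 = np.zeros(64)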
