DRAFT — a final version will be posted shortly

COS 424: Interacting with Data
Lecturer: David Blei        Lecture 4/3        Scribe: Jeehyung Lee

1 Squared bias and variance of estimates

Given data (x_{1:N}, y_{1:N}), the maximum likelihood estimate is

    \hat{B} = \arg\max_B \log p(y_{1:N} \mid x_{1:N}, B).

Suppose we know the true value B, and we sample random data from the Gaussian model with that true B. The estimate \hat{B} computed from this data is not necessarily B.

Now suppose we observe a new input/output pair (x_o, y_o). The squared error of the estimate of y_o is

    (\hat{B} x_o - B x_o)^2,

which indicates how close the prediction of the estimate \hat{B} is to that of the true B.

Treating \hat{B} as a random variable, we can compute the mean squared error of the estimate of y_o. Let E_D[\hat{B}] denote E[\hat{B}(D)], where D is the distribution of datasets from which \hat{B} is estimated. Then

    MSE(\hat{B} x_o) = E_D[(\hat{B} x_o - B x_o)^2].

Expanding this (remember that B and x_o are fixed values here),

    MSE(\hat{B} x_o) = E_D[(\hat{B} x_o)^2] - 2 E_D[\hat{B} x_o] B x_o + (B x_o)^2.

Adding the zero term E_D[\hat{B} x_o]^2 - E_D[\hat{B} x_o]^2,

    MSE(\hat{B} x_o) = E_D[(\hat{B} x_o)^2] - 2 E_D[\hat{B} x_o] B x_o + (B x_o)^2 + E_D[\hat{B} x_o]^2 - E_D[\hat{B} x_o]^2,

which is equivalent to

    MSE(\hat{B} x_o) = (E_D[\hat{B} x_o] - B x_o)^2 + (E_D[(\hat{B} x_o)^2] - E_D[\hat{B} x_o]^2).

The first term is the squared bias, Bias^2(\hat{B}), and the second term is the variance of the estimates, Var(\hat{B}). (A small simulation checking this decomposition appears at the end of these notes.)

By the Gauss-Markov theorem, the MLE (which coincides with least squares under the Gaussian model) is the unbiased linear estimator with the smallest variance. In other words, if \hat{B} is the MLE, the squared bias is 0 and the variance is the smallest among unbiased linear estimators.

The prediction error, defined by

    E_D[ E_{y_o}[(\hat{B} x_o - y_o)^2] ],

is equal to

    \delta^2 + Var(\hat{B}) + Bias^2(\hat{B}),

where \delta^2 is the variance of the data (y_o \sim N(B x_o, \delta^2)).

2 Regularization

The basic idea of regularization is to trade off Var(\hat{B}) and Bias^2(\hat{B}) by placing constraints on \hat{B}. This has three advantages:

• Encourages smaller and simpler models
• Makes the model robust to overfitting
• Makes the model more interpretable

One way to do this is ridge regression: optimize the RSS subject to a constraint s on the squared sum of the coefficients. As s becomes bigger, we have a better chance of reducing the bias, but we suffer a bigger variance.

The ridge estimate \hat{B} is obtained by solving

    \hat{B}_{ridge} = \arg\min_B \sum_{n=1}^{N} \frac{1}{2} (y_n - B x_n)^2 + \lambda \sum_{i=1}^{p} B_i^2.

(The term \lambda determines the size of the "ball" constraining B.) This is a convex problem and can be solved efficiently.

As for the choice of \lambda, we choose it by cross-validation to minimize test error: for each candidate value of \lambda (e.g., a grid between 0 and 1) and for each fold, calculate \hat{B}_{ridge} on the training portion and compute the average error on the held-out samples. We choose the \lambda that minimizes the average error. (See the sketches at the end of these notes.)

3 Bayesian Statistics

    Parameter   \theta \sim G_0(\alpha)
    Data        y_n \sim F(\theta)
    Posterior   p(\theta \mid y_{1:N}, \alpha)

where G_0(\alpha) is a prior distribution and \alpha is called a "hyperparameter".

To calculate the MLE, one chooses \theta to maximize the likelihood of y_{1:N}. Bayesian estimates, in contrast, give up unbiasedness to reduce variance.
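As a numerical check of the decomposition in Section 1, here is a minimal simulation sketch in Python/NumPy. It is not part of the lecture; the true coefficient, noise level, sample size, and test input below are made-up values. Many datasets are drawn from the same Gaussian model, \hat{B} is fit by maximum likelihood (least squares) on each, and the empirical MSE at x_o is compared with Bias^2 + Var.

import numpy as np

rng = np.random.default_rng(0)

B_true = 2.0        # assumed true coefficient (made-up value)
delta = 1.0         # noise standard deviation: y ~ N(B x, delta^2)
N = 50              # samples per dataset
n_datasets = 5000   # number of simulated datasets D
x_o = 1.5           # fixed new input

B_hats = np.empty(n_datasets)
for d in range(n_datasets):
    x = rng.normal(size=N)
    y = B_true * x + delta * rng.normal(size=N)
    # MLE under the Gaussian model = least-squares estimate (1-D, no intercept)
    B_hats[d] = (x @ y) / (x @ x)

pred = B_hats * x_o
mse = np.mean((pred - B_true * x_o) ** 2)     # E_D[(B_hat x_o - B x_o)^2]
bias2 = (np.mean(pred) - B_true * x_o) ** 2   # (E_D[B_hat x_o] - B x_o)^2
var = np.var(pred)                            # E_D[(B_hat x_o)^2] - E_D[B_hat x_o]^2

print(mse, bias2 + var)   # the two values agree up to floating-point error

With the MLE the bias term comes out essentially zero, consistent with the Gauss-Markov discussion above.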

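For the ridge objective in Section 2, the minimizer has a closed form obtained by setting the gradient to zero: (X^T X + 2\lambda I) B = X^T y, where the factor 2 comes from the 1/2 in front of the squared-error term. The following sketch is illustrative only; the data and coefficient values are made up.

import numpy as np

def ridge_fit(X, y, lam):
    # Solves the stationarity condition of
    #   (1/2) * sum_n (y_n - B x_n)^2 + lam * sum_i B_i^2,
    # namely (X^T X + 2*lam*I) B = X^T y.
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + 2.0 * lam * np.eye(p), X.T @ y)

# made-up data for illustration
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
B_true = np.array([1.0, 0.0, -2.0, 0.0, 0.5])
y = X @ B_true + rng.normal(size=100)

print(ridge_fit(X, y, lam=0.0))    # lam = 0 recovers ordinary least squares
print(ridge_fit(X, y, lam=10.0))   # larger lam shrinks the coefficients toward 0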

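Finally, a sketch of choosing \lambda by cross-validation as described at the end of Section 2. The grid of candidate values, the number of folds, and the data are arbitrary choices for illustration; ridge_fit is the same function as in the previous sketch.

import numpy as np

def ridge_fit(X, y, lam):
    # (X^T X + 2*lam*I) B = X^T y, as in the previous sketch
    return np.linalg.solve(X.T @ X + 2.0 * lam * np.eye(X.shape[1]), X.T @ y)

def cv_error(X, y, lam, n_folds=5):
    # Average held-out squared error of ridge regression for one value of lam.
    # (The data here is i.i.d., so folds are taken in order; shuffle indices for real data.)
    folds = np.array_split(np.arange(X.shape[0]), n_folds)
    errs = []
    for held_out in folds:
        train = np.setdiff1d(np.arange(X.shape[0]), held_out)
        B_hat = ridge_fit(X[train], y[train], lam)
        errs.append(np.mean((y[held_out] - X[held_out] @ B_hat) ** 2))
    return float(np.mean(errs))

# made-up data for illustration
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, 0.0, -2.0, 0.0, 0.5]) + rng.normal(size=100)

lam_grid = np.linspace(0.0, 1.0, 11)        # candidate values on a grid in [0, 1]
cv_errs = [cv_error(X, y, lam) for lam in lam_grid]
print(lam_grid[int(np.argmin(cv_errs))])    # lambda with the smallest average held-out error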