Matrix MLE for Linear Regression
Joseph E. Gonzalez

Some people have had trouble with the linear algebra form of the MLE for multiple regression. I tried to find a nice online derivation but could not find anything helpful, so I have decided to derive the matrix form of the MLE weights for linear regression under the assumption of Gaussian noise.

The Model

Let's say we are given some set of data X and y. The matrix X has n rows, one for each example, and d columns, one for each of the d features. The column vector y has n rows, one for each example, and 1 column. We want to "learn" the relationship between an individual feature vector x and an outcome y. In some sense we want to learn the function f : R^d -> R which satisfies:

(1)    y = f(x)

Linear Models

There are many functions f that we could choose from (I am sure you have some favorites). To simplify our computation and to impose some assumptions (which often aids generalization) we will restrict f to the class of linear functions. That is, for a choice of weights w we can express f as:

(2)    f_w(x) = \sum_{j=1}^{d} w_j x_j

Nonlinear Features

Often people find this assumption too restrictive. We can permit a more complex class of functions by creating new (nonlinear) features from the original features x_j. For example:

(3)    f_w(x) = \sum_{j=1}^{d} w_j x_j + \sum_{j=d+1}^{2d} w_j \sin(x_{j-d}^2)

To formalize this notion we can rewrite equation 3 as:

(4)    f_w(x) = \sum_{j=1}^{m} w_j \phi_j[x]

Returning to the example in equation 3, we can use the notation of equation 4 by defining:

    \phi_j[x] = \begin{cases} x_j & \text{if } 1 \le j \le d \\ \sin(x_{j-d}^2) & \text{if } d+1 \le j \le 2d \\ 0 & \text{otherwise} \end{cases}

This technique allows us to lift our simple linear function f_w into a more complex space, permitting a richer class of functions in our original space R^d. With this transformation we can define a matrix \Phi which is like X but consists of the transformed features.
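As a concrete illustration, the lifting above can be sketched in a few lines of NumPy. This is only a sketch of the example in equation 3 (the function name `build_features` is mine, not from the notes): each row of \Phi holds the original d features of an example followed by the sin of each squared feature.

```python
import numpy as np

def build_features(X):
    """Build the feature matrix Phi for the example of equation 3:
    the original d features plus sin(x_j^2) for each of them,
    giving m = 2d columns."""
    return np.hstack([X, np.sin(X ** 2)])

# n = 2 examples, d = 2 features.
X = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# Phi has shape (2, 4): columns [x_1, x_2, sin(x_1^2), sin(x_2^2)].
Phi = build_features(X)
```

With the trivial (identity) transform, `build_features` would simply return `X` unchanged, which is the case treated in the rest of the notes.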
If we do not want to transform our features then we simply define:

(5)    \phi_j[x] = \begin{cases} x_j & \text{if } 1 \le j \le d \\ 0 & \text{otherwise} \end{cases}

The matrix \Phi is constructed by:

(6)    \Phi = \begin{pmatrix} \phi_1[X_{11}, \ldots, X_{1d}] & \cdots & \phi_m[X_{11}, \ldots, X_{1d}] \\ \vdots & \ddots & \vdots \\ \phi_1[X_{n1}, \ldots, X_{nd}] & \cdots & \phi_m[X_{n1}, \ldots, X_{nd}] \end{pmatrix}

If we use the trivial transform of equation 5, equation 6 becomes:

(7)    \Phi = \begin{pmatrix} X_{11} & \cdots & X_{1d} \\ \vdots & \ddots & \vdots \\ X_{n1} & \cdots & X_{nd} \end{pmatrix} = X

For the rest of these notes I will use the trivial feature space X. However, feel free to substitute \Phi wherever X is used if a nonlinear feature space is desired.

Noise

Sadly, we live in the real world, where random noise \epsilon gets mixed into our observations. So a more natural model would be of the form:

(8)    y = f_w(x) + \epsilon

We have to pick what type of noise we expect to observe. A common choice is zero-mean independent Gaussian noise of the form:

    \epsilon \sim N(0, \sigma^2)

Which f_w?

Having selected the feature transformation \phi and having decided to use a linear model, we have reduced our hypothesis space (the space of functions we are willing to consider for f) from all functions (and then some) to linear functions in the feature space determined by \phi. The functions in this space are indexed by w (the weight vector). How do we pick f from this reduced hypothesis space? We simply choose the "best" w. The remainder of these notes describes how to choose the w that maximizes the likelihood of our data X and y.

Matrix Notation

Let's begin with some linear algebra. We can apply our model to the data in the following way:

(9)    y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} f_w(\langle X_{11}, \ldots, X_{1d} \rangle) + \epsilon_1 \\ \vdots \\ f_w(\langle X_{n1}, \ldots, X_{nd} \rangle) + \epsilon_n \end{pmatrix} = \begin{pmatrix} \sum_{j=1}^{d} w_j X_{1j} + \epsilon_1 \\ \vdots \\ \sum_{j=1}^{d} w_j X_{nj} + \epsilon_n \end{pmatrix} = Xw + \epsilon

where w is a d x 1 column vector of weights and \epsilon is an n x 1 column vector of iid Gaussian noise terms \epsilon_i \sim N(0, \sigma^2). Notice how we can compactly compute all the y at once by simply multiplying Xw. If we solve for the noise in equation 9 we obtain:

(10)    (y - Xw) = \epsilon \sim N(0, \sigma^2 I)

We see that the residual of our regression model follows a multivariate Gaussian with covariance \sigma^2 I, where I is the identity matrix.
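The generative model of equation 9 can be simulated directly, which is a useful sanity check before deriving the estimator. A minimal sketch in NumPy (the particular n, d, weights, and noise level are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 100, 3
X = rng.normal(size=(n, d))          # design matrix: n examples, d features
w_true = np.array([2.0, -1.0, 0.5])  # hypothetical "true" weight vector (d x 1)
sigma = 0.1                          # noise standard deviation

eps = rng.normal(0.0, sigma, size=n) # iid eps_i ~ N(0, sigma^2), one per example
y = X @ w_true + eps                 # equation 9: y = Xw + eps
```

Note that `eps` has n entries, one per example, matching the n x 1 shape of y; the residual `y - X @ w_true` recovers exactly the noise vector, which is the content of equation 10.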
The density of the multivariate Gaussian takes the form:

(11)    p(V) = \frac{1}{(2\pi)^{N/2} |\Sigma|^{1/2}} \exp\left[ -\frac{1}{2} (V - \mu)^T \Sigma^{-1} (V - \mu) \right]

where V \sim N(\mu, \Sigma) and V \in R^{N \times 1} is a column vector of size N.

Likelihood

Using equations 10 and 11 we can express the likelihood of our data given our weights w as:

    P(X, y | w) \propto L(w) \propto \exp\left[ -\frac{1}{2} (y - Xw)^T \frac{1}{\sigma^2} I (y - Xw) \right]

We now want to maximize the likelihood of our data given the weights. First we take the log to make things easier (the log is monotonic, so it has the same maximizer):

    l(w) \propto (y - Xw)^T I (y - Xw)

Notice that we can remove any additional multiplicative constants, since they do not move the stationary point. We now have

    l(w) \propto \underbrace{(y - Xw)^T}_{\text{row vector}} \; \underbrace{I}_{\text{identity matrix}} \; \underbrace{(y - Xw)}_{\text{column vector}}

You should be able to convince yourself that this is equivalent to:

    l(w) \propto (y - Xw)^T (y - Xw)

Now let's take the gradient (a row vector) with respect to w:

    \frac{\partial}{\partial w} l(w) \propto \frac{\partial}{\partial w} \left[ (y - Xw)^T (y - Xw) \right]

To compute this we will use the gradient of a quadratic matrix expression. For more details see:
    http://en.wikipedia.org/wiki/Matrix_calculus
    http://www.ee.ic.ac.uk/hp/staff/dmb/matrix/calculus.html#deriv_quad

    \frac{\partial}{\partial w} l(w) \propto -(y - Xw)^T X - (y - Xw)^T X

Simplifying a little:

    \frac{\partial}{\partial w} l(w) \propto -2 (y - Xw)^T X

Removing extraneous constants:

    \frac{\partial}{\partial w} l(w) \propto -(y - Xw)^T X

Applying the transpose:

    \frac{\partial}{\partial w} l(w) \propto -(y^T - w^T X^T) X

Multiplying through by X:

    \frac{\partial}{\partial w} l(w) \propto -y^T X + w^T X^T X

Finally we set the derivative equal to zero and solve for w:

    -y^T X + w^T X^T X = 0
    w^T X^T X = y^T X
    w^T = y^T X (X^T X)^{-1}

Transposing both sides (and using the fact that X^T X, and hence its inverse, is symmetric) we have:

    w = (X^T X)^{-1} X^T y

Thus you have the matrix form of the MLE.
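The closed form above is straightforward to check numerically. A sketch in NumPy (the data here is synthetic, generated just for the check): rather than forming the inverse explicitly, it is numerically preferable to solve the normal equations X^T X w = X^T y, which gives the same w.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data from the model y = Xw + eps with small Gaussian noise.
n, d = 200, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.5, -2.0, 0.75])
y = X @ w_true + rng.normal(0.0, 0.05, size=n)

# Matrix form of the MLE: w = (X^T X)^{-1} X^T y,
# computed by solving the normal equations instead of inverting.
w_mle = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's least-squares solver, which minimizes
# ||y - Xw||^2 and therefore finds the same optimum.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

The agreement between `w_mle` and `w_lstsq` reflects the fact that maximizing the Gaussian likelihood is exactly minimizing the sum of squared residuals.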