Matrix MLE for Linear Regression
Joseph E. Gonzalez

Some people have had some trouble with the linear algebra form of the MLE for multiple regression. I tried to find a nice online derivation but I could not find anything helpful, so I have decided to derive the matrix form for the MLE weights for linear regression under the assumption of Gaussian noise.

The Model

Let's say we are given some set of data X and y. The matrix X has n rows, corresponding to each of the examples, and d columns, corresponding to each of the d features. The column vector y has n rows, corresponding to each of the examples, and 1 column. We want to learn the relationship between an individual feature vector x and an outcome y. In some sense we want to learn the function f : ℝ^d → ℝ which satisfies

    y = f(x)    (1)

Linear Models

There are many functions f that we could choose from; I am sure you have some favorites. To simplify our computation, and to impose some assumptions (which often aids in generalization), we will restrict f to the class of linear functions. That is, for a choice of weights w we can express f as

    f_w(x) = \sum_{j=1}^{d} w_j x_j    (2)

Nonlinear Features

Often people find this assumption too restrictive. We can permit a more complex class of functions by creating new nonlinear features from the original features x_j. For example,

    f_w(x) = \sum_{j=1}^{d} w_j x_j + \sum_{j=d+1}^{2d} w_j \sin(x_{j-d}^2)    (3)

To formalize this notion we can rewrite equation (3) as

    f_w(x) = \sum_{j=1}^{m} w_j \phi_j(x)    (4)

Returning to the example in equation (3), we can use the notation of equation (4) by defining

    \phi_j(x) = \begin{cases} x_j & \text{if } 1 \le j \le d \\ \sin(x_{j-d}^2) & \text{if } d+1 \le j \le 2d \end{cases}

This technique allows us to lift our simple linear function f_w into a more complex space, permitting a richer class of functions in our original space ℝ^d. With this transformation we can define a matrix Φ which is like X but consists of the transformed features. If we do not want to transform our features, then we simply define the trivial transform

    \phi_j(x) = x_j, \quad 1 \le j \le d    (5)

The matrix Φ is constructed by

    \Phi = \begin{pmatrix} \phi_1(X_{11}, \ldots, X_{1d}) & \cdots & \phi_m(X_{11}, \ldots, X_{1d}) \\ \vdots & & \vdots \\ \phi_1(X_{n1}, \ldots, X_{nd}) & \cdots & \phi_m(X_{n1}, \ldots, X_{nd}) \end{pmatrix}    (6)

If we use the trivial transform in equation (5), equation (6) becomes

    \Phi = \begin{pmatrix} X_{11} & \cdots & X_{1d} \\ \vdots & & \vdots \\ X_{n1} & \cdots & X_{nd} \end{pmatrix} = X    (7)

For the rest of these notes I will use the trivial feature space X. However, feel free to substitute Φ wherever X is used if a nonlinear feature space is desired.

Noise

Sadly, we live in the real world, where there is random noise ε that gets mixed into our observations. So a more natural model would be of the form

    y = f_w(x) + \epsilon    (8)

We have to pick what type of noise we expect to observe. A common choice is zero-mean independent Gaussian noise of the form ε ~ N(0, σ²).

Which f_w?

Having selected the feature transformation φ and having decided to use a linear model, we have reduced our hypothesis space (the space of functions we are willing to consider for f) from all the functions (and then some) to linear functions in the feature space determined by φ. The functions in this space are indexed by w, the weight vector. How do we pick f from this reduced hypothesis space? We simply choose the best w. For the remainder of these notes we will describe how to choose the w that maximizes the likelihood of our data X and y.

Matrix Notation

Let's begin with some linear algebra. We can apply our model to the data in the following way:

    y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}
      = \begin{pmatrix} f_w(X_{11}, \ldots, X_{1d}) + \epsilon_1 \\ \vdots \\ f_w(X_{n1}, \ldots, X_{nd}) + \epsilon_n \end{pmatrix}
      = \begin{pmatrix} \sum_{j=1}^{d} w_j X_{1j} + \epsilon_1 \\ \vdots \\ \sum_{j=1}^{d} w_j X_{nj} + \epsilon_n \end{pmatrix}
      = X w + \epsilon    (9)

where w is a d × 1 column vector of weights and ε is an n × 1 column vector of iid ε_i ~ N(0, σ²) Gaussian noise. Notice how we can compactly compute all the y at once by simply multiplying X w.
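As a concrete illustration (a minimal NumPy sketch with made-up sizes and seed, not part of the original derivation), the feature matrix Φ of equation (6) with the sin features of equation (3), and the matrix form of the model in equation (9), can be built as follows:

    import numpy as np

    rng = np.random.default_rng(0)

    n, d = 100, 3                          # examples and raw features (illustrative sizes)
    X = rng.normal(size=(n, d))            # raw design matrix, one row per example

    # Nonlinear lift of equation (3): phi(x) = (x_1, ..., x_d, sin(x_1^2), ..., sin(x_d^2))
    Phi = np.hstack([X, np.sin(X ** 2)])   # n x 2d feature matrix, equation (6)
    # With the trivial transform of equation (5) we would simply have Phi = X, equation (7).

    # Matrix form of the model, equation (9): y = Phi w + eps
    m = Phi.shape[1]
    w = rng.normal(size=m)                 # weight vector, one entry per feature
    sigma = 0.5                            # noise standard deviation (illustrative)
    eps = rng.normal(scale=sigma, size=n)  # iid N(0, sigma^2) noise, one entry per example
    y = Phi @ w + eps                      # all n responses in a single matrix multiply

The single product Phi @ w is exactly the compact computation of all the responses at once noted above.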
If we solve for the noise in equation (9), we obtain

    y - X w = \epsilon \sim N(0, \sigma^2 I)

    (y - X w) \sim N(0, \sigma^2 I)    (10)

We see that the residual of our regression model follows a multivariate Gaussian with covariance σ²I, where I is the identity matrix. The density of the multivariate Gaussian takes the form

    p(V) = (2\pi)^{-N/2} \, |\Sigma|^{-1/2} \exp\left[ -\frac{1}{2} (V - \mu)^T \Sigma^{-1} (V - \mu) \right]    (11)

where V ~ N(μ, Σ) and V ∈ ℝ^{N×1} is a column vector of size N.

Likelihood

Using equations (10) and (11), we can express the likelihood of our data given our weights w as

    P(X, y \mid w) = L(w) \propto \exp\left[ -\frac{1}{2} (y - X w)^T \frac{1}{\sigma^2} I \, (y - X w) \right]

We now want to maximize the likelihood of our data given the weights. First we take the log to make things easier:

    l(w) = -\frac{1}{2 \sigma^2} (y - X w)^T I (y - X w)

Notice that we can remove any additional (positive) multiplicative constants, since they do not change which w is the maximizer. We now have

    l(w) = -\underbrace{(y - X w)^T}_{\text{row vector}} \; \underbrace{I}_{\text{identity matrix}} \; \underbrace{(y - X w)}_{\text{column vector}}

You should be able to convince yourself that this is equivalent to

    l(w) = -(y - X w)^T (y - X w)

Now let's take the gradient (row-vector derivative) with respect to w:

    \nabla_w l(w) = \nabla_w \left[ -(y - X w)^T (y - X w) \right]

To compute this we will use the gradient of a quadratic matrix expression. For more details see
http://en.wikipedia.org/wiki/Matrix_calculus
http://www.ee.ic.ac.uk/hp/staff/dmb/matrix/calculus.html#deriv_quad

    \nabla_w l(w) = (y - X w)^T X + (y - X w)^T X

Simplifying a little,

    \nabla_w l(w) = 2 (y - X w)^T X

Removing extraneous constants,

    \nabla_w l(w) = (y - X w)^T X

Applying the transpose,

    \nabla_w l(w) = (y^T - w^T X^T) X

Multiplying through by X,

    \nabla_w l(w) = y^T X - w^T X^T X

Finally, we set the derivative equal to zero and solve for w to obtain

    y^T X - w^T X^T X = 0
    w^T X^T X = y^T X
    w^T = y^T X (X^T X)^{-1}

Removing the transpose (and using the fact that X^T X is symmetric), we have

    w = (X^T X)^{-1} X^T y

Thus you have the matrix form of the MLE.
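As a quick sanity check (again a minimal NumPy sketch with made-up data, not part of the original notes), the closed-form weights w = (X^T X)^{-1} X^T y can be computed and compared against a standard least-squares solver; they should agree, since the MLE under iid Gaussian noise coincides with ordinary least squares:

    import numpy as np

    rng = np.random.default_rng(1)

    # Simulated data with known weights (illustrative sizes and values)
    n, d = 200, 4
    X = rng.normal(size=(n, d))
    w_true = np.array([1.5, -2.0, 0.5, 3.0])
    y = X @ w_true + rng.normal(scale=0.3, size=n)

    # Matrix form of the MLE: w = (X^T X)^{-1} X^T y.
    # Solving the normal equations (X^T X) w = X^T y avoids forming the inverse explicitly.
    w_mle = np.linalg.solve(X.T @ X, X.T @ y)

    # Cross-check against NumPy's least-squares routine.
    w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

    print(w_mle)                          # should be close to w_true
    print(np.allclose(w_mle, w_lstsq))    # expected: True

If a nonlinear feature space is desired, substitute the feature matrix Φ for X here, exactly as noted earlier in these notes.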