Machine Learning 10-701/15-781, Fall 2008
Introduction to Regression
Eric Xing
Lecture 4, January 28, 2006
Reading: Chap. 3, CB

Functional Approximation, cont.

Machine learning for apartment hunting
• Now you've moved to Pittsburgh, and you want to find the most reasonably priced apartment satisfying your needs: square footage, # of bedrooms, distance to campus, ...

  Living area (ft^2)   # bedroom   Rent ($)
  230                  1           600
  506                  2           1000
  433                  2           1100
  109                  1           500
  150                  1           ?
  270                  1.5         ?

The learning problem
(Figure: scatter plots of rent vs. living area, and of rent vs. location and living area.)
• Features:
  - Living area, distance to campus, # bedroom, ...
  - Denote as $x = [x^1, x^2, \ldots, x^k]$
• Target:
  - Rent
  - Denoted as $y$
• Training set:
  $$X = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = \begin{bmatrix} x_1^1 & x_1^2 & \cdots & x_1^k \\ x_2^1 & x_2^2 & \cdots & x_2^k \\ \vdots & \vdots & & \vdots \\ x_n^1 & x_n^2 & \cdots & x_n^k \end{bmatrix}, \qquad Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$$

Linear Regression
• Assume that $Y$ (target) is a linear function of $X$ (features):
  - e.g.: $\hat{y} = \theta_0 + \theta_1 x^1 + \theta_2 x^2$
  - let's assume a vacuous "feature" $X_0 = 1$ (this is the intercept term; why?), and define the feature vector to be $x = [x^0, x^1, \ldots, x^k]$
  - then we have the following general representation of the linear function:
    $$\hat{y} = \theta^T x = \sum_j \theta_j x^j$$
• Our goal is to pick the optimal $\theta$. How?
  - We seek $\theta$ that minimizes the following cost function:
    $$J(\theta) = \frac{1}{2}\sum_{i=1}^{n}\big(\hat{y}(\vec{x}_i) - y_i\big)^2$$

The Least-Mean-Square (LMS) method
• The Cost Function:
  $$J(\theta) = \frac{1}{2}\sum_{i=1}^{n}\big(\theta^T x_i - y_i\big)^2$$
• Consider a gradient descent algorithm:
  $$\theta_j^{t+1} = \theta_j^t - \alpha\,\frac{\partial}{\partial\theta_j}J(\theta)\Big|_{\theta^t}$$

The Least-Mean-Square (LMS) method
• Now we have the following descent rule:
  $$\theta_j^{t+1} = \theta_j^t + \alpha\sum_{i=1}^{n}\big(y_i - x_i^T\theta^t\big)\,x_i^j$$
• For a single training point, we have:
  $$\theta_j^{t+1} = \theta_j^t + \alpha\,\big(y_i - x_i^T\theta^t\big)\,x_i^j$$
• This is known as the LMS update rule, or the Widrow-Hoff learning rule
• This is actually a "stochastic coordinate" descent algorithm
• This can be used as an on-line algorithm

The Least-Mean-Square (LMS) method
• Steepest descent
• Note that
  $$\nabla_\theta J = \left[\frac{\partial J}{\partial\theta_1}, \ldots, \frac{\partial J}{\partial\theta_k}\right]^T = -\sum_{i=1}^{n}\big(y_i - x_i^T\theta\big)\,x_i$$
  $$\theta^{t+1} = \theta^t + \alpha\sum_{i=1}^{n}\big(y_i - x_i^T\theta^t\big)\,x_i$$
• This is a batch gradient descent algorithm

Some matrix derivatives
• For $f:\mathbb{R}^{m\times n}\mapsto\mathbb{R}$, define
  $$\nabla_A f(A) = \begin{bmatrix} \frac{\partial f}{\partial A_{11}} & \cdots & \frac{\partial f}{\partial A_{1n}} \\ \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial A_{m1}} & \cdots & \frac{\partial f}{\partial A_{mn}} \end{bmatrix}$$
• Trace:
  $$\operatorname{tr}A = \sum_{i=1}^{n}A_{ii}, \qquad \operatorname{tr}a = a, \qquad \operatorname{tr}ABC = \operatorname{tr}CAB = \operatorname{tr}BCA$$
• Some facts of matrix derivatives (without proof):
  $$\nabla_A\operatorname{tr}AB = B^T, \qquad \nabla_A|A| = |A|\big(A^{-1}\big)^T, \qquad \nabla_A\operatorname{tr}ABA^TC = CAB + C^TAB^T$$

The normal equations
• Write the cost function in matrix form:
  $$J(\theta) = \frac{1}{2}\sum_{i=1}^{n}\big(x_i^T\theta - y_i\big)^2 = \frac{1}{2}\big(X\theta - \vec{y}\big)^T\big(X\theta - \vec{y}\big) = \frac{1}{2}\Big(\theta^TX^TX\theta - \theta^TX^T\vec{y} - \vec{y}^TX\theta + \vec{y}^T\vec{y}\Big),$$
  where
  $$X = \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_n^T \end{bmatrix}, \qquad \vec{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$$
• To minimize $J(\theta)$, take the derivative and set it to zero:
  $$\nabla_\theta J = \frac{1}{2}\nabla_\theta\operatorname{tr}\Big(\theta^TX^TX\theta - \theta^TX^T\vec{y} - \vec{y}^TX\theta + \vec{y}^T\vec{y}\Big) = \frac{1}{2}\Big(X^TX\theta + X^TX\theta - 2X^T\vec{y}\Big) = X^TX\theta - X^T\vec{y} = 0$$
  which gives the normal equations:
  $$X^TX\theta = X^T\vec{y} \quad\Longrightarrow\quad \theta^{*} = \big(X^TX\big)^{-1}X^T\vec{y}$$

A recap
• LMS update rule: $\theta_j^{t+1} = \theta_j^t + \alpha\,(y_i - x_i^T\theta^t)\,x_i^j$
  - Pros: on-line, low per-step cost
  - Cons: coordinate, maybe slow-converging
• Steepest descent: $\theta^{t+1} = \theta^t + \alpha\sum_{i=1}^{n}(y_i - x_i^T\theta^t)\,x_i$
  - Pros: fast-converging, easy to implement
  - Cons: a batch
• Normal equations: $\theta^{*} = (X^TX)^{-1}X^T\vec{y}$
  - Pros: a single-shot algorithm; easiest to implement
  - Cons: need to compute the pseudo-inverse $(X^TX)^{-1}$, which is expensive and can hit numerical issues (e.g., when the matrix is singular)

Geometric Interpretation of LMS
• The predictions on the training data are:
  $$\hat{\vec{y}} = X\theta^{*} = X\big(X^TX\big)^{-1}X^T\vec{y}$$
• Note that
  $$\hat{\vec{y}} - \vec{y} = \Big(X\big(X^TX\big)^{-1}X^T - I\Big)\vec{y}$$
  and
  $$X^T\big(\hat{\vec{y}} - \vec{y}\big) = X^T\Big(X\big(X^TX\big)^{-1}X^T - I\Big)\vec{y} = \Big(X^TX\big(X^TX\big)^{-1}X^T - X^T\Big)\vec{y} = 0$$
• So $\hat{\vec{y}}$ is the orthogonal projection of $\vec{y}$ into the space spanned by the columns of $X$ (whose rows are the training inputs $x_i^T$).
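To make the recap concrete, the following is a minimal sketch (mine, not from the lecture) that fits the rent model with both batch gradient descent and the normal equations on the four fully observed rows of the apartment table, then predicts rent for the two query apartments. The feature standardization, the learning rate, and the iteration count are my own choices, added so that gradient descent converges on this tiny data set.

```python
# Fit rent ~ [living area, # bedroom] two ways: batch gradient descent and the
# normal equations. Standardization and alpha are choices made for this sketch.
import numpy as np

# Training data from the apartment table: [living area (ft^2), # bedroom] -> rent ($)
X_raw = np.array([[230., 1.], [506., 2.], [433., 2.], [109., 1.]])
y = np.array([600., 1000., 1100., 500.])

# Standardize features, then prepend the vacuous intercept feature x^0 = 1
mu, sigma = X_raw.mean(axis=0), X_raw.std(axis=0)
X = np.hstack([np.ones((len(y), 1)), (X_raw - mu) / sigma])

# Batch gradient descent ("steepest descent"):
# theta^{t+1} = theta^t + alpha * sum_i (y_i - x_i^T theta^t) x_i
theta = np.zeros(X.shape[1])
alpha = 0.05
for _ in range(5000):
    theta = theta + alpha * X.T @ (y - X @ theta)

# Normal equations: theta* = (X^T X)^{-1} X^T y
# (lstsq is used instead of an explicit inverse, for numerical stability)
theta_star, *_ = np.linalg.lstsq(X, y, rcond=None)

print("gradient descent :", theta)
print("normal equations :", theta_star)

# Predict rent for the two query apartments from the table
X_query = np.array([[150., 1.], [270., 1.5]])
X_query = np.hstack([np.ones((2, 1)), (X_query - mu) / sigma])
print("predicted rents  :", X_query @ theta_star)
```

Both procedures should agree to several decimal places: they are simply different routes to the same least-squares solution $\theta^{*}$.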
Probabilistic Interpretation of LMS
• Let us assume that the target variable and the inputs are related by the equation
  $$y_i = \theta^T x_i + \varepsilon_i,$$
  where $\varepsilon$ is an error term of unmodeled effects or random noise.
• Now assume that $\varepsilon$ follows a Gaussian $N(0,\sigma^2)$; then we have:
  $$p(y_i \mid x_i;\theta) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(y_i - \theta^T x_i)^2}{2\sigma^2}\right)$$
• By the independence assumption:
  $$L(\theta) = \prod_{i=1}^{n}p(y_i \mid x_i;\theta) = \left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^{\!n}\exp\!\left(-\frac{\sum_{i=1}^{n}(y_i - \theta^T x_i)^2}{2\sigma^2}\right)$$

Probabilistic Interpretation of LMS, cont.
• Hence the log-likelihood is:
  $$\ell(\theta) = n\log\frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{\sigma^2}\cdot\frac{1}{2}\sum_{i=1}^{n}\big(y_i - \theta^T x_i\big)^2$$
• Do you recognize the last term? Yes, it is:
  $$J(\theta) = \frac{1}{2}\sum_{i=1}^{n}\big(x_i^T\theta - y_i\big)^2$$
• Thus, under the independence assumption, LMS is equivalent to MLE of $\theta$!

Beyond basic LR
• LR with non-linear basis functions
• Locally weighted linear regression (a sketch of the standard formulation appears at the end of this section)
• Regression trees and Multilinear Interpolation

LR with non-linear basis functions
• LR does not mean we can only deal with linear relationships
• We are free to design (non-linear) features under LR:
  $$y = \theta_0 + \sum_{j=1}^{m}\theta_j\,\phi_j(x) = \theta^T\phi(x),$$
  where the $\phi_j(x)$ are fixed basis functions (and we define $\phi_0(x)=1$).
• Example: polynomial regression:
  $$\phi(x) := [1, x, x^2, x^3]$$
  (a small worked example follows at the end of this section)
• We will be concerned with estimating (distributions over) the weights $\theta$ and choosing the model order $M$.

Basis functions
• There are many basis functions, e.g.:
  - Polynomial: $\phi_j(x) = x^{j-1}$
  - Radial basis functions: $\phi_j(x) = \exp\!\left(-\frac{(x-\mu_j)^2}{2s^2}\right)$
  - Sigmoidal: $\phi_j(x) = \sigma\!\left(\frac{x-\mu_j}{s}\right)$
  - Splines, Fourier, Wavelets, etc.

1D and 2D RBFs
(Figures: 1D RBF basis functions, and the resulting curve after fit.)

Good and Bad RBFs
(Figures: a good 2D RBF, and two bad 2D RBFs.)

Locally weighted linear regression
• Overfitting and underfitting:
  $$y = \theta_0 + \theta_1 x \qquad\quad y = \theta_0 + \theta_1 x + \theta_2 x^2 \qquad\quad y = \sum_{j=0}^{5}\theta_j x^j$$
  (Figure: the same data fit by each of these three models.)

Bias and variance
• We define the bias of a model to be the expected generalization error even if we were to fit it to a very (say, infinitely) large training set.
• By fitting "spurious" patterns in the training set, we might again obtain a model …
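As a concrete companion to the basis-function and overfitting material above, here is a minimal sketch (my own illustration, not taken from the slides) of polynomial-basis regression: the same normal-equations solver is applied to a basis-expanded design matrix $\Phi$, and raising the model order $M$ moves the fit from underfitting toward overfitting. The synthetic sine data, the noise level, and the specific orders tried are arbitrary choices.

```python
# Linear regression on non-linear (polynomial) basis functions phi_j(x) = x^j.
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, size=12))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

def poly_design(x, M):
    """Design matrix Phi with phi_j(x) = x^j for j = 0..M (so phi_0 = 1)."""
    return np.vander(x, M + 1, increasing=True)

for M in (1, 3, 9):
    Phi = poly_design(x, M)
    # Normal equations on the expanded features: theta = (Phi^T Phi)^{-1} Phi^T y
    theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    train_err = 0.5 * np.sum((Phi @ theta - y) ** 2)
    print(f"M = {M}: training error J(theta) = {train_err:.4f}")
# M = 1 underfits (high bias); M = 9 can chase the noise (high variance).
```

Comparing training error with error on held-out data for each $M$ is the usual way to see the bias/variance trade-off that the last slide begins to discuss.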
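The "Beyond basic LR" outline also lists locally weighted linear regression, but its derivation is not part of this excerpt. The sketch below therefore uses the standard formulation as a stated assumption: each query point gets its own linear fit, with training points weighted by a Gaussian kernel of bandwidth tau, and the parameters obtained by weighted least squares.

```python
# Locally weighted linear regression (standard formulation, assumed here).
import numpy as np

def lwr_predict(x_query, X, y, tau=0.3):
    """Predict y at x_query with a locally weighted linear fit.

    X is the (n, k) design matrix whose first column is the intercept feature 1;
    x_query is a length-k vector in the same format.
    """
    # Gaussian weights: w_i = exp(-||x_i - x_query||^2 / (2 tau^2))
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2.0 * tau ** 2))
    W = np.diag(w)
    # Weighted normal equations: theta = (X^T W X)^{-1} X^T W y
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return x_query @ theta

# Toy 1D example: noisy sine curve, with the intercept feature prepended
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0.0, 1.0, size=30))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=x.shape)
X = np.column_stack([np.ones_like(x), x])

x_q = np.array([1.0, 0.5])  # query point x = 0.5, with intercept feature
print("LWR prediction at x = 0.5:", lwr_predict(x_q, X, y))
```

Because a separate weighted system is solved for every query, prediction is more expensive than ordinary LR, but the fit adapts to local structure without committing to a global basis.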