Machine Learning 10 601 Fall 2013 Linear Regression and Sparsity Eric Xing Lecture 6 September 23 2013 Reading Eric Xing CMU 2013 1 Now you have learned this Representing data Machine Learning What do you call this Eric Xing CMU 2006 2012 2 A slightly different problem Machine learning for apartment hunting Now you ve moved to Pittsburgh And you want to find the most reasonably priced apartment satisfying your needs square ft of bedroom distance to campus Living area ft2 bedroom Rent 230 1 600 506 2 1000 433 2 1100 109 1 500 150 1 270 1 5 Eric Xing CMU 2013 3 The learning problem rent Living area Features Living area distance to campus bedroom Denote as x x1 x2 xk Target Rent Denoted as y Training set rent x1 x 2 X x n Location Living area y1 y Y 2 yn x11 x12 x12 x22 1 xn xn2 y 1 y 2 or y n Eric Xing CMU 2013 x1k x2k xnk 4 Linear Regression Assume that Y target is a linear function of X features e g let s assume a vacuous feature X0 1 this is the intercept term why and define the feature vector to be then we have the following general representation of the linear function y 0 1 x1 2 x 2 Our goal is to pick the optimal How We seek that minimize the following cost function 1 n J y i xi yi 2 2 i 1 Eric Xing CMU 2013 5 The Least Mean Square LMS method The Cost Function 1 n T J x i yi 2 2 i 1 Consider a gradient descent algorithm j t 1 j J j t Eric Xing CMU 2013 t 6 The Least Mean Square LMS method Now we have the following descent rule j t 1 n T j y i x i t xi j t i 1 For a single training point we have This is known as the LMS update rule or the Widrow Hoff learning rule This is actually a stochastic coordinate descent algorithm This can be used as a on line algorithm Eric Xing CMU 2013 7 Steepest Descent and LMS Steepest descent Note that T n T J J J yn x n x n k i 1 1 t 1 n yn x n t x n t T i 1 This is as a batch gradient descent algorithm Eric Xing CMU 2013 8 The normal equations Write the cost function in matrix form 1 T x i yi 2 2 i 1 1 T X y X y 2 1 T X T X T X T y y T X y T y 2 J x1 x 2 X x n n y1 y2 y yn To minimize J take derivative and set to zero 1 tr T X T X T X T y y T X y T y 2 1 tr T X T X 2 try T X try T y 2 1 X T X X T X 2 X T y 2 X T X X T y 0 J X T X X T y The normal equations Eric Xing CMU 2013 X X T 1 X y T 9 Comments on the normal equation In most situations of practical interest the number of data points N is larger than the dimensionality k of the input space and the matrix X is of full column rank If this condition holds then it is easy to verify that XTX is necessarily invertible The assumption that XTX is invertible implies that it is positive definite thus at the critical point we have found is a minimum What if X has less than full column rank regularization later Eric Xing CMU 2013 10 Direct and Iterative methods Direct methods we can achieve the solution in a single step by solving the normal equation Using Gaussian elimination or QR decomposition we converge in a finite number of steps It can be infeasible when data are streaming in in real time or of very large amount Iterative methods stochastic or steepest gradient Converging in a limiting sense But more attractive in large practical problems Caution is needed for deciding the learning rate Eric Xing CMU 2013 11 Convergence rate Theorem the steepest descent equation algorithm converge to the minimum of the cost characterized by normal equation If A formal analysis of LMS need more math mussels in practice one can use a small or gradually decrease Eric Xing CMU 2013 12 A Summary LMS update rule j t 1 j y n x n t x n i t T Pros on line low per step cost fast convergence and perhaps less prone to local optimum Cons convergence to optimum not always guaranteed Steepest descent n T t 1 t yn x n t x n i 1 Pros easy to implement conceptually clean guaranteed convergence Cons batch often slow converging Normal equations X X T 1 X y T Pros a single shot algorithm Easiest to implement Cons need to compute pseudo inverse XTX 1 expensive numerical issues e g matrix is singular although there are ways to get around this Eric Xing CMU 2013 13 Probabilistic Interpretation of LMS Let us assume that the target variable and the inputs are related by the equation yi T x i i where is an error term of unmodeled effects or random noise Now assume that follows a Gaussian N 0 then we have yi T x i 2 1 p y i xi exp 2 2 2 By independence assumption n L i 1 n T 2 y x 1 i i p y i xi exp i 1 2 2 2 n Eric Xing CMU 2013 14 Probabilistic Interpretation of LMS cont Hence the log likelihood is l n log Do you recognize the last term Yes it is 1 1 1 n 2 i 1 y i T x i 2 2 2 1 n T J x i yi 2 2 i 1 Thus under independence assumption LMS is equivalent to MLE of Eric Xing CMU 2013 15 Case study predicting gene expression The genetic picture causal SNPs CGTTTCACTGTACAATTT a univariate phenotype i e the expression intensity of a gene Eric Xing CMU 2013 16 Association Mapping as Regression Phenotype BMI Individual 1 2 5 Genotype C T C T C A C T Individual 2 4 8 G A G A C T C T Individual N 4 7 G T C T G T G T Benign SNPs Eric Xing CMU 2013 Causal SNP 17 Association Mapping as Regression Phenotype BMI Genotype 2 5 0 1 0 0 Individual 2 4 8 1 1 1 1 4 7 2 2 1 0 Individual 1 Individual N J yi xij j j 1 Eric Xing CMU 2013 SNPs with large j are relevant 18 Experimental setup Asthama dataset 543 individuals genotyped at 34 SNPs Diploid data was transformed into 0 1 for homozygotes or 2 for heterozygotes X 543x34 matrix Y Phenotype variable continuous A single phenotype was used for regression Implementation details …
View Full Document
Unlocking...