Toronto CSC 2515 - Lecture 2: Linear regression

CSC2515 Fall 2007, Introduction to Machine Learning
Lecture 2: Linear regression

Contents:
Linear models
Some types of basis function in 1-D
Two types of linear model that are equivalent with respect to learning
The loss function
Minimizing squared error
A geometrical view of the solution
When is minimizing the squared error equivalent to Maximum Likelihood Learning?
Multiple outputs
Least mean squares: An alternative approach for really big datasets
Regularized least squares
A picture of the effect of the regularizer
A problem with the regularizer
Why does shrinkage help?
Why shrinkage helps
Other regularizers
The lasso: penalizing the absolute values of the weights
A geometrical view of the lasso compared with a penalty on the squared weights
An example where minimizing the squared error gives terrible estimates
One-dimensional cross-sections of loss functions with different powers
Minimizing the absolute error
The bias-variance trade-off (a figment of the frequentists' lack of imagination?)
The bias-variance decomposition
How the regularization parameter affects the bias and variance terms
An example of the bias-variance trade-off
Beating the bias-variance trade-off
The Bayesian approach
Slide 28
Slide 29
Using the posterior distribution
The predictive distribution for noisy sinusoidal data modeled by a linear combination of nine radial basis functions
A way to see the covariance of the predictions for different values of x
Bayesian model comparison
Definition of the evidence
Using the evidence
How the model complexity affects the evidence
Determining the hyperparameters that specify the variance of the prior and the variance of the output noise
Empirical Bayes

All lecture slides will be available as .ppt, .ps, and .htm at www.cs.toronto.edu/~hinton. Many of the figures are provided by Chris Bishop from his textbook "Pattern Recognition and Machine Learning".

Linear models
• It is mathematically easy to fit linear models to data.
– We can learn a lot about model-fitting in this relatively simple case.
• There are many ways to make linear models more powerful while retaining their nice mathematical properties:
– By using non-linear, non-adaptive basis functions, we can get generalised linear models that learn non-linear mappings from input to output but are linear in their parameters – only the linear part of the model learns.
– By using kernel methods we can handle expansions of the raw data that use a huge number of non-linear, non-adaptive basis functions.
– By using large-margin kernel methods we can avoid overfitting even when we use huge numbers of basis functions.
• But linear methods will not solve most AI problems.
– They have fundamental limitations.

Some types of basis function in 1-D
• Examples: sigmoids, Gaussians, polynomials.
• Sigmoid and Gaussian basis functions can also be used in multilayer neural networks, but neural networks learn the parameters of the basis functions. This is much more powerful but also much harder and much messier.
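To make the "linear in the parameters" point concrete, here is a minimal numpy sketch (not part of the original slides) of a generalised linear model: a 1-D input is expanded with Gaussian basis functions and the weights are then fitted by ordinary least squares. The centres, width, and synthetic data are made-up values for illustration only.

```python
# A sketch of a generalised linear model with Gaussian basis functions.
# The model is non-linear in the input x but linear in the weights w,
# so fitting is still ordinary least squares on the design matrix Phi.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=50)                          # 1-D inputs
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(50)   # noisy targets

centres = np.linspace(0.0, 1.0, 9)                          # assumed basis-function centres
width = 0.1                                                 # assumed common width

def design_matrix(x):
    """One row per case: a bias column plus one Gaussian bump per centre."""
    bumps = np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2 * width ** 2))
    return np.column_stack([np.ones_like(x), bumps])

Phi = design_matrix(x)
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)                 # least-squares fit of the weights

x_new = np.linspace(0.0, 1.0, 5)
print(design_matrix(x_new) @ w)                             # predictions at a few new points
```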
Two types of linear model that are equivalent with respect to learning

y(x, w) = w_0 + w_1 x_1 + w_2 x_2 + \dots = w_0 + w^T x
y(x, w) = w_0 + w_1 \phi_1(x) + w_2 \phi_2(x) + \dots = w_0 + w^T \phi(x)

(w_0 is the bias.)
• The first model has the same number of adaptive coefficients as the dimensionality of the data + 1.
• The second model has the same number of adaptive coefficients as the number of basis functions + 1.
• Once we have replaced the data by the outputs of the basis functions, fitting the second model is exactly the same problem as fitting the first model (unless we use the kernel trick).
– So it is silly to clutter up the math with basis functions.

The loss function
• Fitting a model to data is typically done by finding the parameter values that minimize some loss function.
• There are many possible loss functions. What criterion should we use for choosing one?
– Choose one that makes the math easy (squared error).
– Choose one that makes the fitting correspond to maximizing the likelihood of the training data given some noise model for the observed outputs.
– Choose one that makes it easy to interpret the learned coefficients (easy if mostly zeros).
– Choose one that corresponds to the real loss on a practical application (losses are often asymmetric).

Minimizing squared error

y_n = w^T x_n
E(w) = \sum_n (t_n - w^T x_n)^2
w^* = (X^T X)^{-1} X^T t

– Here w^* is the vector of optimal weights, (X^T X)^{-1} is the inverse of the covariance matrix of the input vectors, X^T is the transposed design matrix with one input vector per column, and t is the vector of target values.

A geometrical view of the solution
• The space has one axis for each training case.
• So the vector of target values is a point in the space.
• Each vector of the values of one component of the input is also a point in this space.
• The input component vectors span a subspace, S.
– A weighted sum of the input component vectors must lie in S.
• The optimal solution is the orthogonal projection of the vector of target values onto S.
[Figure: the target vector and an input component vector plotted in the space with one axis per training case.]

When is minimizing the squared error equivalent to Maximum Likelihood Learning?
• Minimizing the squared residuals is equivalent to maximizing the log probability of the correct answer under a Gaussian centered at the model's guess.

t_n = the correct answer;  y_n = y(x_n, w) = the model's estimate of the most probable value
t_n = y_n + Gaussian noise, so

p(t_n | y_n) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(t_n - y_n)^2}{2\sigma^2}\right)
-\log p(t_n | y_n) = \frac{1}{2}\log(2\pi) + \log\sigma + \frac{(t_n - y_n)^2}{2\sigma^2}

– The \log\sigma term can be ignored if \sigma is fixed, and the 1/(2\sigma^2) scale factor can be ignored if \sigma is the same for every training case, so maximizing the log likelihood then amounts to minimizing \sum_n (t_n - y_n)^2.

Multiple outputs
• If there are multiple outputs we can often treat the learning problem as a set of independent problems, one per output.
– Not true if the output noise is correlated and changes from case to case.
• Even though they are independent problems, we can save work by multiplying the input vectors by the inverse covariance of the input components only once. For output k we have

w_k^* = (X^T X)^{-1} X^T t_k

– The matrix (X^T X)^{-1} X^T does not depend on the output k, so it only needs to be computed once.

Least mean squares: An alternative approach for really big datasets
• This is called "online" learning. It can be more efficient if the dataset is very redundant, and it is simple to implement in hardware.
– It is also called stochastic gradient descent if the training cases are picked at random.
– Care must be taken with the learning rate to prevent divergent oscillations, and the rate must decrease at the end to get a good fit.

w^{(\tau+1)} = w^{(\tau)} - \eta \nabla E_n

– w^{(\tau+1)} is the weight vector after seeing training case \tau + 1, \eta is the learning rate, and \nabla E_n is the vector of derivatives of the squared error on that case with respect to the weights.
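As a concrete companion to the LMS slide above, here is a minimal sketch (not from the slides) of the online update w^{(\tau+1)} = w^{(\tau)} - \eta \nabla E_n on synthetic data; the data, learning-rate schedule, and number of passes are assumptions chosen for illustration.

```python
# A sketch of the LMS / stochastic-gradient update  w <- w + eta*(t_n - w.x_n)*x_n,
# i.e. gradient descent on the squared error of one training case at a time
# (the factor of 2 from the gradient is folded into the learning rate).
import numpy as np

rng = np.random.default_rng(1)
N, D = 10_000, 5
X = np.column_stack([np.ones(N), rng.standard_normal((N, D))])  # bias column + inputs
w_true = rng.standard_normal(D + 1)
t = X @ w_true + 0.1 * rng.standard_normal(N)                   # noisy targets

w = np.zeros(D + 1)
eta0 = 0.01
for sweep in range(5):
    eta = eta0 / (1.0 + sweep)                  # decrease the rate to get a good final fit
    for n in rng.permutation(N):                # pick training cases in random order
        residual = t[n] - w @ X[n]
        w += eta * residual * X[n]              # one online update per case

print(np.max(np.abs(w - w_true)))               # should be close to zero
```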

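And a small sketch (again not from the slides) of the "Multiple outputs" point: the matrix (X^T X)^{-1} X^T is computed once and reused for every column of targets. Shapes and data are made up for illustration.

```python
# (X^T X)^{-1} X^T does not depend on which output is being fitted, so compute
# it once and apply it to every column of the target matrix T.
import numpy as np

rng = np.random.default_rng(2)
N, D, K = 200, 4, 3
X = np.column_stack([np.ones(N), rng.standard_normal((N, D))])  # one input vector per row
T = rng.standard_normal((N, K))                                 # one column of targets per output

P = np.linalg.solve(X.T @ X, X.T)        # (X^T X)^{-1} X^T, computed once
W = P @ T                                # column k of W is  w*_k = (X^T X)^{-1} X^T t_k

# Same result as solving each output independently:
W_separate = np.column_stack(
    [np.linalg.solve(X.T @ X, X.T @ T[:, k]) for k in range(K)]
)
print(np.allclose(W, W_separate))        # True
```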
