UB CSE 574 - Basic Concepts in Machine Learning

Sargur N. Srihari

Introduction to ML: Topics
1. Polynomial Curve Fitting
2. Probability Theory of multiple variables
3. Maximum Likelihood
4. Bayesian Approach
5. Model Selection
6. Curse of Dimensionality

Polynomial Curve Fitting

Simple Regression Problem
• Observe a real-valued input variable x
• Use x to predict the value of a target variable t
• Synthetic data is generated from sin(2πx)
• Random noise is added to the target values

Notation
• N observations of x: x = (x_1, ..., x_N)^T with targets t = (t_1, ..., t_N)^T
• The goal is to exploit the training set to predict the value of t for a new x
• This is inherently a difficult problem; probability theory allows us to make a prediction
• Data generation: N = 10 points spaced uniformly in the range [0, 1], with targets generated from sin(2πx) by adding small Gaussian noise (such noise is typical of unobserved variables)

Polynomial Curve Fitting
• Polynomial function y(x, w) = \sum_{j=0}^{M} w_j x^j, where M is the order of the polynomial
• Is a higher value of M better? We'll see shortly
• The coefficients w_0, ..., w_M are denoted by the vector w
• y is a nonlinear function of x but a linear function of the coefficients w; such models are called linear models

Error Function
• Sum of squares of the errors between the predictions y(x_n, w) for each data point x_n and the target values t_n:
  E(w) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, w) - t_n \}^2
• The factor 1/2 is included for later convenience
• Solve by choosing the value of w for which E(w) is as small as possible
• (Figure: the red line is the best polynomial fit)

Minimization of the Error Function
• The error function is quadratic in the coefficients w, so its derivative with respect to the coefficients is linear in the elements of w
• Thus the error function has a unique minimum, denoted w*, and the resulting polynomial is y(x, w*)
• Since y(x, w) = \sum_{j=0}^{M} w_j x^j,
  \frac{\partial E(w)}{\partial w_i} = \sum_{n=1}^{N} \{ y(x_n, w) - t_n \} x_n^i = \sum_{n=1}^{N} \Big( \sum_{j=0}^{M} w_j x_n^j - t_n \Big) x_n^i
• Setting this equal to zero gives \sum_{n=1}^{N} \sum_{j=0}^{M} w_j x_n^{i+j} = \sum_{n=1}^{N} t_n x_n^i
• This set of M + 1 equations (i = 0, ..., M) is solved to obtain the elements of w* (a NumPy sketch of the full fit-and-evaluate pipeline follows these notes)

Choosing the Order M
• A problem of model comparison or model selection
• The red lines are the best fits with M = 0, 1, 3, 9 and N = 10
• M = 0 and M = 1 give poor representations of sin(2πx); M = 3 gives the best fit to sin(2πx); M = 9 over-fits and is again a poor representation of sin(2πx)

Generalization Performance
• Consider a separate test set of 100 points
• For each value of M, evaluate E(w*) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, w*) - t_n \}^2, with y(x, w*) = \sum_{j=0}^{M} w_j^* x^j, on both the training data and the test data
• Use the RMS error E_{RMS} = \sqrt{2 E(w*) / N}
  – Division by N allows data sets of different size to be compared on an equal footing
  – The square root ensures that E_{RMS} is measured in the same units as t
• Low-order polynomials give poor error because they are inflexible; intermediate orders give small error; M = 9 has ten degrees of freedom and is tuned exactly to the 10 training points, producing wild oscillations in the polynomial

Values of the Coefficients w*
• Comparing w* for polynomials of different order M, the magnitude of the coefficients increases as M increases
• At M = 9 the coefficients are finely tuned to the random noise in the target values

Increasing the Size of the Data Set
• N = 15 and N = 100
• For a given model complexity, the over-fitting problem becomes less severe as the size of the data set increases
• The larger the data set, the more complex the model we can afford to fit to the data
• Heuristic: the number of data points should be no less than 5 to 10 times the number of adaptive parameters in the model
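The fit-and-evaluate pipeline described in the notes above (synthetic targets from sin(2πx) with Gaussian noise, a degree-M polynomial fitted by minimizing the sum-of-squares error, and the RMS error compared on training and test data) can be sketched in a few lines of NumPy. This is an illustrative sketch rather than code from the lecture: the sample sizes (N = 10 training points, a 100-point test set) follow the slides, while the noise level, the random seed, and all function names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Synthetic data: x spaced uniformly on [0, 1], t = sin(2*pi*x) + Gaussian noise."""
    x = np.linspace(0.0, 1.0, n)
    t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)
    return x, t

def design_matrix(x, m):
    """N x (M+1) matrix A with entries A[n, j] = x_n ** j for j = 0, ..., M."""
    return np.vander(x, m + 1, increasing=True)

def fit_polynomial(x, t, m):
    """Minimize E(w) = 1/2 * sum_n (y(x_n, w) - t_n)**2.
    The minimizer w* satisfies the M+1 linear (normal) equations from the notes;
    here they are solved with a numerically stable least-squares routine."""
    A = design_matrix(x, m)
    w_star, *_ = np.linalg.lstsq(A, t, rcond=None)
    return w_star

def rms_error(x, t, w):
    """E_RMS = sqrt(2 * E(w) / N), measured in the same units as t."""
    y = design_matrix(x, len(w) - 1) @ w
    return np.sqrt(np.mean((y - t) ** 2))

x_train, t_train = make_data(10)    # N = 10 training points, as in the lecture
x_test, t_test = make_data(100)     # separate test set of 100 points

for m in (0, 1, 3, 9):
    w_star = fit_polynomial(x_train, t_train, m)
    print(f"M={m}: train E_RMS={rms_error(x_train, t_train, w_star):.3f}  "
          f"test E_RMS={rms_error(x_test, t_test, w_star):.3f}")
```

Run as written, this should reproduce the qualitative picture above: the low-order polynomials underfit, M = 3 generalizes well, and M = 9 drives the training error to essentially zero while the test error grows.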
Least Squares is a Case of Maximum Likelihood
• It is unsatisfying to have to limit the number of parameters according to the size of the training set
• It is more reasonable to choose the model complexity according to the complexity of the problem
• The least squares approach is a specific case of maximum likelihood
  – Over-fitting is a general property of maximum likelihood
• A Bayesian approach avoids the over-fitting problem
  – The number of parameters can greatly exceed the number of data points
  – The effective number of parameters adapts automatically to the size of the data set

Regularization of Least Squares
• Allows relatively complex models to be used with data sets of limited size
• Add a penalty term to the error function to discourage the coefficients from reaching large values:
  \tilde{E}(w) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, w) - t_n \}^2 + \frac{\lambda}{2} \| w \|^2
• λ determines the relative importance of the regularization term compared with the error term
• This regularized error can still be minimized exactly in closed form (see the sketch after these notes)
• Known as shrinkage in statistics and as weight decay in neural networks

Effect of the Regularizer
• (Figure: M = 9 polynomials fitted with the regularized error function, comparing no regularizer (λ = 0) with a large regularizer (λ = 1))

Impact of Regularization on Error
• λ controls the complexity of the model and hence the degree of over-fitting, analogous to the choice of M
• Suggested approach (illustrated with the M = 9 polynomial):
  – Use the training set to determine the coefficients w, for different values of M or λ
  – Use a validation set (hold-out set) to optimize the model complexity (M or λ)

Concluding Remarks on Regression
• This approach partitions the data into a training set, used to determine the coefficients w, and a separate validation set (hold-out set), used to optimize the model complexity M or λ
• More sophisticated approaches are less wasteful of training data
• A more principled approach is based on probability theory
• Classification is a special case of regression in which the target value is discrete
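Below is a minimal sketch of the regularized (weight-decay) fit, assuming the penalized error \tilde{E}(w) given above; setting its gradient to zero gives a closed-form solution. The λ values tried, the synthetic data setup, and all names are illustrative assumptions, not taken from the slides.

```python
import numpy as np

# Same kind of synthetic training data as in the previous sketch (assumed setup).
rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 10)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.3, size=10)

def fit_regularized(x, t, m, lam):
    """Minimize E~(w) = 1/2 * sum_n (y(x_n, w) - t_n)**2 + lam/2 * ||w||**2.
    Setting the gradient to zero gives the closed form
        w* = (A^T A + lam * I)^{-1} A^T t,
    where A is the design matrix with entries x_n ** j."""
    A = np.vander(x, m + 1, increasing=True)
    return np.linalg.solve(A.T @ A + lam * np.eye(m + 1), A.T @ t)

# M = 9 with increasing amounts of weight decay.  lam = 0 reproduces the
# unregularized (over-fitted, ill-conditioned) solution; the other values are
# illustrative choices for a moderate and a large regularizer.
for lam in (0.0, np.exp(-18), 1.0):
    w_star = fit_regularized(x_train, t_train, 9, lam)
    print(f"lambda = {lam:.1e}: largest |w_j| = {np.abs(w_star).max():.1f}")
```

As λ grows, the printed coefficient magnitudes should shrink, mirroring the table of w* values discussed earlier: the penalty pulls the wildly oscillating M = 9 solution back towards a smoother fit.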

