Pitt CS 2750 - Designing a learning system

CS 2750 Machine Learning
Milos [email protected] Sennott Square, x4-8845
http://www.cs.pitt.edu/~milos/courses/cs2750/

Lecture 2
Designing a learning system

Typical learning
Three basic steps:
• Select a model or a set of models (with parameters)
  E.g. $y = ax + b$
• Select the error function to be optimized
  E.g. $\frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i))^2$
• Find the set of parameters optimizing the error function
  – The model and parameters with the smallest error represent the best fit of the model to the data
But there are problems one must be careful about …

Learning
Problem
• We fit the model based on past experience (past examples seen)
• But ultimately we are interested in learning the mapping that performs well on the whole population of examples
Training data: Data used to fit the parameters of the model
Training error: $\frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i))^2$
True (generalization) error (over the whole unknown population): $E_{(x,y)}[(y - f(x))^2]$ (mean squared error)
Training error tries to approximate the true error!
Does a good training error imply a good generalization error?

Overfitting
• Assume we have a set of 10 points and we consider polynomial functions as our possible models
[Figure: plot, x from -2 to 2, y from -8 to 10]

Overfitting
• Fitting a linear function with the square error
• Error is nonzero
[Figure: plot, x from -2 to 2, y from -8 to 12]

Overfitting
• Linear vs. cubic polynomial
• Higher order polynomial leads to a better fit, smaller error
[Figure: plot, x from -2 to 2, y from -8 to 12]

Overfitting
• Is it always good to minimize the error of the observed data?
[Figure: plot, x from -2 to 2, y from -8 to 12]

Overfitting
• For 10 data points, the degree 9 polynomial gives a perfect fit (Lagrange interpolation). Error is zero.
• Is it always good to minimize the training error?
[Figure: plot, x from -1.5 to 1.5, y from -8 to 10]
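The question above can be explored with a small sketch. It assumes a hypothetical population, y = 2x + 1 plus Gaussian noise with x in [-2, 2], which is not from the slides; the helper names (fit_line, lagrange, mse, sample) are illustrative as well. It fits 10 points with a least-squares line and with the degree 9 Lagrange interpolant, then compares the training error with the mean squared error on a large fresh sample, used here as a stand-in for the true error.

```python
import random

def fit_line(xs, ys):
    """Least-squares line y = a*x + b (closed form)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = my - a * mx
    return lambda x: a * x + b

def lagrange(xs, ys):
    """Polynomial of degree len(xs) - 1 passing through every point."""
    def f(x):
        total = 0.0
        for k, (xk, yk) in enumerate(zip(xs, ys)):
            term = yk
            for j, xj in enumerate(xs):
                if j != k:
                    term *= (x - xj) / (xk - xj)
            total += term
        return total
    return f

def mse(f, xs, ys):
    """Mean squared error: 1/n * sum_i (y_i - f(x_i))^2."""
    return sum((y - f(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def sample(rng, xs):
    """Hypothetical population (an assumption): y = 2x + 1 + N(0, 1) noise."""
    return [2 * x + 1 + rng.gauss(0, 1) for x in xs]

rng = random.Random(0)
train_x = [-2 + 4 * i / 9 for i in range(10)]   # 10 training inputs
train_y = sample(rng, train_x)
fresh_x = [rng.uniform(-2, 2) for _ in range(2000)]  # unseen examples
fresh_y = sample(rng, fresh_x)

models = [("linear", fit_line(train_x, train_y)),
          ("degree 9", lagrange(train_x, train_y))]
results = {}
for name, f in models:
    results[name] = (mse(f, train_x, train_y), mse(f, fresh_x, fresh_y))
    print(f"{name}: training error = {results[name][0]:.3f}, "
          f"fresh-sample error = {results[name][1]:.3f}")
```

Under these assumptions the interpolant reaches zero training error, while its error on the fresh sample should come out far larger than that of the line.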
Overfitting
• For 10 data points, degree 9 polynomial gives a perfect fit (Lagrange interpolation). Error is zero.
• Is it always good to minimize the training error? NO!
• More important: How do we perform on the unseen data?
[Figure: plot, x from -1.5 to 1.5, y from -8 to 10]

Overfitting
Situation when the training error is low and the generalization error is high. Causes of the phenomenon:
• Model with a large number of parameters (degrees of freedom)
• Small data size (as compared to the complexity of the model)
[Figure: plot, x from -1.5 to 1.5, y from -8 to 10]

How to evaluate the learner’s performance?
• Generalization error is the true error for the population of examples we would like to optimize: $E_{(x,y)}[(y - f(x))^2]$
• But it cannot be computed exactly
• Sample mean only approximates the true mean: $\frac{1}{n} \sum_{i=1,..n} (y_i - f(x_i))^2$
• Optimizing the (mean) training error can lead to overfitting, i.e. the training error may not properly reflect the generalization error
• So how to test the generalization error?

How to evaluate the learner’s performance?
• Generalization error is the true error for the population of examples we would like to optimize: $E_{(x,y)}[(y - f(x))^2]$
• Sample mean only approximates it
• How to measure the generalization error?
• Two ways:
  – Theoretical: Law of Large Numbers
    • statistical bounds on the difference between true and sample mean errors
  – Practical: Use a separate data set with m data samples to test
    • (Mean) test error: $\frac{1}{m} \sum_{j=1,..m} (y_j - f(x_j))^2$

Basic experimental setup to test the learner’s performance
1. Take a dataset D and divide it into:
   • Training data set
   • Testing data set
2. Use the training set and your favorite ML algorithm to train the learner
3. Test (evaluate) the learner on the testing data set
• The results on the testing set can be used to compare different learners powered with different models and learning algorithms

Design of a learning system (first view)
Data -> Model selection -> Learning -> Application or Testing

Design of a learning system
1. Data: $D = \{d_1, d_2, .., d_n\}$
2. Model selection:
• Select a model or a set of models (with parameters)
  E.g. $y = ax + b + \varepsilon$, $\varepsilon = N(0, \sigma)$
• Select the error function to be optimized
  E.g. $\frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i))^2$
3. Learning:
• Find the set of parameters optimizing the error function
  – The model and parameters with the smallest error
4. Application:
• Apply the learned model
  – E.g. predict ys for new inputs x using learned $f(x)$

Design cycle
Data -> Feature selection -> Model selection -> Learning -> Evaluation
Require some prior knowledge

Data
Data may need a lot of:
• Cleaning
• Preprocessing (conversions)
Cleaning:
– Get rid of errors, noise
– Removal of redundancies
Preprocessing:
– Renaming
– Rescaling (normalization)
– Discretizations
– Abstraction
– Aggregation
– New attributes

Data preprocessing
• Renaming (relabeling) categorical values to numbers
  – dangerous in conjunction with some learning methods
  – numbers will impose an order that is not warranted
• Rescaling (normalization): continuous values transformed to some range, typically [-1, 1] or [0, 1]
• Discretizations (binning): continuous values to a finite set of discrete values
• Abstraction: merge together categorical values
• Aggregation: summary or aggregation operations, such as minimum value, maximum value etc.
• New attributes:
  – example: obesity-factor = weight/height

Data biases
• Watch out for data biases:
  – Try to understand the data source
  – It is very easy to derive “unexpected” results when data used for analysis and learning are biased (pre-selected)
  – Results (conclusions) derived for pre-selected data do not hold in general!

Data biases
Example 1: Risks in pregnancy study
– Sponsored by DARPA at military hospitals
– Study of a large sample of pregnant women
– Conclusion: the factor with the largest impact on reducing risks during pregnancy (statistically significant) is a pregnant woman being single
– Single woman -> the smallest risk
– What is wrong?

Data
Example 2: Stock market trading (example by Andrew Lo)
– Data on stock performances of
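Returning to the basic experimental setup for testing a learner (divide D, train on one part, test on the other), a minimal sketch follows. The dataset is synthetic and hypothetical (y = 3x - 1 plus Gaussian noise), and the two learners being compared, a constant predictor and a least-squares line, are illustrative choices that do not come from the slides.

```python
import random

def split_dataset(data, test_fraction, rng):
    """Step 1: divide D into a training set and a testing set."""
    shuffled = list(data)
    rng.shuffle(shuffled)
    m = int(len(shuffled) * test_fraction)
    return shuffled[m:], shuffled[:m]   # (train, test)

def train_constant(train):
    """Learner A (assumed for illustration): predict the mean training target."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y

def train_line(train):
    """Learner B (assumed for illustration): least-squares line y = a*x + b."""
    n = len(train)
    mx = sum(x for x, _ in train) / n
    my = sum(y for _, y in train) / n
    sxx = sum((x - mx) ** 2 for x, _ in train)
    sxy = sum((x - mx) * (y - my) for x, y in train)
    a = sxy / sxx
    b = my - a * mx
    return lambda x: a * x + b

def mean_test_error(f, test):
    """Step 3: (mean) test error, 1/m * sum_j (y_j - f(x_j))^2."""
    return sum((y - f(x)) ** 2 for x, y in test) / len(test)

# Hypothetical dataset D of 200 (x, y) pairs.
rng = random.Random(1)
D = []
for _ in range(200):
    x = rng.uniform(-2, 2)
    D.append((x, 3 * x - 1 + rng.gauss(0, 0.5)))

train, test = split_dataset(D, 0.25, rng)

# Step 2 trains each learner on the training set only;
# step 3 scores it on the held-out testing set.
errors = {
    "constant": mean_test_error(train_constant(train), test),
    "line": mean_test_error(train_line(train), test),
}
for name, err in errors.items():
    print(f"{name}: test error = {err:.3f}")
```

Because both learners are scored on the same held-out examples, their test errors can be compared directly; on this synthetic data the line should win by a wide margin.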

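Several of the operations listed under Data preprocessing can be sketched in a few lines of plain Python. The function names and sample values are hypothetical. One-hot (indicator) encoding is shown as one common way to avoid imposing an order on categorical values; it is a swapped-in technique, not something the slides prescribe.

```python
def rescale(values, lo=0.0, hi=1.0):
    """Rescaling (normalization): map continuous values to [lo, hi]."""
    vmin, vmax = min(values), max(values)
    return [lo + (hi - lo) * (v - vmin) / (vmax - vmin) for v in values]

def discretize(values, n_bins):
    """Discretization (binning): equal-width bins, indices 0..n_bins-1."""
    vmin, vmax = min(values), max(values)
    width = (vmax - vmin) / n_bins
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - vmin) / width), n_bins - 1) for v in values]

def one_hot(labels):
    """Encode categories as indicator vectors instead of ordered numbers."""
    categories = sorted(set(labels))
    return [[1 if lab == c else 0 for c in categories] for lab in labels]

# Hypothetical sample data.
weights = [50.0, 70.0, 90.0]
heights = [1.5, 1.75, 2.0]
colors = ["red", "green", "blue", "red"]

scaled_01 = rescale(weights)             # range [0, 1]
scaled_pm1 = rescale(weights, -1.0, 1.0) # range [-1, 1]
bins = discretize([1.0, 2.0, 3.0, 4.0, 9.0], 4)
encoded = one_hot(colors)

# New attribute, following the slide's example: weight / height.
obesity_factor = [w / h for w, h in zip(weights, heights)]

print(scaled_01, scaled_pm1, bins, encoded, obesity_factor)
```

Renaming the colors to 0, 1, 2 instead would suggest that one color lies between the other two, which is the unwarranted ordering the slide warns about.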
