UT Dallas CS 6375 - 14.vc

Machine Learning CS6375, Spring 2015: Learning Theory
Instructor: Yang Liu
Acknowledgement: Vincent Ng, Andrew Moore, Tom Mitchell

Learning Theory
- Bias and variance
- Probably approximately correct (PAC) learning
- Vapnik-Chervonenkis (VC) dimension
- Sample complexity
- Mistake bound

Model Loss / Error
- Let D be a set of training instances.
- Squared loss of a model on test case x: (Learn(x, D) - Truth(x))^2
- Expected prediction error: E_D[(Learn(x, D) - Truth(x))^2]

Bias-Variance Decomposition
- E_D[(L(x, D) - T(x))^2] = Noise^2 + Bias^2 + Variance
- Noise^2: a lower bound on performance
- Bias^2: expected error due to model mismatch
- Variance: variation due to the training sample

Regression Illustration of Bias and Variance
- The true function is y = f(x) + e, where e ~ N(0, sigma^2).
- Given a set of training examples (x_i, y_i), we fit a hypothesis h(x) = wx + b to minimize the squared error sum_i (y_i - h(x_i))^2.
- Now, given a new data point x with observed value y = f(x) + e, we want to understand the expected prediction error E[(y - h(x))^2].
- Assuming the data samples are generated from a distribution P, we decompose the expected error into bias, variance, and noise.

(Figure: data sampled from y = x + 2 sin(1.5x) + N(0, 0.2).)

Bias-Variance Decomposition
- Lemma: Var[Z] = E[(Z - E[Z])^2] = E[Z^2] - (E[Z])^2, so E[Z^2] = E[(Z - E[Z])^2] + (E[Z])^2.
- Expected prediction error = Variance + Bias^2 + Noise^2
- Noise^2 = sigma^2: describes how much y varies from f(x).
- Bias = E[h(x)] - f(x): describes the average error of h(x).
- Variance = E[(h(x) - E[h(x)])^2]: describes how much h(x) varies from one training set to another.
- Note: a similar theory can be used for classification.

Bias
- Low bias: linear regression applied to linear data; a 2nd-degree polynomial applied to quadratic data.
- High bias: a constant function; linear regression applied to non-linear data; a neural net with few hidden units applied to non-linear data.

Variance
- Low variance: a constant function; a model that is independent of the training data.
- High variance: a high-degree polynomial; a neural net with many hidden units trained to completion.

Bias-Variance Tradeoff
- Bias^2 + variance is what counts for prediction.
- Often: low bias means high variance, and low variance means high bias.
- Tradeoff: bias^2 vs. variance.

(Figure: prediction error vs. model complexity, from Hastie, Tibshirani & Friedman, The Elements of Statistical Learning, 2001.)

Complexity of Models
- E = bias^2 + var (figure: bias^2 and variance as functions of model complexity).
- Usually the bias is a decreasing function of complexity, while the variance is an increasing function of complexity.

Bias and Variance in Ensembles
- Averaging reduces variance: Var(mean of N independent copies of X) = Var(X)/N.
- Bagging reduces variance by averaging (if the bootstrap-replicate approximation were correct); it has little effect on bias. In practice the models are correlated, so the reduction is smaller than 1/N.
- Boosting: in early iterations it is primarily a bias-reducing method; in later iterations it is primarily variance-reducing.

Bias and Variance: Summary
- For regression problems (squared-error loss), the expected error can be decomposed into Bias^2 + Variance + Noise^2.
- Bias arises when the classifier cannot represent the true function, i.e., it underfits the data.
- Variance arises when the classifier overfits the data.
- There is often a tradeoff between bias and variance:
  - k-NN: increasing k typically increases bias and reduces variance.
  - Decision trees: increasing depth typically increases variance and reduces bias.
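As a quick numerical check of the decomposition above, the sketch below repeatedly draws training sets from the slides' example curve y = x + 2 sin(1.5x) + noise, fits h(x) = wx + b by least squares, and compares the measured expected squared error at one query point against bias^2 + variance + noise^2. The noise level (taken here as a standard deviation of 0.2), the sampling interval, the query point, and the sample sizes are illustrative assumptions, not values specified in the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.2                                 # assumed noise std (the slide writes N(0, 0.2))
f = lambda x: x + 2 * np.sin(1.5 * x)       # true function from the slides' example

x_test = 3.0                                # arbitrary query point
n_train, n_repeats = 20, 5000
preds, sq_errors = [], []
for _ in range(n_repeats):
    # Draw a fresh training set and fit the linear hypothesis h(x) = w*x + b.
    x = rng.uniform(0.0, 5.0, size=n_train)
    y = f(x) + rng.normal(0.0, sigma, size=n_train)
    w, b = np.polyfit(x, y, deg=1)
    h = w * x_test + b
    preds.append(h)
    # A fresh noisy observation at the query point, for the expected error.
    y_new = f(x_test) + rng.normal(0.0, sigma)
    sq_errors.append((y_new - h) ** 2)

preds = np.array(preds)
bias2 = (preds.mean() - f(x_test)) ** 2     # (average prediction - truth)^2
variance = preds.var()                      # spread of h(x_test) across training sets
noise2 = sigma ** 2                         # irreducible error
print("bias^2 + variance + noise^2 :", bias2 + variance + noise2)
print("empirical E[(y - h(x))^2]   :", np.mean(sq_errors))
```

Up to Monte Carlo error, the two printed quantities should agree.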
Learning Theory
- What general laws constrain inductive learning?
- We seek a theory to relate:
  - the probability of successful learning,
  - the number of training examples,
  - the complexity of the hypothesis space,
  - the accuracy to which the target concept is approximated,
  - the manner in which training examples are presented.

Probably Approximately Correct (PAC) Learning
- How many training examples are sufficient to learn the target concept?
- Given:
  - a set of instances X,
  - a set of hypotheses H,
  - a set of possible target concepts C,
  - training instances generated by a fixed, unknown probability distribution D over X.
- The learner observes a sequence D of training examples of the form (x, c(x)) for some target concept c.
- The learner must output a hypothesis h estimating c.
- Two notions of error: training error and testing error.

PAC Learning
- Choose the number of samples R such that, with probability less than δ, we select a bad h_est, i.e., one that makes mistakes more than an ε fraction of the time (PAC).
- This can be achieved by choosing R such that R ≥ (1/ε)(ln H + ln(1/δ)), where H is the number of hypotheses.

PAC Learning
- Imagine we are doing classification with categorical inputs.
- All inputs and outputs are binary; the data is noiseless.
- There is a machine f(x, h) which has H possible hypotheses, called h1, h2, ..., hH.

Example 1
- f(x, h) consists of all logical sentences about x1, x2, ..., xm that contain only logical ands.
- Example: x1 and x3 and x9.
- Question: if there are m attributes, how many hypotheses are in f? H = 2^m.
- R ≥ 0.69 (1/ε)(m + log2(1/δ))

Example 2
- f(x, h) consists of all logical sentences about x1, x2, ..., xm, or their negations, that contain only logical ands.
- Example: x1 and (not x3) and x9.
- Question: if there are m attributes, what is the size of the complete set of hypotheses in f? H = 3^m.
- R ≥ 0.69 (1/ε)((log2 3) m + log2(1/δ))

Example 3
- f(x, h) consists of all truth tables mapping combinations of input attributes to true and false.
- Question: if there are m attributes, what is the size of the complete set of hypotheses in f? H = 2^(2^m).
- R ≥ 0.69 (1/ε)(2^m + log2(1/δ))

PAC Learning
- R ≥ (1/ε)(ln H + ln(1/δ))
- What about very large H, or an infinite hypothesis space?
- Consider another measure of the complexity of the hypothesis space: the VC dimension.

A Learning Machine
- A learning machine f takes an input x and transforms it, somehow using weights, into a predicted output y_est.

(Figure slides: Example 1, Example 2, Example 3.)

How Do We Characterize Power?
- Different machines have different amounts of power.
- Tradeoff:
  - More power: can model more complex classifiers, but might overfit.
  - Less power: not going to overfit, but restricted in what it can model.
- How do we characterize the amount of power?

Some Definitions
- Given some machine f,
- and under the assumption that all training points (xk, yk) were drawn from some distribution,
- and under the assumption that future test points will be drawn from the same distribution,
- define: ...

VC Dimension
- Given some machine f, let h be its VC dimension.
- h is a measure of f's power; h does not depend on the choice of training set.
- Two questions: How do we compute the VC dimension of a machine? Why do we care about the VC dimension?

Shattering
- Machine f can shatter a set of points x1, x2, ..., xr if and only if, for every possible training set of the form (x1, y1), (x2, y2), ..., (xr, yr), there exists some setting of f's weights that achieves zero training error.
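To make the definition of shattering concrete, here is a minimal sketch assuming the machine f is a linear classifier sign(w·x + b) in the plane; the slides do not give this code, and the helper names and point sets are illustrative. For each of the 2^r labelings of a point set it checks, via a linear-programming feasibility problem, whether some (w, b) achieves zero training error.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def separable(points, labels):
    """Is there (w, b) with y_i * (w . x_i + b) >= 1 for all i?
    Posed as an LP feasibility problem over the variables [w_1, ..., w_d, b]."""
    X = np.asarray(points, dtype=float)
    y = np.asarray(labels, dtype=float)
    n, d = X.shape
    A_ub = -y[:, None] * np.hstack([X, np.ones((n, 1))])   # -y_i*(w.x_i + b) <= -1
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=-np.ones(n),
                  bounds=[(None, None)] * (d + 1), method="highs")
    return res.success

def shattered(points):
    """f shatters the points iff every one of the 2^r labelings is realizable."""
    return all(separable(points, labels)
               for labels in itertools.product([-1.0, 1.0], repeat=len(points)))

print(shattered([(0, 0), (1, 0), (0, 1)]))          # 3 points in general position: True
print(shattered([(0, 0), (1, 0), (0, 1), (1, 1)]))  # 4 corners of a square: False (XOR labeling)
```

The first point set is shattered while the second is not, which is consistent with the VC dimension of linear classifiers in the plane being 3.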

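Returning to the bound R ≥ (1/ε)(ln H + ln(1/δ)) used in Examples 1-3 above, here is a minimal sketch of the arithmetic; the values m = 10, ε = 0.05, and δ = 0.01 are arbitrary illustrative choices, not values from the lecture.

```python
import math

def pac_sample_bound(ln_H, epsilon, delta):
    """Smallest integer R with R >= (1/epsilon) * (ln H + ln(1/delta))."""
    return math.ceil((ln_H + math.log(1.0 / delta)) / epsilon)

m, eps, delta = 10, 0.05, 0.01
print(pac_sample_bound(m * math.log(2), eps, delta))       # Example 1: H = 2^m (conjunctions)
print(pac_sample_bound(m * math.log(3), eps, delta))       # Example 2: H = 3^m (literals may be negated)
print(pac_sample_bound(2 ** m * math.log(2), eps, delta))  # Example 3: H = 2^(2^m) (all truth tables)
```

For Example 3 the ln H term grows like 2^m, which is exactly the blow-up that motivates replacing ln H with the VC dimension for very large or infinite hypothesis spaces.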
