UT Dallas CS 6375 - 14.vc

Machine Learning CS6375, Spring 2015: Learning Theory
Instructor: Yang Liu
Acknowledgement: Vincent Ng, Andrew Moore, Tom Mitchell

Learning Theory
- Bias and variance
- Probably approximately correct (PAC) learning
- Vapnik-Chervonenkis (VC) dimension
- Sample complexity
- Mistake bound

Model Loss / Error
- Let D be a set of training instances.
- Squared loss of a model on test case x: (Learn(x, D) - Truth(x))^2
- Expected prediction error: E_D[(Learn(x, D) - Truth(x))^2]

Bias-Variance Decomposition
- E_D[(L(x, D) - T(x))^2] = Noise^2 + Bias^2 + Variance
- Noise^2: a lower bound on performance
- Bias^2: expected error due to model mismatch
- Variance: variation due to the training sample

Regression Illustration of Bias and Variance
- The true function is y = f(x) + e, where e ~ N(0, sigma^2).
- Given a set of training examples (x_i, y_i), we fit a hypothesis h(x) = wx + b to minimize the squared error sum_i (y_i - h(x_i))^2.
- Now, given a new data point x with observed value y = f(x) + e, we want to understand the expected prediction error E[(y - h(x))^2].
- Assuming the data samples are generated from a distribution P, we decompose the expected error into bias, variance, and noise.

(Figure: data sampled from y = x + 2 sin(1.5x) + N(0, 0.2).)

Bias-Variance Decomposition
- Lemma: Var[Z] = E[(Z - E[Z])^2] = E[Z^2] - (E[Z])^2, so E[Z^2] = E[(Z - E[Z])^2] + (E[Z])^2.
- Expected prediction error = Variance + Bias^2 + Noise^2
- Noise^2 = sigma^2: describes how much y varies from f(x).
- Bias = E[h(x)] - f(x): describes the average error of h(x).
- Variance = E[(h(x) - E[h(x)])^2]: describes how much h(x) varies from one training set to another.
- Note: a similar theory can be used for classification.

Bias
- Low bias: linear regression applied to linear data; a 2nd-degree polynomial applied to quadratic data.
- High bias: a constant function; linear regression applied to non-linear data; a neural net with few hidden units applied to non-linear data.

Variance
- Low variance: a constant function; a model that is independent of the training data.
- High variance: a high-degree polynomial; a neural net with many hidden units trained to completion.

Bias-Variance Tradeoff
- Bias^2 + variance is what counts for prediction.
- Often: low bias means high variance, and low variance means high bias.
- Tradeoff: bias^2 vs. variance.

(Figure: prediction error vs. model complexity, from Hastie, Tibshirani & Friedman, The Elements of Statistical Learning, 2001.)

Complexity of Models
- E = bias^2 + var (figure: bias^2 and variance as functions of model complexity).
- Usually the bias is a decreasing function of complexity, while the variance is an increasing function of complexity.

Bias and Variance in Ensembles
- Averaging reduces variance: Var(mean of N independent copies of X) = Var(X)/N.
- Bagging reduces variance by averaging (if the bootstrap-replicate approximation were correct); it has little effect on bias. In practice the models are correlated, so the reduction is smaller than 1/N.
- Boosting: in early iterations it is primarily a bias-reducing method; in later iterations it is primarily variance-reducing.

Bias and Variance: Summary
- For regression problems (squared-error loss), the expected error can be decomposed into Bias^2 + Variance + Noise^2.
- Bias arises when the classifier cannot represent the true function, i.e., it underfits the data.
- Variance arises when the classifier overfits the data.
- There is often a tradeoff between bias and variance:
  - k-NN: increasing k typically increases bias and reduces variance.
  - Decision trees: increasing depth typically increases variance and reduces bias.
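As a quick numerical check of the decomposition above, the sketch below repeatedly draws training sets from the slides' example curve y = x + 2 sin(1.5x) + noise, fits h(x) = wx + b by least squares, and compares the measured expected squared error at one query point against bias^2 + variance + noise^2. The noise level (taken here as a standard deviation of 0.2), the sampling interval, the query point, and the sample sizes are illustrative assumptions, not values specified in the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.2                                 # assumed noise std (the slide writes N(0, 0.2))
f = lambda x: x + 2 * np.sin(1.5 * x)       # true function from the slides' example

x_test = 3.0                                # arbitrary query point
n_train, n_repeats = 20, 5000
preds, sq_errors = [], []
for _ in range(n_repeats):
    # Draw a fresh training set and fit the linear hypothesis h(x) = w*x + b.
    x = rng.uniform(0.0, 5.0, size=n_train)
    y = f(x) + rng.normal(0.0, sigma, size=n_train)
    w, b = np.polyfit(x, y, deg=1)
    h = w * x_test + b
    preds.append(h)
    # A fresh noisy observation at the query point, for the expected error.
    y_new = f(x_test) + rng.normal(0.0, sigma)
    sq_errors.append((y_new - h) ** 2)

preds = np.array(preds)
bias2 = (preds.mean() - f(x_test)) ** 2     # (average prediction - truth)^2
variance = preds.var()                      # spread of h(x_test) across training sets
noise2 = sigma ** 2                         # irreducible error
print("bias^2 + variance + noise^2 :", bias2 + variance + noise2)
print("empirical E[(y - h(x))^2]   :", np.mean(sq_errors))
```

Up to Monte Carlo error, the two printed quantities should agree.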
Learning Theory
- What general laws constrain inductive learning?
- We seek a theory to relate:
  - the probability of successful learning,
  - the number of training examples,
  - the complexity of the hypothesis space,
  - the accuracy to which the target concept is approximated,
  - the manner in which training examples are presented.

Probably Approximately Correct (PAC) Learning
- How many training examples are sufficient to learn the target concept?
- Given:
  - a set of instances X,
  - a set of hypotheses H,
  - a set of possible target concepts C,
  - training instances generated by a fixed, unknown probability distribution D over X.
- The learner observes a sequence D of training examples of the form (x, c(x)) for some target concept c.
- The learner must output a hypothesis h estimating c.
- Two notions of error: training error and testing error.

PAC Learning
- Choose the number of samples R such that, with probability less than δ, we select a bad h_est, i.e., one that makes mistakes more than an ε fraction of the time (PAC).
- This can be achieved by choosing R such that R ≥ (1/ε)(ln H + ln(1/δ)), where H is the number of hypotheses.

PAC Learning
- Imagine we are doing classification with categorical inputs.
- All inputs and outputs are binary; the data is noiseless.
- There is a machine f(x, h) which has H possible hypotheses, called h1, h2, ..., hH.

Example 1
- f(x, h) consists of all logical sentences about x1, x2, ..., xm that contain only logical ands.
- Example: x1 and x3 and x9.
- Question: if there are m attributes, how many hypotheses are in f? H = 2^m.
- R ≥ 0.69 (1/ε)(m + log2(1/δ))

Example 2
- f(x, h) consists of all logical sentences about x1, x2, ..., xm, or their negations, that contain only logical ands.
- Example: x1 and (not x3) and x9.
- Question: if there are m attributes, what is the size of the complete set of hypotheses in f? H = 3^m.
- R ≥ 0.69 (1/ε)((log2 3) m + log2(1/δ))

Example 3
- f(x, h) consists of all truth tables mapping combinations of input attributes to true and false.
- Question: if there are m attributes, what is the size of the complete set of hypotheses in f? H = 2^(2^m).
- R ≥ 0.69 (1/ε)(2^m + log2(1/δ))

PAC Learning
- R ≥ (1/ε)(ln H + ln(1/δ))
- What about very large H, or an infinite hypothesis space?
- Consider another measure of the complexity of the hypothesis space: the VC dimension.

A Learning Machine
- A learning machine f takes an input x and transforms it, somehow using weights, into a predicted output y_est.

(Figure slides: Example 1, Example 2, Example 3.)

How Do We Characterize Power?
- Different machines have different amounts of power.
- Tradeoff:
  - More power: can model more complex classifiers, but might overfit.
  - Less power: not going to overfit, but restricted in what it can model.
- How do we characterize the amount of power?

Some Definitions
- Given some machine f,
- and under the assumption that all training points (xk, yk) were drawn from some distribution,
- and under the assumption that future test points will be drawn from the same distribution,
- define: ...

VC Dimension
- Given some machine f, let h be its VC dimension.
- h is a measure of f's power; h does not depend on the choice of training set.
- Two questions: How do we compute the VC dimension of a machine? Why do we care about the VC dimension?

Shattering
- Machine f can shatter a set of points x1, x2, ..., xr if and only if, for every possible training set of the form (x1, y1), (x2, y2), ..., (xr, yr), there exists some setting of f's weights that achieves zero training error.
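To make the definition of shattering concrete, here is a minimal sketch assuming the machine f is a linear classifier sign(w·x + b) in the plane; the slides do not give this code, and the helper names and point sets are illustrative. For each of the 2^r labelings of a point set it checks, via a linear-programming feasibility problem, whether some (w, b) achieves zero training error.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def separable(points, labels):
    """Is there (w, b) with y_i * (w . x_i + b) >= 1 for all i?
    Posed as an LP feasibility problem over the variables [w_1, ..., w_d, b]."""
    X = np.asarray(points, dtype=float)
    y = np.asarray(labels, dtype=float)
    n, d = X.shape
    A_ub = -y[:, None] * np.hstack([X, np.ones((n, 1))])   # -y_i*(w.x_i + b) <= -1
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=-np.ones(n),
                  bounds=[(None, None)] * (d + 1), method="highs")
    return res.success

def shattered(points):
    """f shatters the points iff every one of the 2^r labelings is realizable."""
    return all(separable(points, labels)
               for labels in itertools.product([-1.0, 1.0], repeat=len(points)))

print(shattered([(0, 0), (1, 0), (0, 1)]))          # 3 points in general position: True
print(shattered([(0, 0), (1, 0), (0, 1), (1, 1)]))  # 4 corners of a square: False (XOR labeling)
```

The first point set is shattered while the second is not, which is consistent with the VC dimension of linear classifiers in the plane being 3.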

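Returning to the bound R ≥ (1/ε)(ln H + ln(1/δ)) used in Examples 1-3 above, here is a minimal sketch of the arithmetic; the values m = 10, ε = 0.05, and δ = 0.01 are arbitrary illustrative choices, not values from the lecture.

```python
import math

def pac_sample_bound(ln_H, epsilon, delta):
    """Smallest integer R with R >= (1/epsilon) * (ln H + ln(1/delta))."""
    return math.ceil((ln_H + math.log(1.0 / delta)) / epsilon)

m, eps, delta = 10, 0.05, 0.01
print(pac_sample_bound(m * math.log(2), eps, delta))       # Example 1: H = 2^m (conjunctions)
print(pac_sample_bound(m * math.log(3), eps, delta))       # Example 2: H = 3^m (literals may be negated)
print(pac_sample_bound(2 ** m * math.log(2), eps, delta))  # Example 3: H = 2^(2^m) (all truth tables)
```

For Example 3 the ln H term grows like 2^m, which is exactly the blow-up that motivates replacing ln H with the VC dimension for very large or infinite hypothesis spaces.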
