UT Dallas CS 6375 - 14.vc

Slide 1: Machine Learning --- CS6375, Spring 2015
Learning Theory
Instructor: Yang Liu
Acknowledgement: Vincent Ng, Andrew Moore, Tom Mitchell

Slide 2: Learning Theory
• Bias and variance
• Probably approximately correct (PAC) learning
• Vapnik-Chervonenkis (VC) dimension
• Sample complexity
• Mistake bound

Slide 3: Model Loss (Error)
Let D be a set of training instances.
Squared loss of the model on test case x: (Learn(x, D) - Truth(x))^2
Expected prediction error: E[(Learn(x, D) - Truth(x))^2]

Slide 4: Bias/Variance Decomposition
E_D[(Learn(x, D) - Truth(x))^2] = Noise^2 + Bias^2 + Variance
• Noise^2 = lower bound on performance
• Bias^2 = expected error due to model mismatch
• Variance = variation due to the training sample

Slide 5: Regression Illustration for Bias/Variance
• The true function is y = f(x) + ε, where ε is N(0, σ^2)
• Given a set of training examples {(xi, yi)}, we fit a hypothesis h(x) = wx + b to minimize the squared error Σi [yi - h(xi)]^2
• Now, given a new data point x* with observed value y* = f(x*) + ε, we want to understand the expected prediction error E[(y* - h(x*))^2]
• Assuming the data samples are generated from a distribution P, we decompose the expected error into bias, variance, and noise

Slide 6: (figure) Example data generated by y = x + 2*sin(1.5*x) + N(0, 0.2)

Slide 7: Bias/Variance Decomposition
Lemma: Var[Z] = E[(Z - E[Z])^2] = E[Z^2] - (E[Z])^2
Equivalently: E[Z^2] = E[(Z - E[Z])^2] + (E[Z])^2

Slide 8: Bias/Variance Decomposition
Expected prediction error = Variance + Bias^2 + Noise^2

Slide 9: Bias/Variance Decomposition
Write E[h(x*)] for the prediction at x* averaged over training sets.
• Noise: σ^2
  – describes how much y* varies from f(x*)
• Bias: [E[h(x*)] - f(x*)]
  – describes the average error of h(x*)
• Variance: E[(h(x*) - E[h(x*)])^2]
  – describes how much h(x*) varies from one training set to another
Note: a similar theory can be used for classification.

Slide 10: Bias
• Low bias
  – linear regression applied to linear data
  – a 2nd-degree polynomial applied to quadratic data
• High bias
  – a constant function
  – linear regression applied to non-linear data
  – a neural net with few hidden units applied to non-linear data

Slide 11: Variance
• Low variance
  – a constant function
  – a model independent of the training data
• High variance
  – a high-degree polynomial
  – a neural net with many hidden units trained to completion

Slide 12: Bias/Variance Tradeoff
• (bias^2 + variance) is what counts for prediction
• Often:
  – low bias => high variance
  – low variance => high bias
• Tradeoff: bias^2 vs. variance

Slide 13: Bias/Variance Tradeoff
(figure) Prediction error vs. model complexity, from Hastie, Tibshirani, Friedman, "Elements of Statistical Learning", 2001

Slide 14: Complexity of Models
Usually the bias is a decreasing function of the complexity, while the variance is an increasing function of the complexity.
(figure) E = bias^2 + var plotted against complexity, with bias^2 falling and var rising

Slide 15: Bias/Variance in Ensembles
Averaging reduces variance: the average of N independent copies of X has variance Var(X)/N.
Bagging:
• Reduces variance by averaging (if the bootstrap-replicate approximation were correct); it has little effect on bias
• In practice the models are correlated, so the reduction is smaller than 1/N
Boosting:
• In early iterations it is primarily a bias-reducing method; in later iterations it is primarily variance-reducing

Slide 16: Bias and Variance Summary
For regression problems (squared-error loss), the expected error can be decomposed into bias^2 + variance + noise.
– Bias arises when the classifier cannot represent the true function (it underfits the data)
– Variance arises when the classifier overfits the data
– There is often a tradeoff between bias and variance
  • k-NN: increasing k typically increases bias and reduces variance
  • decision tree: increasing depth typically increases variance and reduces bias
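To make the decomposition concrete, here is a minimal simulation sketch (not part of the original slides) using the slides' example y = x + 2*sin(1.5*x) + N(0, 0.2) and the linear hypothesis h(x) = wx + b: it refits h on many independent training sets and estimates bias^2, variance, and noise^2 at a single test point. The training-set size, number of resamples, and the test point x_star are illustrative assumptions.

```python
# Sketch: estimate bias^2, variance, and noise^2 at one test point by refitting
# a least-squares line on many independent training sets. All constants below
# (N_TRAIN, N_SETS, x_star, the sampling interval) are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return x + 2.0 * np.sin(1.5 * x)      # true function f(x) from the slides

SIGMA = 0.2                               # noise std dev, so noise^2 = 0.04
N_TRAIN, N_SETS = 20, 2000                # training-set size, number of resamples
x_star = 3.0                              # the test point x*

# Draw many independent training sets and fit h(x) = w*x + b to each one.
preds = np.empty(N_SETS)
for s in range(N_SETS):
    x = rng.uniform(0.0, 5.0, N_TRAIN)
    y = f(x) + rng.normal(0.0, SIGMA, N_TRAIN)
    w, b = np.polyfit(x, y, 1)            # least-squares line
    preds[s] = w * x_star + b             # h(x*) for this training set

h_bar = preds.mean()                      # E[h(x*)], the average prediction
bias2 = (h_bar - f(x_star)) ** 2          # bias^2 = (E[h(x*)] - f(x*))^2
variance = preds.var()                    # E[(h(x*) - E[h(x*)])^2]
noise2 = SIGMA ** 2                       # noise^2 = sigma^2

# Direct estimate of E[(y* - h(x*))^2] for comparison with the sum.
y_star = f(x_star) + rng.normal(0.0, SIGMA, N_SETS)
expected_err = np.mean((y_star - preds) ** 2)
print(f"bias^2={bias2:.4f} variance={variance:.4f} noise^2={noise2:.4f}")
print(f"sum={bias2 + variance + noise2:.4f} vs. direct estimate={expected_err:.4f}")
```

With enough resamples, the printed sum bias^2 + variance + noise^2 should closely match the directly estimated expected error, which is what the decomposition on the slides asserts.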
Slide 17: Learning Theory
What general laws constrain inductive learning? We seek a theory to relate:
• Probability of successful learning
• Number of training examples
• Complexity of the hypothesis space
• Accuracy to which the target concept is approximated
• Manner in which the training examples are presented

Slide 18: Probably Approximately Correct (PAC) Learning
How many training examples are sufficient to learn the target concept?
Given:
– a set of instances X
– a set of hypotheses H
– a set of possible target concepts C
– training instances generated by a fixed, unknown probability distribution D over X
• The learner observes a sequence D of training examples of the form <x, c(x)> for some target concept c
• The learner must output a hypothesis h estimating c
• Two notions of error: training error and test error

Slide 19: PAC Learning
Choose the number of samples R such that with probability less than δ we select a bad h_est (one that makes mistakes more than a fraction ε of the time); this is the PAC criterion.
This can be achieved by choosing R such that
  R >= (1/ε)(ln H + ln(1/δ))
where H is the number of hypotheses.

Slide 20: PAC Learning
• Imagine we are doing classification with categorical inputs
• All inputs and outputs are binary
• The data is noiseless
• There is a machine f(x, h) which has H possible hypotheses, called h1, h2, ..., hH

Slide 21: Example 1
• f(x, h) consists of all logical sentences about x1, x2, ..., xm that contain only logical ANDs
• Example: x1 ^ x3 ^ x9
• Question: if there are m attributes, how many hypotheses are in f?
  H = 2^m, so R >= (0.69/ε)(m + log2(1/δ))

Slide 22: Example 2
• f(x, h) consists of all logical sentences about x1, x2, ..., xm or their negations that contain only logical ANDs
• Example: x1 ^ ~x3 ^ x9
• Question: if there are m attributes, what is the size of the complete set of hypotheses in f?
  H = 3^m, so R >= (0.69/ε)(m log2(3) + log2(1/δ))

Slide 23: Example 3
• f(x, h) consists of all truth tables mapping combinations of input attributes to true and false
• Question: if there are m attributes, what is the size of the complete set of hypotheses in f?
  H = 2^(2^m), so R >= (0.69/ε)(2^m + log2(1/δ))

Slide 24: PAC Learning
R >= (1/ε)(ln H + ln(1/δ))
What about a very large H, or an infinite hypothesis space?
Consider another measure of the complexity of the hypothesis space: the VC dimension.

Slide 25: A Learning Machine
• A learning machine f takes an input x and transforms it, somehow using weights α, into a predicted output y_est = +/- 1

Slides 26-28: Example 1, Example 2, Example 3 (figures)

Slide 29: How do we characterize "power"?
• Different machines have different amounts of "power"
• Tradeoff between:
  – More power: can model more complex classifiers, but might overfit
  – Less power: not going to overfit, but restricted in what it can model
• How do we characterize the amount of power?

Slide 30: Some Definitions
• Given some machine f
• And under the assumption that all training points (xk, yk) were drawn from some distribution
• And under the assumption that future test points will be drawn from the same distribution
Define

Slide 31: Some Definitions
• Given some machine f
• And under the assumption that all training points (xk, yk) were drawn from some distribution
• And
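The three PAC examples above reduce to plugging |H| into the bound R >= (1/ε)(ln H + ln(1/δ)). The short sketch below (assumed helper code, not from the slides) evaluates the bound for all three hypothesis spaces with illustrative values m = 10, ε = 0.05, δ = 0.01; the function name pac_sample_bound is made up for this example.

```python
# Sketch: evaluate the PAC sample-complexity bound R >= (1/eps)*(ln H + ln(1/delta))
# for the three hypothesis spaces from the slides. m, eps, delta are illustrative.
import math

def pac_sample_bound(ln_H, eps, delta):
    """Smallest integer R satisfying R >= (1/eps) * (ln H + ln(1/delta)).

    ln_H is passed as ln|H| directly so that huge spaces such as |H| = 2^(2^m)
    never have to be materialized as integers.
    """
    return math.ceil((ln_H + math.log(1.0 / delta)) / eps)

m, eps, delta = 10, 0.05, 0.01   # 10 binary attributes, 95% accuracy, 99% confidence

# Example 1: conjunctions of positive literals, |H| = 2^m
print("conjunctions:   R >=", pac_sample_bound(m * math.log(2), eps, delta))
# Example 2: conjunctions of literals or their negations, |H| = 3^m
print("with negations: R >=", pac_sample_bound(m * math.log(3), eps, delta))
# Example 3: arbitrary truth tables, |H| = 2^(2^m)
print("truth tables:   R >=", pac_sample_bound((2 ** m) * math.log(2), eps, delta))
```

Passing ln|H| rather than |H| itself keeps Example 3 tractable and makes explicit why the bound grows only logarithmically in the number of hypotheses but exponentially in m for truth tables, which is the motivation the slides give for moving to the VC dimension.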

