UT Dallas CS 6375 - vc dimension_IEEE

Introduction to Statistical Learning Theory

Olivier Bousquet (1), Stéphane Boucheron (2), and Gábor Lugosi (3)

(1) Max-Planck Institute for Biological Cybernetics, Spemannstr. 38, D-72076 Tübingen, Germany
[email protected], home page: http://www.kyb.mpg.de/~bousquet
(2) Université de Paris-Sud, Laboratoire d'Informatique, Bâtiment 490, F-91405 Orsay Cedex, France
[email protected], home page: http://www.lri.fr/~bouchero
(3) Department of Economics, Pompeu Fabra University, Ramon Trias Fargas 25-27, 08005 Barcelona, Spain
[email protected], home page: http://www.econ.upf.es/~lugosi

Abstract. The goal of statistical learning theory is to study, in a statistical framework, the properties of learning algorithms. In particular, most results take the form of so-called error bounds. This tutorial introduces the techniques that are used to obtain such results.

1 Introduction

The main goal of statistical learning theory is to provide a framework for studying the problem of inference, that is, of gaining knowledge, making predictions, making decisions, or constructing models from a set of data. This is studied in a statistical framework, that is, there are assumptions of a statistical nature about the underlying phenomenon (about the way the data is generated).

As a motivation for the need for such a theory, let us just quote V. Vapnik:

(Vapnik, [1]) Nothing is more practical than a good theory.

Indeed, a theory of inference should be able to give a formal definition of words like learning, generalization, and overfitting, and also to characterize the performance of learning algorithms so that, ultimately, it may help design better learning algorithms. There are thus two goals: to make things more precise and to derive new or improved algorithms.

1.1 Learning and Inference

What is under study here is the process of inductive inference, which can roughly be summarized as the following steps:

1. Observe a phenomenon
2. Construct a model of that phenomenon
3. Make predictions using this model

Of course, this definition is very general and could be taken more or less as the goal of the Natural Sciences. The goal of Machine Learning is to actually automate this process, and the goal of Learning Theory is to formalize it.

In this tutorial we consider a special case of the above process: the supervised learning framework for pattern recognition. In this framework, the data consists of instance-label pairs, where the label is either +1 or −1. Given a set of such pairs, a learning algorithm constructs a function mapping instances to labels. This function should be such that it makes few mistakes when predicting the labels of unseen instances.

Of course, given some training data, it is always possible to build a function that fits the data exactly. But, in the presence of noise, this may not be the best thing to do, as it would lead to poor performance on unseen instances (this is usually referred to as overfitting).

Fig. 1. Trade-off between fit and complexity.

The general idea behind the design of learning algorithms is thus to look for regularities (in a sense to be defined later) in the observed phenomenon (i.e. the training data). These can then be generalized from the observed past to the future. Typically, one would look, in a collection of possible models, for one which fits the data well but at the same time is as simple as possible (see Figure 1). This immediately raises the question of how to measure and quantify the simplicity of a model (i.e. a {−1, +1}-valued function).
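The fit-versus-complexity trade-off sketched in Figure 1 can also be seen numerically. The following is a minimal sketch, not taken from the tutorial: it uses least-squares polynomial regression with NumPy as an illustrative stand-in for the paper's {−1, +1}-valued classifiers, and the data-generating function, noise level, and degrees are assumptions made for illustration only.

```python
# Minimal sketch (not from the tutorial): fit noisy data with polynomials of
# increasing degree and compare the error on the training set with the error
# on fresh data drawn from the same distribution.
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    x = rng.uniform(0.0, 1.5, size=n)
    y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(n)  # noisy "phenomenon"
    return x, y

x_train, y_train = sample(10)      # small training set
x_test, y_test = sample(1000)      # stands in for "unseen instances"

for degree in (1, 3, 9):           # increasing model complexity
    coeffs = np.polyfit(x_train, y_train, degree)              # least-squares fit
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_err:.3f}, test MSE {test_err:.3f}")
```

With 10 training points, the degree-9 polynomial interpolates the data (training error near zero) yet typically does much worse on the fresh sample than the degree-3 fit; this is the overfitting phenomenon described above.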
It turns out that there are many ways to measure the simplicity of a model, but no best one. For example, in Physics, people tend to prefer models which have a small number of constants and which correspond to simple mathematical formulas. Often, the length of the description of a model in a coding language can be an indication of its complexity. In classical statistics, the number of free parameters of a model is usually a measure of its complexity. Surprising as it may seem, there is no universal way of measuring simplicity (or its counterpart, complexity), and the choice of a specific measure inherently depends on the problem at hand. It is actually in this choice that the designer of the learning algorithm introduces knowledge about the specific phenomenon under study.

This lack of a universally best choice can actually be formalized in what is called the No Free Lunch theorem, which in essence says that, if there is no assumption on how the past (i.e. training data) is related to the future (i.e. test data), prediction is impossible. Even more, if there is no a priori restriction on the possible phenomena that are expected, it is impossible to generalize, and there is thus no algorithm that is better than all others (any algorithm would be beaten by another one on some phenomenon).

Hence the need to make assumptions, like the fact that the phenomenon we observe can be explained by a simple model. However, as we said, simplicity is not an absolute notion, and this leads to the statement that data cannot replace knowledge, or in pseudo-mathematical terms:

Generalization = Data + Knowledge

1.2 Assumptions

We now make more precise the assumptions that are made by the Statistical Learning Theory framework. Indeed, as we said before, we need to assume that the future (i.e. test) observations are related to the past (i.e. training) ones, so that the phenomenon is somewhat stationary.

At the core of the theory is a probabilistic model of the phenomenon (or data generation process). Within this model, the relationship between past and future observations is that they are both sampled independently from the same distribution (i.i.d.). The independence assumption means that each new observation yields maximum information. The identical distribution means that the observations give information about the underlying phenomenon (here a probability distribution).

An immediate consequence of this very general setting is that one can construct algorithms (e.g. k-nearest neighbors with an appropriate k) that are consistent, which means that, as one gets more and more data, the predictions of the algorithm get closer and closer to the optimal ones. So this seems to indicate that we can have some sort of universal algorithm. Unfortunately, any (consistent) algorithm can have arbitrarily bad behavior when given a finite training set. These notions are formalized in Appendix B.

Again, this discussion indicates that generalization can only come when one adds specific knowledge to the data. Each learning
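The consistency statement above (k-nearest neighbors with a suitably chosen k) can be checked empirically on a toy i.i.d. source. The sketch below is not from the tutorial: the data distribution, the rule k ≈ √n, and the helper functions are assumptions made for illustration. As k grows while k/n shrinks, the test error of the k-NN rule approaches the Bayes error of 0.1 built into the simulated distribution.

```python
# Minimal sketch (not from the tutorial): empirical consistency of k-nearest
# neighbors on i.i.d. data. The distribution and the rule k ~ sqrt(n) are
# illustrative assumptions; the Bayes error of this construction is 0.1.
import numpy as np

rng = np.random.default_rng(1)

def sample(n):
    # 1-D instances; label is +1 w.p. 0.9 when x > 0.5 and w.p. 0.1 otherwise,
    # so the best possible (Bayes) classifier errs with probability 0.1.
    x = rng.uniform(0.0, 1.0, size=n)
    p_plus = np.where(x > 0.5, 0.9, 0.1)
    y = np.where(rng.uniform(size=n) < p_plus, 1, -1)
    return x, y

def knn_predict(x_train, y_train, x_query, k):
    # Majority vote among the k nearest training instances (ties go to +1).
    dists = np.abs(x_train[None, :] - x_query[:, None])
    nearest = np.argsort(dists, axis=1)[:, :k]
    votes = y_train[nearest].sum(axis=1)
    return np.where(votes >= 0, 1, -1)

x_test, y_test = sample(2000)
for n in (10, 100, 1000, 5000):
    x_train, y_train = sample(n)
    k = max(1, int(np.sqrt(n)))        # k -> infinity while k/n -> 0
    err = np.mean(knn_predict(x_train, y_train, x_test, k) != y_test)
    print(f"n = {n:4d}, k = {k:2d}: test error {err:.3f}  (Bayes error 0.1)")
```

The caveat from the text still applies: the guarantee is only asymptotic, and for any fixed training-set size one can construct a distribution on which this (or any other consistent) rule behaves badly.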

