UCLA STAT 231 - Lecture18-PAC

Lecture notes for Stat 231: Pattern Recognition and Machine Learning
A.L. Yuille. Fall 2004.

Contents:
1. Stat 231. A.L. Yuille. Fall 2004
2. Induction: History
3. Risk and Empirical Risk
4. Risk and Empirical Risk
5. PAC
6. PAC
7. PAC
8. VC for Margins
9. VC Margin Hyperplanes
10. VC Margin: Kernels
11. Generalizability for Kernels
12. Generalization for Kernels
13. Structural Risk Minimization
14. Structural Risk Minimization
15. Structural Risk Minimization
16. Summary

1. Stat 231. A.L. Yuille. Fall 2004
PAC Learning and Generalizability.
Margin Errors.
Structural Risk Minimization.

2. Induction: History
Francis Bacon described empiricism: formulate hypotheses and test them by experiment. English Empiricist School of Philosophy.
David Hume. Scottish. Scepticism: "Why should the Sun rise tomorrow just because it always has?"
Karl Popper. The Logic of Scientific Discovery. Falsifiability Principle: "A hypothesis is useless unless it can be disproven."

3. Risk and Empirical Risk
Risk: R(α) = ∫ L(y, f(x, α)) dP(x, y).
Specialize to two classes (M = 2), with the loss function counting misclassifications, i.e. L(y, f(x, α)) = (1/2)|y − f(x, α)| for labels y ∈ {−1, +1}.
Empirical risk on a dataset {(x_i, y_i) : i = 1, ..., n}: R_emp(α) = (1/n) Σ_{i=1}^{n} (1/2)|y_i − f(x_i, α)|.
{f(·, α) : α ∈ Λ} is the set of learning machines (e.g. all thresholded hyperplanes).

4. Risk and Empirical Risk
Key concept: the Vapnik-Chervonenkis (VC) dimension h.
The VC dimension is a function of the set of classifiers. It is independent of the distribution P(x, y) of the data.
The VC dimension is a measure of the "degrees of freedom" of the set of classifiers. Intuitively, the size n of the dataset must be larger than the VC dimension before you can learn.
E.g. Cover's theorem: hyperplanes in d dimensions need at least 2(d + 1) samples before a separating dichotomy is unlikely to have arisen by chance.

5. PAC
Probably Approximately Correct (PAC).
If h < n, where h is the VC dimension of the classifier set, then with probability at least 1 − η,
R(α) ≤ R_emp(α) + ε(n, h, η), where ε(n, h, η) = sqrt( [h (ln(2n/h) + 1) + ln(4/η)] / n ).
For hyperplanes in d dimensions, h = d + 1.

6. PAC
Generalizability: small empirical risk implies, with high probability, small risk, provided ε(n, h, η) is small.
Probably Approximately Correct (PAC): "probably" because we can never be completely sure that we haven't been misled by rare samples.
In practice, require h/n to be small, with η small.

7. PAC
This is the basic machine learning result. There are a number of variants.
The VC dimension is one measure of the capacity of the set of classifiers. Other measures give tighter bounds but are harder to compute: the annealed VC entropy and the growth function.
The VC dimension is d + 1 for thresholded hyperplanes. It can also be bounded nicely for separable kernels (later this lecture).
A forthcoming lecture will sketch the derivation of PAC. It makes use of the probability of rare events (e.g. Cramér's theorem, Sanov's theorem).
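As a quick numerical illustration of the bound on slide 5, here is a minimal Python sketch. It assumes the standard Vapnik form of the capacity term quoted above; the function name vc_confidence and the example values of d, n and η are illustrative, not from the slides.

```python
import math

def vc_confidence(n, h, eta):
    """Capacity term epsilon(n, h, eta): with probability >= 1 - eta,
    R(alpha) <= R_emp(alpha) + epsilon(n, h, eta)  (standard Vapnik form, assumed)."""
    assert h < n, "the bound is only meaningful when the VC dimension h < n"
    return math.sqrt((h * (math.log(2.0 * n / h) + 1.0) + math.log(4.0 / eta)) / n)

# Thresholded hyperplanes in d dimensions have VC dimension h = d + 1.
d = 9                # illustrative input dimension
h = d + 1
eta = 0.05           # we want the bound to hold with probability >= 95%

for n in (100, 1_000, 10_000, 100_000):
    print(f"n = {n:>6}, h/n = {h / n:.4f}, capacity term = {vc_confidence(n, h, eta):.3f}")
```

With h fixed, the capacity term only becomes small once n is much larger than h, which is slide 6's point about keeping h/n small.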
8. VC for Margins
The VC dimension is the largest number of data points that can be shattered by the classifier set. "Shattered" means that every possible dichotomy of those points can be realized by some classifier in the set (c.f. Cover's hyperplane result).
The VC dimension is d + 1 for thresholded hyperplanes in d dimensions.
But we can get tighter VC bounds by considering margins.
These bounds can be extended directly to kernel hyperplanes.

9. VC Margin Hyperplanes
Hyperplanes f(x) = sign(w · x + b). The (w, b) are normalized with respect to the data so that min_i |w · x_i + b| = 1.
Then the set of classifiers satisfying ||w|| ≤ A has VC dimension h satisfying h ≤ min(R²A², d) + 1,
where R is the radius of the smallest sphere containing the data points. Recall 1/||w|| is the margin, so the margin is at least 1/A.
Enforcing a large margin effectively limits the VC dimension.

10. VC Margin: Kernels
The same technique applies to kernels.
Claim: finding the radius R of the minimum sphere that encloses the data depends on the feature vectors only through the kernel (the kernel trick).
Primal: minimize R² subject to ||φ(x_i) − c||² ≤ R² for all i, with Lagrange multipliers λ_i ≥ 0.
Dual: maximize Σ_i λ_i K(x_i, x_i) − Σ_{i,j} λ_i λ_j K(x_i, x_j) subject to λ_i ≥ 0 and Σ_i λ_i = 1.
Depends on dot products only!

11. Generalizability for Kernels
The capacity term is a monotonic function of h.
Use the margin VC bound to decide which kernels will do best for learning the US Post Office handwritten-digit dataset.
For each kernel choice, solve the dual problem to estimate R.
Assume that the empirical risk is negligible, because it is possible to classify the digits correctly using kernels (but not with a linear classifier).
This predicts that the fourth-order kernel has the best generalization, which compares nicely with the results of the classifiers when tested.

12. Generalization for Kernels

13. Structural Risk Minimization
Standard learning says: pick the α that minimizes the empirical risk.
Traditional approach: use cross-validation to determine whether the learned classifier is generalizing.
VC theory says: evaluate the bound R(α) ≤ R_emp(α) + ε(n, h, η), and ensure there are enough samples that ε(n, h, η) is small.
Alternative: Structural Risk Minimization. Divide the set of classifiers into a nested hierarchy of sets S_1 ⊂ S_2 ⊂ ... ⊂ S_p ⊂ ..., with corresponding VC dimensions h_1 ≤ h_2 ≤ ... ≤ h_p ≤ ...

14. Structural Risk Minimization
Select classifiers to minimize: Empirical Risk + Capacity Term.
The capacity term determines the "generalizability" of the classifier.
Increasing the amount of training data allows you to increase p and use a richer class of classifiers.
Is the bound tight enough?

15. Structural Risk Minimization

16. Summary
PAC learning and the VC dimension.
The VC dimension is a measure of the capacity of the set of classifiers.
The risk is bounded by the empirical risk plus a capacity term.
VC dimensions can be bounded for linear and kernel classifiers by the margin concept.
This can ...
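To make slides 9-11 concrete, here is a hedged Python sketch that solves the dual of the minimum-enclosing-sphere problem for a chosen kernel and forms the R²/ρ² capacity estimate. The RBF kernel, the Frank-Wolfe-style solver, the toy data, and the stand-in margin value ρ are my own illustrative assumptions, not part of the lecture.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=0.5):
    """Gaussian RBF kernel matrix (illustrative kernel choice)."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

def min_enclosing_sphere_radius(K, iters=2000):
    """Radius R of the smallest sphere enclosing the data in feature space.

    Solves the dual from slide 10,
        max_lambda  sum_i lambda_i K_ii - sum_ij lambda_i lambda_j K_ij
        s.t.        lambda_i >= 0,  sum_i lambda_i = 1,
    with simple Frank-Wolfe steps on the simplex; the dual value at the
    optimum equals R^2."""
    n = K.shape[0]
    lam = np.full(n, 1.0 / n)
    diag = np.diag(K)
    for t in range(iters):
        grad = diag - 2.0 * (K @ lam)    # gradient of the dual objective
        i = int(np.argmax(grad))         # best simplex vertex to move towards
        step = 2.0 / (t + 2.0)
        lam = (1.0 - step) * lam
        lam[i] += step
    r_squared = float(diag @ lam - lam @ K @ lam)
    return np.sqrt(max(r_squared, 0.0))

# Illustrative usage: compare the margin-based capacity estimate R^2 / rho^2
# (slide 9's bound, up to the min with d and the +1) for two kernel widths.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))            # toy data standing in for real features
rho = 0.25                               # pretend margin from a trained SVM (illustrative)

for gamma in (0.1, 1.0):
    K = rbf_kernel(X, X, gamma=gamma)
    R = min_enclosing_sphere_radius(K)
    print(f"gamma = {gamma}: R = {R:.3f}, R^2 / rho^2 = {(R / rho) ** 2:.1f}")
```

Slide 11's recipe is the same idea: for each candidate kernel, estimate R from the dual, combine it with the margin, and prefer the kernel with the smaller capacity estimate.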


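Finally, a toy sketch of the structural-risk-minimization recipe from slides 13-14: among a nested hierarchy of classifier sets, pick the one minimizing empirical risk plus capacity term. The hierarchy, its VC dimensions, and the empirical risks below are made-up illustrative numbers, and the capacity term again assumes the standard Vapnik form.

```python
import math

def vc_confidence(n, h, eta=0.05):
    """Capacity term epsilon(n, h, eta), standard Vapnik form (assumed)."""
    return math.sqrt((h * (math.log(2.0 * n / h) + 1.0) + math.log(4.0 / eta)) / n)

def srm_select(candidates, n, eta=0.05):
    """candidates: (name, vc_dim, empirical_risk) for nested sets S_1, S_2, ...
    Returns the candidate minimizing the bound R_emp + epsilon(n, h, eta)."""
    return min(candidates, key=lambda c: c[2] + vc_confidence(n, c[1], eta))

# Made-up hierarchy: richer sets fit the training data better but cost more capacity.
candidates = [
    ("S1: linear",           11,   0.12),
    ("S2: quadratic kernel",  60,  0.05),
    ("S3: quartic kernel",   400,  0.02),
    ("S4: very rich kernel", 5000, 0.00),
]

name, h, r_emp = srm_select(candidates, n=10_000)
print(f"SRM picks {name}: bound = {r_emp + vc_confidence(10_000, h):.3f}")
```

With these made-up numbers the capacity term dominates for the richer sets, so the simplest set wins; more training data would shift the choice towards richer sets, which is slide 14's point.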
