1Nov 20th, 2001Copyright © 2001, Andrew W. MooreVC-dimension for characterizing classifiers Andrew W. MooreAssociate ProfessorSchool of Computer ScienceCarnegie Mellon Universitywww.cs.cmu.edu/[email protected] to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew’s tutorials: http://www.cs.cmu.edu/~awm/tutorials. Comments and corrections gratefully received. Copyright © 2001, Andrew W. Moore VC-dimension: Slide 2A learning machine• A learning machine f takes an input x and transforms it, somehow using weights α, into a predicted output yest= +/- 1f xαyestα is some vector of adjustable parameters2Copyright © 2001, Andrew W. Moore VC-dimension: Slide 3Examplesf xαyestf(x,b) = sign(x.x – b)denotes +1denotes -1Copyright © 2001, Andrew W. Moore VC-dimension: Slide 4Examplesf xαyestdenotes +1denotes -1f(x,w) = sign(x.w)3Copyright © 2001, Andrew W. Moore VC-dimension: Slide 5Examplesf xαyestf(x,w,b) = sign(x.w+b)denotes +1denotes -1Copyright © 2001, Andrew W. Moore VC-dimension: Slide 6How do we characterize “power”?• Different machines have different amounts of “power”.• Tradeoff between:• More power: Can model more complex classifiers but might overfit.• Less power: Not going to overfit, but restricted in what it can model.• How do we characterize the amount of power?4Copyright © 2001, Andrew W. Moore VC-dimension: Slide 7Some definitions• Given some machine f• And under the assumption that all training points (xk,yk) were drawn i.i.d from some distribution.• And under the assumption that future test points will be drawn from the same distribution• DefineicationMisclassifofy Probabilit),(21)(TESTERR)( =−== ααα xfyEROfficial terminologyTerminology we’ll useCopyright © 2001, Andrew W. Moore VC-dimension: Slide 8Some definitions• Given some machine f• And under the assumption that all training points (xk,yk) were drawn i.i.d from some distribution.• And under the assumption that future test points will be drawn from the same distribution• DefineicationMisclassifofy Probabilit),(21)(TESTERR)( =−== ααα xfyEROfficial terminologyTerminology we’ll useiedmisclassifSet TrainingFraction ),(211)(TRAINERR)(1=−==∑=RkkkempxfyRR αααR = #training set data points5Copyright © 2001, Andrew W. Moore VC-dimension: Slide 9Vapnik-Chervonenkis dimension• Given some machine f, let h be its VC dimension.• h is a measure of f’s power (h does not depend on the choice of training set)• Vapnik showed that with probability 1-η−= ),(21)(TESTERR αα xfyE∑=−=RkkkxfyR1),(211)(TRAINERR ααRhRh )4/log()1)/2(log()(TRAINERR)(TESTERRηαα−++≤This gives us a way to estimate the error on future data based only on the training error and the VC-dimension of fCopyright © 2001, Andrew W. Moore VC-dimension: Slide 10What VC-dimension is used for• Given some machine f, let h be its VC dimension.• h is a measure of f’s power.• Vapnik showed that with probability 1-η−= ),(21)(TESTERR αα xfyE∑=−=RkkkxfyR1),(211)(TRAINERR ααRhRh )4/log()1)/2(log()(TRAINERR)(TESTERRηαα−++≤This gives us a way to estimate the error on future data based only on the training error and the VC-dimension of fBut given machine f, how do we define and compute h?6Copyright © 2001, Andrew W. Moore VC-dimension: Slide 11Shattering• Machine f can shatter a set of points x1, x2.. xrif and only if…For every possible training set of the form (x1,y1) , (x2,y2) ,… (xr ,yr)…There exists some value of α that gets zero training error.There are 2rsuch training sets to consider, each with a different combination of +1’s and –1’s for the y’sCopyright © 2001, Andrew W. Moore VC-dimension: Slide 12Shattering• Machine f can shatter a set of points x1, x2.. Xrif and only if…For every possible training set of the form (x1,y1) , (x2,y2) ,… (xr ,yr)…There exists some value of α that gets zero training error.• Question: Can the following f shatter the following points?f(x,w) = sign(x.w)7Copyright © 2001, Andrew W. Moore VC-dimension: Slide 13Shattering• Machine f can shatter a set of points x1, x2.. Xrif and only if…For every possible training set of the form (x1,y1) , (x2,y2) ,… (xr ,yr)…There exists some value of α that gets zero training error.• Question: Can the following f shatter the following points?f(x,w) = sign(x.w)• Answer: No problem. There are four training sets to considerw=(0,1) w=(0,-1)w=(2,-3)w=(-2,3)Copyright © 2001, Andrew W. Moore VC-dimension: Slide 14Shattering• Machine f can shatter a set of points x1, x2.. Xrif and only if…For every possible training set of the form (x1,y1) , (x2,y2) ,… (xr ,yr)…There exists some value of α that gets zero training error.• Question: Can the following f shatter the following points?f(x,b) = sign(x.x-b)8Copyright © 2001, Andrew W. Moore VC-dimension: Slide 15Shattering• Machine f can shatter a set of points x1, x2.. Xrif and only if…For every possible training set of the form (x1,y1) , (x2,y2) ,… (xr ,yr)…There exists some value of α that gets zero training error.• Question: Can the following f shatter the following points?f(x,b) = sign(x.x-b)• Answer: No way my friend. Copyright © 2001, Andrew W. Moore VC-dimension: Slide 16Definition of VC dimensionGiven machine f, the VC-dimension h isThe maximum number of points that can be arranged so that f shatter them. Example: What’s VC dimension of f(x,b) = sign(x.x-b)9Copyright © 2001, Andrew W. Moore VC-dimension: Slide 17VC dim of trivial circleGiven machine f, the VC-dimension h isThe maximum number of points that can be arranged so that f shatter them. Example: What’s VC dimension of f(x,b) = sign(x.x-b)Answer = 1: we can’t even shatter two points! (but it’s clear we can shatter 1)Copyright © 2001, Andrew W. Moore VC-dimension: Slide 18Reformulated circleGiven machine f, the VC-dimension h isThe maximum number of points that can be arranged so that f shatter them. Example: For 2-d inputs, what’s VC dimension of f(x,q,b) =
View Full Document