PAC-learning, VC Dimension and Margin-based Bounds
Machine Learning – 10701/15781
Carlos Guestrin
Carnegie Mellon University
February 28th, 2005

Review: Generalization error in finite hypothesis spaces [Haussler '88]
- Theorem: for a finite hypothesis space H, a dataset D with m i.i.d. samples, and 0 < ε < 1, any learned hypothesis h that is consistent with the training data satisfies
  P(error_true(h) > ε) ≤ |H| e^(−mε)
- Even if h makes zero errors on the training data, it may still make errors on the test data.

Using a PAC bound
- Typically, two use cases (set |H| e^(−mε) ≤ δ and solve):
  1. Pick ε and δ; the bound gives you m: m ≥ (ln|H| + ln(1/δ)) / ε
  2. Pick m and δ; the bound gives you ε: ε ≥ (ln|H| + ln(1/δ)) / m

Limitations of Haussler '88 bound
- Requires a consistent classifier (zero training error).
- Depends on the size of the hypothesis space, |H|.

What if our classifier does not have zero error on the training data?
- A learner with zero training error may make mistakes on the test set.
- A learner with training error error_D(h) may make even more mistakes on the test set.

Simpler question: What's the expected error of a hypothesis?
- Estimating the error of a hypothesis is like estimating the parameter of a coin!
- Chernoff bound: for m i.i.d. coin flips x_1, …, x_m with x_i ∈ {0,1} and P(x_i = 1) = θ, and for 0 < ε < 1:
  P(θ − (1/m) Σ_i x_i > ε) ≤ e^(−2mε²)

Using Chernoff bound to estimate error of a single hypothesis
- Treat each evaluation of h on a sample as a coin flip (correct/incorrect):
  P(error_true(h) − error_train(h) > ε) ≤ e^(−2mε²)

But we are comparing many hypotheses: Union bound
- For each hypothesis h_i: P(error_true(h_i) − error_train(h_i) > ε) ≤ e^(−2mε²)
- What if I am comparing two hypotheses, h_1 and h_2? By the union bound, P(A or B) ≤ P(A) + P(B), so the failure probabilities add across hypotheses.

Generalization bound for |H| hypotheses
- Theorem: for a finite hypothesis space H, a dataset D with m i.i.d. samples, and 0 < ε < 1, for any learned hypothesis h:
  P(error_true(h) − error_train(h) > ε) ≤ |H| e^(−2mε²)

PAC bound and Bias-Variance tradeoff
- Important: the PAC bound holds for all h, but it doesn't guarantee that the algorithm finds the best h!!!
- Or, after moving some terms around, with probability at least 1 − δ:
  error_true(h) ≤ error_train(h) + √((ln|H| + ln(1/δ)) / (2m))
- The training-error term plays the role of bias; the square-root term plays the role of variance.

What about the size of the hypothesis space?
- How large is the hypothesis space? Solving the bound for the sample size:
  m ≥ (1/(2ε²)) (ln|H| + ln(1/δ))

Boolean formulas with n binary features
- There are 2^(2^n) distinct Boolean functions of n binary features (one output bit per row of the truth table), so ln|H| = 2^n ln 2 — exponential in n.

Number of decision trees of depth k
- Recursive solution: given n attributes, let H_k = number of decision trees of depth k.
- H_0 = 2
- H_(k+1) = (#choices of root attribute) × (#possible left subtrees) × (#possible right subtrees) = n · H_k · H_k
- Write L_k = log₂ H_k. Then L_0 = 1 and L_(k+1) = log₂ n + 2 L_k,
  so L_k = (2^k − 1)(1 + log₂ n) + 1.

PAC bound for decision trees of depth k
- Plugging ln|H| = L_k ln 2 into the sample-size bound:
  m ≥ (1/(2ε²)) (ln 2 · ((2^k − 1)(1 + log₂ n) + 1) + ln(1/δ))
- Bad!!! The number of points needed is exponential in the depth! (The two short sketches below make both of these calculations concrete.)
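To make the two use cases of the PAC bound concrete, here is a minimal Python sketch; the function names are my own, and the formulas are the agnostic bound m ≥ (ln|H| + ln(1/δ)) / (2ε²) derived above.

```python
import math

def pac_sample_size(ln_H, eps, delta):
    """Use case 1: pick eps and delta, get the required sample size m
    from m >= (ln|H| + ln(1/delta)) / (2 eps^2)."""
    return math.ceil((ln_H + math.log(1.0 / delta)) / (2.0 * eps ** 2))

def pac_error(ln_H, m, delta):
    """Use case 2: pick m and delta, get the error tolerance eps
    from eps = sqrt((ln|H| + ln(1/delta)) / (2m))."""
    return math.sqrt((ln_H + math.log(1.0 / delta)) / (2.0 * m))

# Boolean formulas over n = 10 binary features: ln|H| = 2^n * ln 2.
ln_H = (2 ** 10) * math.log(2)
print(pac_sample_size(ln_H, eps=0.1, delta=0.05))  # 35639 samples
print(pac_error(ln_H, m=100_000, delta=0.05))      # eps ~ 0.06
```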
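And a sketch checking the decision-tree recursion against its closed form; the function names are mine, the recurrence is the one derived above.

```python
import math

def log2_num_trees(k, n):
    """L_k = log2(# decision trees of depth k over n attributes),
    computed via the recursion L_0 = 1, L_{k+1} = log2(n) + 2 L_k."""
    L = 1.0  # a depth-0 tree is a single leaf labeled 0 or 1, so H_0 = 2
    for _ in range(k):
        L = math.log2(n) + 2 * L
    return L

def log2_num_trees_closed(k, n):
    """Closed form from the derivation: L_k = (2^k - 1)(1 + log2 n) + 1."""
    return (2 ** k - 1) * (1 + math.log2(n)) + 1

n, k = 20, 10
print(log2_num_trees(k, n))         # ~5445.3: |H| is about 2^5445
print(log2_num_trees_closed(k, n))  # same value -- the closed form checks out
```

Needing on the order of 2^k ln n samples makes depth a poor complexity measure; the number of leaves, considered next, behaves much better.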
But for m data points, the decision tree can't get too big: the number of leaves is never more than the number of data points.

Number of decision trees with k leaves
- H_k = number of decision trees with k leaves; H_1 = 2 (a single leaf, labeled 0 or 1).
- Loose bound: a binary tree with k leaves has k − 1 internal splits; there are at most 4^(k−1) tree shapes (Catalan bound), n attribute choices per split, and 2 labels per leaf, so
  H_k ≤ 2^k (4n)^(k−1), i.e. ln H_k = O(k ln n) — linear in k, not exponential.
- Reminder: m ≥ (1/(2ε²)) (ln|H| + ln(1/δ))

PAC bound for decision trees with k leaves – Bias-Variance revisited
- Plugging in: m ≥ (1/(2ε²)) (O(k ln n) + ln(1/δ)) — the number of samples needed is linear in the number of leaves.
- With k = m leaves the tree can always fit the training data (no bias, lots of variance); with k < m it cannot (some bias, less variance).

What did we learn from decision trees?
- The bias-variance tradeoff, formalized.
- Moral of the story: the complexity of learning is measured not by the size of the hypothesis space, but by the maximum number of points that allows consistent classification.
- Complexity m: no bias, lots of variance.
- Lower than m: some bias, less variance.

What about continuous hypothesis spaces?
- Continuous hypothesis space: |H| = ∞. Infinite variance???
- As with decision trees, we only care about the maximum number of points that can be classified exactly!

How many points can a linear boundary classify exactly? (1-D)
- 2: a threshold realizes every labeling of 2 points, but not of 3 (e.g., +, −, + along the line).

How many points can a linear boundary classify exactly? (2-D)
- 3 (in general position): a line realizes every labeling of 3 non-collinear points, but not of 4 (the XOR layout).

How many points can a linear boundary classify exactly? (d-D)
- d + 1.

PAC bound using VC dimension
- The number of training points that can be classified exactly is the VC dimension!!!
- It measures the relevant size of the hypothesis space, as the number of leaves did for decision trees.
- With probability at least 1 − δ:
  error_true(h) ≤ error_train(h) + √((VC(H) (ln(2m/VC(H)) + 1) + ln(4/δ)) / m)

Shattering a set of points
- H shatters a set of points if, for every possible labeling of the points, some h ∈ H classifies all of them correctly.

VC dimension
- The VC dimension of H is the size of the largest set of points that H can shatter.

Examples of VC dimension
- Linear classifiers: VC(H) = d + 1, for d features plus the constant term b.
- Neural networks: VC(H) = #parameters. (Local minima mean NNs will probably not find the best parameters anyway.)
- 1-Nearest neighbor? It classifies any set of distinct training points exactly, so VC(H) = ∞.

PAC bound for SVMs
- SVMs use a linear classifier.
- For d features, VC(H) = d + 1: plug that into the VC bound above.

VC dimension and SVMs: Problems!!!
- What about kernels?
  - Polynomials: with n input features and a polynomial of degree p, the number of features grows really fast (on the order of n^p) ⇒ a bad bound.
  - Gaussian kernels can classify any set of points exactly ⇒ VC(H) = ∞.
- The bound doesn't take the margin into account.

Margin-based VC dimension
- H: the class of linear classifiers w·Φ(x) (with b = 0).
- Canonical form: min_j |w·Φ(x_j)| = 1, so the margin is γ = 1/‖w‖.
- VC(H) ≤ R² (w·w) = R²/γ² — it doesn't depend on the number of features!!!
- R² = max_j Φ(x_j)·Φ(x_j): the magnitude of the data.
- R² is bounded even for Gaussian kernels → bounded VC dimension.
- Large margin ⇒ low w·w ⇒ low VC dimension. Very cool!

Applying margin VC to SVMs?
- VC(H) ≤ R² (w·w), and R² = max_j Φ(x_j)·Φ(x_j) doesn't depend on the choice of w.
- SVMs minimize w·w, so do SVMs minimize the VC dimension to get the best bound?
- Not quite right: the bound assumes the VC dimension is chosen before looking at the data; handling it afterwards would require a union bound over an infinite number of possible VC dimensions…
- But it can be fixed!

Structural risk minimization theorem
- For a family of hyperplanes with margin γ > 0 and w·w ≤ 1, with probability at least 1 − δ the true error is bounded by the training error plus a complexity term that shrinks as the margin γ grows (and grows with R²).
- SVMs maximize the margin γ while minimizing the hinge loss.
- They thereby optimize the tradeoff between training error (bias) and margin γ (variance).

Reality check – Bounds are loose
[Figure: ε versus m (in 10⁵) for d = 2, 20, 200, 2000 — the bound decays slowly in m and grows with d.]
- The bound can be very loose (the sketches after the links below put numbers on it). Why should you care?
- There are tighter, albeit more complicated, bounds.

More details
- General: http://www.learning-with-kernels.org/
- Example of more complex bounds: http://www.research.ibm.com/people/t/tzhang/papers/jmlr02_cover.ps.gz
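As a numerical companion to the "Reality check" slide, here is a minimal Python sketch that evaluates the gap term of the VC bound above for linear classifiers with d = 2, 20, 200, 2000; the sample sizes are my own choices, not from the lecture.

```python
import math

def vc_bound_gap(vc, m, delta=0.05):
    """Gap term of the VC bound:
    error_true <= error_train + sqrt((vc*(ln(2m/vc) + 1) + ln(4/delta)) / m)."""
    return math.sqrt((vc * (math.log(2 * m / vc) + 1) + math.log(4 / delta)) / m)

for d in (2, 20, 200, 2000):
    vc = d + 1  # VC dimension of a linear classifier with d features
    gaps = [round(vc_bound_gap(vc, m), 3) for m in (10_000, 100_000, 1_000_000)]
    print(d, gaps)
# Even at m = 1,000,000, d = 2000 still gives a gap of about 0.13:
# the bound is loose, but it shows how complexity trades off against data.
```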
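Finally, a sketch of the margin-based quantity R²(w·w) from the "Margin-based VC dimension" slide, for the identity feature map Φ(x) = x; the toy data and the canonical-form rescaling are illustrative assumptions, not from the lecture.

```python
import numpy as np

def margin_vc_upper_bound(w, X):
    """R^2 * (w.w) after rescaling w to canonical form min_j |w.x_j| = 1.
    Upper-bounds the VC dimension independently of the feature count."""
    w = w / np.abs(X @ w).min()       # canonical form: min_j |w.x_j| = 1
    R2 = (X ** 2).sum(axis=1).max()   # R^2 = max_j x_j.x_j
    return R2 * (w @ w)               # = R^2 / gamma^2, since gamma = 1/||w||

# Toy data, well separated by w: a larger margin means a smaller w.w
# after rescaling, hence a smaller bound -- regardless of dimension.
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.5]])
w = np.array([0.5, 0.5])
print(margin_vc_upper_bound(w, X))  # ~1.39
```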