PAC-learning, VC Dimension and Margin-based Bounds
Machine Learning – 10701/15781
Carlos Guestrin, Carnegie Mellon University
March 6th, 2006
©2006 Carlos Guestrin

Outline
- Announcements 1
- Announcements 2
- What now…
- How likely is the learner to pick a bad hypothesis?
- Union bound
- How likely is the learner to pick a bad hypothesis?
- Review: Generalization error in finite hypothesis spaces [Haussler '88]
- Using a PAC bound
- Review: Generalization error in finite hypothesis spaces [Haussler '88]
- Limitations of Haussler '88 bound
- Simpler question: What's the expected error of a hypothesis?
- But we are comparing many hypotheses: Union bound
- Generalization bound for |H| hypotheses
- PAC bound and Bias-Variance tradeoff
- What about the size of the hypothesis space?
- Boolean formulas with n binary features
- Number of decision trees of depth k
- PAC bound for decision trees of depth k
- Number of decision trees with k leaves
- PAC bound for decision trees with k leaves – Bias-Variance revisited
- What did we learn from decision trees?
- What about continuous hypothesis spaces?
- How many points can a linear boundary classify exactly? (1-D)
- How many points can a linear boundary classify exactly? (2-D)
- How many points can a linear boundary classify exactly? (d-D)
- Shattering a set of points
- VC dimension
- PAC bound using VC dimension
- Examples of VC dimension
- Another VC dim. example
- PAC bound for SVMs
- VC dimension and SVMs: Problems!!!
- Margin-based VC dimension
- Applying margin VC to SVMs?
- Structural risk minimization theorem
- Reality check – Bounds are loose
- What you need to know
- Big Picture
- What you have learned thus far
- Review material in terms of…
- Text Classification
- Function fitting
- Monitoring a complex system
- Types of learning problems
- The learning problem
- Comparing learning algorithms
- Naïve Bayes versus Logistic regression
- Naïve Bayes versus Logistic regression – Classification as density estimation
- Logistic regression versus Boosting
- Linear classifiers – Logistic regression versus SVMs
- What's the difference between SVMs and Logistic Regression? (Revisited again)
- SVMs and instance-based learning
- Instance-based learning versus Decision trees
- Logistic regression versus Neural nets
- Linear regression versus Kernel regression
- Kernel-weighted linear regression
- SVM regression
- BIG PICTURE (a few points of comparison)

More details:
- General: http://www.learning-with-kernels.org/
- Example of more complex bounds: http://www.research.ibm.com/people/t/tzhang/papers/jmlr02_cover.ps.gz

Announcements 1
- Midterm on Wednesday
- Open book: texts, notes, …
- No laptops
- Bring a calculator

Announcements 2
- Final project details are out!!! http://www.cs.cmu.edu/~guestrin/Class/10701/projects.html
- Great opportunity to apply ideas from class and learn more
- Example project:
  - Take a dataset
  - Define a learning task
  - Apply learning algorithms
  - Design your own extension
  - Evaluate your ideas
- Many suggestions are on the webpage, but you can also do your own
- Boring stuff:
  - Individually or in groups of two students
  - It's worth 20% of your final grade
  - You need to submit a one-page proposal on Wed. 3/22 (just after the break)
  - A 5-page initial write-up (milestone) is due on 4/12 (20% of the project grade)
  - An 8-page final write-up is due 5/8 (60% of the grade)
  - A poster session for all students will be held on Friday 5/5, 2-5pm in the NSH atrium (20% of the grade)
  - You can use late days on write-ups; each student in the team will be charged a late day per day
- MOST IMPORTANT: …

What now…
- We have explored many ways of learning from data
- But…
  - How good is our classifier, really?
  - How much data do I need to make it "good enough"?

How likely is the learner to pick a bad hypothesis?
- Prob. that a single h with error_true(h) ≥ ε gets m i.i.d. data points right: at most (1 − ε)^m ≤ e^(−εm)
- There are k hypotheses consistent with the data
- How likely is the learner to pick a bad one? By the union bound, at most k · e^(−εm)

Union bound
- P(A or B or C or D or …) ≤ P(A) + P(B) + P(C) + P(D) + …

Review: Generalization error in finite hypothesis spaces [Haussler '88]
- Theorem: Hypothesis space H finite, dataset D with m i.i.d. samples, 0 < ε < 1: for any learned hypothesis h that is consistent with the training data,
  P(error_true(h) > ε) ≤ |H| e^(−mε)
- Even if h makes zero errors on the training data, it may still make errors on test data

Using a PAC bound
- Typically, 2 use cases:
  - 1: Pick ε and δ, and the bound gives you m
  - 2: Pick m and δ, and the bound gives you ε
- (a minimal code sketch of both use cases is given at the end of this section)

Limitations of Haussler '88 bound
- Requires a consistent classifier (zero training error)
- Depends on the size |H| of the hypothesis space, which must be finite

Simpler question: What's the expected error of a hypothesis?
- The error of a hypothesis is like estimating the parameter of a coin!
- Chernoff bound: for m i.i.d. coin flips x_1, …, x_m, where x_i ∈ {0, 1} and P(x_i = 1) = θ, and for 0 < ε < 1:
  P(θ − (1/m) Σ_i x_i > ε) ≤ e^(−2mε²)

But we are comparing many hypotheses: Union bound
- For each hypothesis h_i:
  P(error_true(h_i) − error_train(h_i) > ε) ≤ e^(−2mε²)
- What if I am comparing two hypotheses, h_1 and h_2? By the union bound, the probability that either one deviates by more than ε is at most 2 e^(−2mε²)

Generalization bound for |H| hypotheses
- Theorem: Hypothesis space H finite, dataset D with m i.i.d. samples, 0 < ε < 1: for any learned hypothesis h,
  P(error_true(h) − error_train(h) > ε) ≤ |H| e^(−2mε²)

PAC bound and Bias-Variance tradeoff
- Or, after moving some terms around, with probability at least 1 − δ:
  error_true(h) ≤ error_train(h) + sqrt( (ln|H| + ln(1/δ)) / (2m) )
- A richer hypothesis space drives the training-error (bias-like) term down but makes the complexity (variance-like) term grow
- Important: the PAC bound holds for all h, but it doesn't guarantee that the algorithm finds the best h!!!

What about the size of the hypothesis space?
- How large is the hypothesis space?

Boolean formulas with n binary features
- There are |H| = 2^(2^n) distinct Boolean functions of n binary features, so ln|H| = 2^n ln 2: the bound requires a number of samples exponential in n

Number of decision trees of depth k
- Recursive solution, given n attributes:
  - H_k = number of decision trees of depth k
  - H_0 = 2
  - H_(k+1) = (# choices of root attribute) × (# possible left subtrees) × (# possible right subtrees) = n · H_k · H_k
- Write L_k = log2 H_k:
  - L_0 = 1
  - L_(k+1) = log2 n + 2 L_k
  - So L_k = (2^k − 1)(1 + log2 n) + 1

PAC bound for decision trees of depth k
- Plugging ln|H_k| = L_k ln 2 into the consistent-learner bound:
  m ≥ (1/ε) [ ((2^k − 1)(1 + log2 n) + 1) ln 2 + ln(1/δ) ]
- Bad!!! Number of points is exponential in depth! (checked numerically in the sketch below)
- But, for m data points, a decision tree can't get too …
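To make the "exponential in depth" point concrete, here is a minimal numerical check. It is not from the lecture: it assumes the consistent-learner requirement m ≥ (ln|H_k| + ln(1/δ))/ε with ln|H_k| = L_k ln 2 from the recursion above; the function names and the example numbers (n = 20 attributes, ε = 0.1, δ = 0.05) are illustrative choices.

```python
import math


def log2_num_trees(depth: int, n_attributes: int) -> float:
    """L_k = log2 |H_k| for binary decision trees of depth k over n attributes,
    using the slide's recursion L_0 = 1, L_{k+1} = log2(n) + 2 L_k,
    whose closed form is L_k = (2^k - 1)(1 + log2 n) + 1."""
    return (2 ** depth - 1) * (1 + math.log2(n_attributes)) + 1


def samples_needed(depth: int, n_attributes: int, eps: float, delta: float) -> float:
    """Consistent-learner (Haussler-style) sample complexity:
    m >= (ln|H_k| + ln(1/delta)) / eps, with ln|H_k| = L_k * ln(2)."""
    ln_H = log2_num_trees(depth, n_attributes) * math.log(2)
    return (ln_H + math.log(1.0 / delta)) / eps


if __name__ == "__main__":
    # The required m roughly doubles with each extra level of depth:
    # "number of points is exponential in depth".
    for k in range(1, 7):
        m = samples_needed(depth=k, n_attributes=20, eps=0.1, delta=0.05)
        print(f"depth {k}: m >= {math.ceil(m)}")
```

With these example settings, depth 6 already asks for a few thousand samples, which is exactly the blow-up the slide is warning about.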
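Returning to the two use cases on the "Using a PAC bound" slide: they are the same inequality solved either for m or for ε. Below is a minimal sketch, assuming the consistent-case bound |H| e^(−mε) ≤ δ and the agnostic bound error_true(h) ≤ error_train(h) + sqrt((ln|H| + ln(1/δ)) / (2m)) exactly as stated above; the function names and example values are illustrative, not part of the lecture.

```python
import math


def m_for_consistent_learner(size_H: int, eps: float, delta: float) -> int:
    """Use case 1: pick eps and delta, get m.
    Solve |H| * exp(-m * eps) <= delta for m (Haussler '88, consistent h)."""
    return math.ceil((math.log(size_H) + math.log(1.0 / delta)) / eps)


def eps_for_given_m(size_H: int, m: int, delta: float) -> float:
    """Use case 2: pick m and delta, get eps.
    With prob. >= 1 - delta, error_true(h) <= error_train(h) + eps, where
    eps = sqrt((ln|H| + ln(1/delta)) / (2m))  (Hoeffding + union bound)."""
    return math.sqrt((math.log(size_H) + math.log(1.0 / delta)) / (2 * m))


if __name__ == "__main__":
    H = 2 ** 20  # an arbitrary finite hypothesis space with 2^20 hypotheses
    print(m_for_consistent_learner(H, eps=0.05, delta=0.01))   # samples needed
    print(round(eps_for_given_m(H, m=10_000, delta=0.01), 4))  # guaranteed gap
```

Note that use case 2 relies on the agnostic (Hoeffding plus union bound) version, so it applies even when the learned hypothesis is not consistent with the training data.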