Decision Trees and Boosting
Machine Learning – 10701/15781
Carlos Guestrin
Carnegie Mellon University
February 6th, 2006
©2006 Carlos Guestrin

Readings:
- Decision Trees: many possible refs., e.g., Mitchell, Chapter 3
- Boosting: Schapire '01 (linked from class website)

Contents:
- Decision Trees
- Boosting
- Announcements
- Linear separability
- Not linearly separable data
- Addressing non-linearly separable data – Option 1, non-linear features
- Addressing non-linearly separable data – Option 2, non-linear classifier
- A small dataset: Miles Per Gallon
- A Decision Stump
- Recursion Step
- Recursion Step
- Second level of tree
- The final tree
- Classification of a new example
- Are all decision trees equal?
- Learning decision trees is hard!!!
- Choosing a good attribute
- Measuring uncertainty
- Entropy
- Andrew Moore's Entropy in a nutshell
- Andrew Moore's Entropy in a nutshell
- Information gain
- Learning decision trees
- Information Gain Example
- Look at all the information gains…
- A Decision Stump
- Base Cases
- Base Cases: An idea
- The problem with Base Case 3
- If we omit Base Case 3:
- Basic Decision Tree Building Summarized
- Real-Valued inputs
- "One branch for each numeric value" idea:
- Threshold splits
- Choosing threshold split
- A better idea: thresholded splits
- Example with MPG
- Example tree using reals
- Decision trees & Learning Bias
- Decision trees will overfit
- A chi-square test
- A chi-square test
- Using Chi-squared to avoid overfitting
- Pruning example
- MaxPchance
- What you need to know about decision trees
- Fighting the bias-variance tradeoff
- Boosting
- Learning from weighted data
- What α_t to choose for hypothesis h_t?
- What α_t to choose for hypothesis h_t?
- What α_t to choose for hypothesis h_t?
- What α_t to choose for hypothesis h_t?
- Strong, weak classifiers
- Boosting: Experimental Results
- Boosting and Logistic Regression
- Boosting and Logistic Regression
- Logistic regression and Boosting
- What you need to know about Boosting
- Acknowledgements

Slide 2: Announcements
- Recitations stay on Thursdays, 5-6:30pm, in Wean 5409. This week: Decision Trees and Boosting.
- Pittsburgh won the Super Bowl!!

Slide 3: Linear separability

A dataset is linearly separable iff there exists a separating hyperplane, i.e., there exists a weight vector w such that:
- w_0 + ∑_i w_i x_i > 0 if x = {x_1, …, x_n} is a positive example
- w_0 + ∑_i w_i x_i < 0 if x = {x_1, …, x_n} is a negative example

Slide 4: Not linearly separable data

Some datasets are not linearly separable!

Slide 5: Addressing non-linearly separable data – Option 1, non-linear features
- Choose non-linear features, e.g.:
  - typical linear features: w_0 + ∑_i w_i x_i
  - example of non-linear features: degree-2 polynomials, w_0 + ∑_i w_i x_i + ∑_{i,j} w_{ij} x_i x_j
- The classifier h_w(x) is still linear in the parameters w, so it is usually easy to learn (closed-form or convex/concave optimization)
- The data becomes linearly separable in the higher-dimensional feature space (see the sketch below)
- More discussion later this semester

Slide 6: Addressing non-linearly separable data – Option 2, non-linear classifier
- Choose a classifier h_w(x) that is non-linear in the parameters w, e.g., decision trees, neural networks, nearest neighbor, …
- More general than linear classifiers
- But, can often be harder to learn (non-convex/concave optimization required)
- But, but, often very useful
- (BTW, later this semester we'll see that these two options are not that different)
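To make Option 1 concrete, here is a minimal sketch of a degree-2 feature expansion; the XOR-style dataset and the hand-picked weight vector are illustrative assumptions, not part of the lecture:

```python
import numpy as np

def degree2_features(X):
    """Map x = (x_1, ..., x_n) to degree-2 polynomial features:
    the original x_i plus all pairwise products x_i * x_j (i <= j).
    A classifier built on these features is non-linear in the original
    inputs but still linear in its parameters w."""
    n = X.shape[1]
    pairs = [X[:, i] * X[:, j] for i in range(n) for j in range(i, n)]
    return np.column_stack([X] + pairs)

# XOR-style data: not linearly separable in the original 2-d space
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])  # positive iff x_1 and x_2 have opposite signs

Phi = degree2_features(X)  # columns: x1, x2, x1^2, x1*x2, x2^2
# In the expanded space the data is linearly separable, e.g. by the
# hyperplane with w_0 = 0 and weight -1 on the x1*x2 coordinate:
scores = Phi @ np.array([0.0, 0.0, 0.0, -1.0, 0.0])
print(np.all(scores * y > 0))  # True: every example is on the correct side
```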
Slide 7: A small dataset: Miles Per Gallon

Suppose we want to predict MPG. 40 records, from the UCI repository (thanks to Ross Quinlan):

mpg   cylinders  displacement  horsepower  weight   acceleration  modelyear  maker
good  4          low           low         low      high          75to78     asia
bad   6          medium        medium      medium   medium        70to74     america
bad   4          medium        medium      medium   low           75to78     europe
bad   8          high          high        high     low           70to74     america
bad   6          medium        medium      medium   medium        70to74     america
bad   4          low           medium      low      medium        70to74     asia
bad   4          low           medium      low      low           70to74     asia
bad   8          high          high        high     low           75to78     america
:     :          :             :           :        :             :          :
bad   8          high          high        high     low           70to74     america
good  8          high          medium      high     high          79to83     america
bad   8          high          high        high     low           75to78     america
good  4          low           low         low      low           79to83     america
bad   6          medium        medium      medium   high          75to78     america
good  4          medium        low         low      low           79to83     america
good  4          low           low         medium   high          79to83     america
bad   8          high          high        high     low           70to74     america
good  4          low           medium      low      medium        75to78     europe
bad   5          medium        medium      medium   medium        75to78     europe

Slide 8: A Decision Stump

[Figure: a decision stump (one-level decision tree) splitting the MPG data on the cylinders attribute.]

Slide 9: Recursion Step

Take the original dataset and partition it according to the value of the attribute we split on:
- records in which cylinders = 4
- records in which cylinders = 5
- records in which cylinders = 6
- records in which cylinders = 8

Slide 10: Recursion Step

Build a tree from each partition: one tree from the records in which cylinders = 4, one from the records in which cylinders = 5, one from the records in which cylinders = 6, and one from the records in which cylinders = 8.

Slide 11: Second level of tree

Recursively build a tree from the seven records in which there are four cylinders and the maker was based in Asia. (Similar recursion in the other cases.)

Slide 12: The final tree

[Figure: the fully grown decision tree for the MPG data.]

Slide 13: Classification of a new example

Classifying a test example: traverse the tree and report the leaf label.

Slide 14: Are all decision trees equal?
- Many trees can represent the same concept
- But, not all trees will have the same size!
- e.g., φ = (A ∧ B) ∨ (¬A ∧ C), i.e., (A and B) or (not A and C)

Slide 15: Learning decision trees is hard!!!
- Learning the simplest (smallest) decision tree is an NP-complete problem [Hyafil & Rivest '76]
- Resort to a greedy heuristic (sketched in code below):
  - start from an empty decision tree
  - split on the next best attribute (feature)
  - recurse
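The greedy heuristic above can be sketched in a few lines. This is an illustrative ID3-style implementation, not the lecture's code; it assumes categorical attributes (rows as dicts) and uses the entropy-based information gain criterion that the following slides develop:

```python
import math
from collections import Counter

def entropy(labels):
    """H(Y) = -sum_i P(Y = y_i) log2 P(Y = y_i), estimated from counts."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """IG = H(Y) - sum_v P(attr = v) * H(Y | attr = v)."""
    n = len(labels)
    cond = 0.0
    for v in set(r[attr] for r in rows):
        subset = [y for r, y in zip(rows, labels) if r[attr] == v]
        cond += len(subset) / n * entropy(subset)
    return entropy(labels) - cond

def build_tree(rows, labels, attrs):
    """Greedy recursion: split on the next best attribute, then recurse.
    Base cases: all labels agree, or no attributes left (majority vote)."""
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority label
    best = max(attrs, key=lambda a: information_gain(rows, labels, a))
    children = {}
    for v in set(r[best] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[best] == v]
        children[v] = build_tree([rows[i] for i in idx],
                                 [labels[i] for i in idx],
                                 [a for a in attrs if a != best])
    return (best, children)

# Tiny MPG-flavored example (made-up rows, for illustration only):
rows = [{"cylinders": "4", "maker": "asia"},
        {"cylinders": "8", "maker": "america"},
        {"cylinders": "4", "maker": "europe"},
        {"cylinders": "8", "maker": "america"}]
labels = ["good", "bad", "good", "bad"]
print(build_tree(rows, labels, ["cylinders", "maker"]))
# e.g. ('cylinders', {'4': 'good', '8': 'bad'})
```

Note that the greedy recursion does not escape the NP-hardness result; it simply produces a tree that is good in practice, not the smallest one.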
Slide 16: Choosing a good attribute

Which attribute should we split on? Consider this dataset:

X1  X2  Y
T   T   T
T   F   T
T   T   T
T   F   T
F   T   T
F   F   F
F   T   F
F   F   F

Slide 17: Measuring uncertainty
- A split is good if we are more certain about the classification after the split:
  - deterministic is good (all true or all false)
  - a uniform distribution is bad
- Compare two distributions over Y ∈ {A, B, C, D}:
  - P(Y=A) = 1/4, P(Y=B) = 1/4, P(Y=C) = 1/4, P(Y=D) = 1/4 (uniform: most uncertain)
  - P(Y=A) = 1/2, P(Y=B) = 1/4, P(Y=C) = 1/8, P(Y=D) = 1/8 (skewed: more predictable)
- The measure that captures this is entropy.

Slide 18: Entropy

Entropy H(Y) of a random variable Y:

H(Y) = -∑_i P(Y = y_i) log_2 P(Y = y_i)

- More uncertainty, more entropy!
- Information Theory interpretation: H(Y) is the expected number of bits needed to encode a randomly drawn value of Y (under the most efficient code).

Slide 19: Andrew Moore's Entropy in a nutshell

[Figure: two example distributions of sampled values, one labeled Low Entropy and one labeled High Entropy.]

Slide 20: Andrew Moore's Entropy in a nutshell

High Entropy: the values (locations of soup) are unpredictable, almost uniformly sampled throughout the space.
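As a quick numeric check (not part of the lecture), the entropies of the distributions from the "Measuring uncertainty" slide:

```python
import math

def entropy(probs):
    """H(Y) = -sum_i P(Y = y_i) log2 P(Y = y_i); terms with p = 0 contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([1/4, 1/4, 1/4, 1/4]))  # 2.0 bits: uniform, maximal uncertainty
print(entropy([1/2, 1/4, 1/8, 1/8]))  # 1.75 bits: skewed, more predictable
print(entropy([1.0, 0.0, 0.0, 0.0]))  # 0.0 bits: deterministic, no uncertainty
```

The uniform distribution needs a full 2 bits per symbol, while the skewed one can be encoded with 1.75 bits on average (e.g., codewords 0, 10, 110, 111), matching the "expected number of bits" interpretation on the Entropy slide.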