ICS 273A Intro Machine Learning
Lecture 2: decision trees, random forests, bagging, boosting

Decision Trees
Problem: decide whether to wait for a table at a restaurant, based on the following attributes:
1. Alternate: is there an alternative restaurant nearby?
2. Bar: is there a comfortable bar area to wait in?
3. Fri/Sat: is today Friday or Saturday?
4. Hungry: are we hungry?
5. Patrons: number of people in the restaurant (None, Some, Full)
6. Price: price range ($, $$, $$$)
7. Raining: is it raining outside?
8. Reservation: have we made a reservation?
9. Type: kind of restaurant (French, Italian, Thai, Burger)
10. WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60)

Attribute-based representations
• Examples are described by attribute values (Boolean, discrete, continuous), e.g., situations where I will/won't wait for a table.
• The classification of each example is positive (T) or negative (F).
• General form of the data: N instances, each with attributes (x1, x2, x3, ..., xd) and a target value y.

Decision trees
• One possible representation for hypotheses.
• We imagine someone taking a sequence of decisions.
• E.g., here is the "true" tree for deciding whether to wait. Note that you can use the same attribute more than once.

Expressiveness
• Decision trees can express any function of the input attributes. E.g., for Boolean functions, each truth-table row corresponds to a path to a leaf.
• Trivially, there is a consistent decision tree for any training set, with one path to a leaf for each example (unless f is nondeterministic in x), but it probably won't generalize to new examples.
• Prefer more compact decision trees: we don't want to memorize the data, we want to find structure in the data!

Hypothesis spaces
How many distinct decision trees are there with n Boolean attributes?
= number of Boolean functions
= number of distinct truth tables with 2^n rows
= 2^(2^n)
• E.g., with 6 Boolean attributes there are 2^(2^6) = 2^64 = 18,446,744,073,709,551,616 trees.
• E.g., for n = 2: 2^2 = 4 rows, and for each row we can choose T or F, giving 2^4 = 16 functions.

Decision tree learning
• If there are so many possible trees, can we actually search this space? (Solution: greedy search.)
• Aim: find a small tree consistent with the training examples.
• Idea: (recursively) choose the "most significant" attribute as the root of each (sub)tree.

Choosing an attribute
• Idea: a good attribute splits the examples into subsets that are (ideally) "all positive" or "all negative".
• Patrons or Type? After splitting on Type, to wait or not to wait is still at 50%.

Using information theory
• Entropy measures the amount of uncertainty in a probability distribution.
• Consider tossing a biased coin. If you toss the coin VERY often, the frequency of heads is, say, p, and hence the frequency of tails is 1-p (fair coin: p = 0.5). The uncertainty in any actual outcome is given by the entropy:
  H(p) = -p log2(p) - (1-p) log2(1-p)
• Note: the uncertainty is zero if p = 0 or p = 1, and maximal if p = 0.5.
• If there are more than two states s = 1, 2, ..., n (e.g., a die) we have:
  H = -Σ_s p(s) log2 p(s)
• Imagine we have p examples which are true (positive) and n examples which are false (negative). Our best estimates of the probabilities of true and false are p/(p+n) and n/(p+n), hence the entropy is:
  I(p/(p+n), n/(p+n)) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n))
• How much information do we gain if we disclose the value of some attribute?
  Answer: uncertainty before minus uncertainty after.

Example
Before: entropy = -1/2 log2(1/2) - 1/2 log2(1/2) = log2(2) = 1 bit: there is "1 bit of information to be discovered".
After, for Type: if we go into branch "French" we have 1 bit of remaining uncertainty, and similarly for the others (French: 1 bit, Italian: 1 bit, Thai: 1 bit, Burger: 1 bit). On average: still 1 bit. We gained nothing!
After, for Patrons: in branches "None" and "Some" the entropy = 0! In "Full" the entropy = -1/3 log2(1/3) - 2/3 log2(2/3) ≈ 0.918 bits. So Patrons gains more information!
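The numbers in this example can be checked with a few lines of code. Below is a minimal sketch (not part of the original slides) that computes the entropy and the information gain of the Patrons and Type splits; the per-branch positive/negative counts follow the standard 12-example restaurant data set (None: 0+/2-, Some: 4+/0-, Full: 2+/4-, and an even positive/negative split in every Type branch), which matches the branch weights 1/6, 1/3 and 1/2 quoted on the next slide.

```python
import math

def entropy(pos, neg):
    """Entropy (in bits) of a set with `pos` positive and `neg` negative examples."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count > 0:
            p = count / total
            h -= p * math.log2(p)
    return h

def information_gain(parent, branches):
    """Uncertainty before the split minus the weighted uncertainty of the branches.
    `parent` and each branch are (pos, neg) count pairs."""
    total = sum(p + n for p, n in branches)
    remainder = sum((p + n) / total * entropy(p, n) for p, n in branches)
    return entropy(*parent) - remainder

parent = (6, 6)                               # 12 examples, 6 positive and 6 negative
patrons = [(0, 2), (4, 0), (2, 4)]            # branches None, Some, Full
rtype = [(1, 1), (1, 1), (2, 2), (2, 2)]      # branches French, Italian, Thai, Burger

print(information_gain(parent, patrons))      # ~0.541 bits
print(information_gain(parent, rtype))        # 0.0 bits -- "we gained nothing"
```

This reproduces the conclusion above: splitting on Type leaves a full bit of uncertainty on average, while splitting on Patrons removes roughly half of it.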
Information Gain
• How do we combine the branches? 1/6 of the time we enter "None", so we weight "None" with 1/6. Similarly, "Some" has weight 1/3 and "Full" has weight 1/2.
• Remainder(A) = Σ_i (weight of branch i) × (entropy of branch i) = Σ_i ((p_i + n_i)/(p + n)) I(p_i/(p_i+n_i), n_i/(p_i+n_i)).
• Information Gain (IG), the reduction in entropy from the attribute test:
  IG(A) = I(p/(p+n), n/(p+n)) - Remainder(A)
• Choose the attribute with the largest IG.
• For the training set, p = n = 6, so I(6/12, 6/12) = 1 bit.
• Patrons has the highest IG of all attributes and so is chosen by the DTL algorithm as the root.
• Is it sensible to use the same attribute twice on a single branch of the tree (why)?

Example contd.
• Decision tree learned from the 12 examples:
• Substantially simpler than the "true" tree---a more complex hypothesis isn't justified by the small amount of data.

Gain-Ratio
• If one attribute splits the data into many more classes than another, it has an (unfair) advantage if we use information gain.
• The gain-ratio is designed to compensate for this problem: divide the information gain by the entropy of the split itself,
  GainRatio(A) = IG(A) / ( -Σ_j (|S_j|/|S|) log2(|S_j|/|S|) ).
• If an attribute splits the data into n uniformly populated classes, the denominator is log2(n), which penalizes it relative to the log2(2) = 1 obtained for 2 uniformly populated classes.

What to Do if...
• In some leaf there are no examples: choose True or False according to the number of positive/negative examples at the parent.
• There are no attributes left, but two or more examples have the same attributes yet different labels: we have an error/noise. Stop and use a majority vote.
• Demo: http://www.cs.ubc.ca/labs/lci/CIspace/Version4/dTree/

Continuous Variables
• If variables are continuous we can bin them, or...
• We can learn a simple classifier on a single dimension. E.g., we can find a decision point which classifies all data to the left of that point in one class and all data to the right in the other (decision stump -- next slide).
• We can also use a small subset of dimensions and train a linear classifier on them (e.g., a logistic regression classifier).

Decision Stump
• Data: {(X1, Y1), (X2, Y2), ..., (Xn, Yn)}, where the Xi are attributes (e.g., temperature outside) and the Yi are labels (e.g., True or False, 0 or 1, -1 or +1).
• We choose one attribute "i", a sign "+/-", and a threshold. This determines a half space to be +1 and the other half -1.
• (The slide's figure shows a threshold on the Xi axis separating a region labeled -1 from a region labeled +1.)

When to Stop?
• If we keep going until perfect classification we might over-fit.
• Heuristics:
  • Stop when the Info-Gain (Gain-Ratio) is smaller than a threshold.
  • Stop when there are fewer than M examples in a leaf node.
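As an illustration of the decision stump described above, here is a small sketch (the name best_stump and the toy data are my own, not from the lecture) that scans the midpoints between sorted values of one continuous attribute and returns the threshold and sign with the largest information gain.

```python
import math

def entropy(labels):
    """Entropy (in bits) of a list of -1/+1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(1 for y in labels if y == +1) / n
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)

def best_stump(x, y):
    """Return (threshold, sign, gain) for a single continuous attribute x,
    where sign=+1 means 'predict +1 to the right of the threshold'."""
    pairs = sorted(zip(x, y))
    xs = [a for a, _ in pairs]
    ys = [b for _, b in pairs]
    n = len(ys)
    base = entropy(ys)
    best = (None, +1, -1.0)
    for i in range(1, n):
        if xs[i] == xs[i - 1]:
            continue                      # no valid cut between equal values
        thr = (xs[i] + xs[i - 1]) / 2     # midpoint between adjacent sorted values
        left, right = ys[:i], ys[i:]
        remainder = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
        gain = base - remainder
        if gain > best[2]:
            sign = +1 if sum(right) >= 0 else -1   # majority label on the right side
            best = (thr, sign, gain)
    return best

# toy data: one continuous attribute (e.g. temperature), labels -1/+1
x = [12.0, 15.5, 17.0, 21.0, 24.5, 30.0]
y = [-1,   -1,   -1,   +1,   +1,   +1]
print(best_stump(x, y))   # threshold around 19, sign +1, gain 1.0
```

Scanning only the midpoints of adjacent sorted values is enough, because the information gain can only change when the threshold crosses a data point.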
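Putting the pieces together, the sketch below grows a tree greedily in the way the lecture describes: pick the attribute with the largest information gain, recurse on each branch, and stop using the heuristics just listed (an info-gain threshold and a minimum number of examples), plus the empty-leaf and majority-vote rules from the "What to Do if..." slide. The function and parameter names (grow_tree, min_gain, min_examples) and the tiny data set are illustrative assumptions, not the textbook's DTL pseudocode.

```python
import math
from collections import Counter

def entropy(examples):
    """Entropy (in bits) of the labels in a list of (attributes, label) pairs."""
    n = len(examples)
    counts = Counter(label for _, label in examples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def info_gain(examples, attr):
    """Uncertainty before the split minus the weighted uncertainty after it."""
    n = len(examples)
    branches = {}
    for attrs, label in examples:
        branches.setdefault(attrs[attr], []).append((attrs, label))
    remainder = sum(len(b) / n * entropy(b) for b in branches.values())
    return entropy(examples) - remainder, branches

def majority(examples):
    return Counter(label for _, label in examples).most_common(1)[0][0]

def grow_tree(examples, attributes, parent_examples, min_gain=0.01, min_examples=2):
    if not examples:                      # empty leaf: use the parent's majority label
        return majority(parent_examples)
    labels = {label for _, label in examples}
    if len(labels) == 1 or not attributes or len(examples) < min_examples:
        return majority(examples)         # pure node, no attributes left, or too small
    # greedy step: pick the attribute with the largest information gain
    scored = [(info_gain(examples, a), a) for a in attributes]
    (gain, branches), best = max(scored, key=lambda t: t[0][0])
    if gain < min_gain:                   # "When to Stop" heuristic
        return majority(examples)
    rest = [a for a in attributes if a != best]
    return {best: {value: grow_tree(subset, rest, examples, min_gain, min_examples)
                   for value, subset in branches.items()}}

# tiny illustrative data set (not the 12 restaurant examples)
data = [({"Patrons": "Some", "Hungry": "Yes"}, True),
        ({"Patrons": "None", "Hungry": "No"},  False),
        ({"Patrons": "Full", "Hungry": "Yes"}, True),
        ({"Patrons": "Full", "Hungry": "No"},  False)]
print(grow_tree(data, ["Patrons", "Hungry"], data))
# {'Hungry': {'Yes': True, 'No': False}}
```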