ICS 273A Intro Machine Learning (UCI, Spring '10)
Lecture 2: decision trees, random forests, bagging, boosting

Decision Trees
Problem: decide whether to wait for a table at a restaurant, based on the following attributes:
1. Alternate: is there an alternative restaurant nearby?
2. Bar: is there a comfortable bar area to wait in?
3. Fri/Sat: is today Friday or Saturday?
4. Hungry: are we hungry?
5. Patrons: number of people in the restaurant (None, Some, Full)
6. Price: price range ($, $$, $$$)
7. Raining: is it raining outside?
8. Reservation: have we made a reservation?
9. Type: kind of restaurant (French, Italian, Thai, Burger)
10. WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60)

Attribute-based representations
• Examples are described by attribute values (Boolean, discrete, continuous), e.g. situations where I will/won't wait for a table.
• The classification of each example is positive (T) or negative (F).
• General form of the data: N instances, each with attributes (x1, x2, x3, ..., xd) and a target value y.

Decision trees
• One possible representation for hypotheses.
• We imagine someone taking a sequence of decisions.
• E.g., here is the "true" tree for deciding whether to wait. Note that you can use the same attribute more than once.

Expressiveness
• Decision trees can express any function of the input attributes. E.g., for Boolean functions, each truth-table row corresponds to a path from root to leaf.
• Trivially, there is a consistent decision tree for any training set, with one path to a leaf for each example (unless f is nondeterministic in x), but it probably won't generalize to new examples.
• Prefer to find more compact decision trees: we don't want to memorize the data, we want to find structure in the data!

Hypothesis spaces
How many distinct decision trees are there with n Boolean attributes?
= number of Boolean functions
= number of distinct truth tables with 2^n rows
= 2^(2^n)
• E.g., n = 2: 2^2 = 4 rows, and for each row we can choose T or F, giving 2^4 = 16 functions.
• With 6 Boolean attributes there are 2^(2^6) = 18,446,744,073,709,551,616 distinct functions (trees).

Decision tree learning
• If there are so many possible trees, can we actually search this space? (Solution: greedy search.)
• Aim: find a small tree consistent with the training examples.
• Idea: (recursively) choose the "most significant" attribute as the root of each (sub)tree.

Choosing an attribute
• Idea: a good attribute splits the examples into subsets that are (ideally) "all positive" or "all negative".
• Patrons or Type? After splitting on Type, "wait" vs. "don't wait" is still at 50% in every branch.

Using information theory
• Entropy measures the amount of uncertainty in a probability distribution.
• Consider tossing a biased coin. If you toss the coin VERY often, the frequency of heads is, say, p, and hence the frequency of tails is 1-p (fair coin: p = 0.5). The uncertainty in any actual outcome is given by the entropy
  H(p) = -p log2(p) - (1-p) log2(1-p).
• Note: the uncertainty is zero if p = 0 or 1, and maximal if p = 0.5.
• If there are more than two states s = 1, 2, ..., n (e.g. a die) we have
  H = -sum over s of p(s) log2 p(s).
• Imagine we have p examples which are true (positive) and n examples which are false (negative). Our best estimates of P(true) and P(false) are p/(p+n) and n/(p+n), hence the entropy is
  I(p/(p+n), n/(p+n)) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n)).
• How much information do we gain if we disclose the value of some attribute?
• Answer: uncertainty before minus uncertainty after.
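As a concrete illustration of these formulas, here is a minimal Python sketch (not part of the original slides) that computes the entropy and information gain for the Patrons and Type splits worked out in the example that follows. The branch counts assume the standard 12-example restaurant data set used in the lecture.

import math

def entropy(probs):
    # Entropy in bits of a discrete distribution given as a list of probabilities.
    return -sum(p * math.log2(p) for p in probs if p > 0)

def remainder(branches):
    # Expected entropy after the split; branches is a list of
    # (num_positive, num_negative) pairs, one per branch.
    total = sum(p + n for p, n in branches)
    return sum((p + n) / total * entropy([p / (p + n), n / (p + n)])
               for p, n in branches)

def information_gain(branches):
    # Entropy before the split minus expected entropy after it.
    p = sum(b[0] for b in branches)
    n = sum(b[1] for b in branches)
    return entropy([p / (p + n), n / (p + n)]) - remainder(branches)

patrons = [(0, 2), (4, 0), (2, 4)]          # None, Some, Full
type_   = [(1, 1), (1, 1), (2, 2), (2, 2)]  # French, Italian, Thai, Burger

print(information_gain(patrons))  # ~0.541 bits
print(information_gain(type_))    # 0.0 bits: "we gained nothing"

Running this gives IG(Patrons) of roughly 0.54 bits and IG(Type) of exactly 0 bits, matching the worked example below.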
Example
• Before the split: Entropy = -1/2 log2(1/2) - 1/2 log2(1/2) = log2(2) = 1 bit. There is "1 bit of information to be discovered".
• After splitting on Type: if we go into the "French" branch we still have 1 bit, and similarly for the others (French: 1 bit, Italian: 1 bit, Thai: 1 bit, Burger: 1 bit). On average we are still at 1 bit: we gained nothing!
• After splitting on Patrons: in the "None" and "Some" branches the entropy is 0, and in the "Full" branch it is -1/3 log2(1/3) - 2/3 log2(2/3). So Patrons gains more information!

Information Gain
• How do we combine branches? 1/6 of the time we enter "None", so we weight "None" with 1/6; similarly, "Some" has weight 1/3 and "Full" has weight 1/2. In general, each branch's entropy is weighted by the fraction of examples that fall into that branch:
  Remainder(A) = sum over branches i of ((p_i + n_i)/(p + n)) * I(p_i/(p_i + n_i), n_i/(p_i + n_i)).
• The Information Gain (IG), or reduction in entropy, from the attribute test is
  IG(A) = I(p/(p+n), n/(p+n)) - Remainder(A).
• Choose the attribute with the largest IG.
• For the training set, p = n = 6, so I(6/12, 6/12) = 1 bit. Patrons has the highest IG of all attributes and so is chosen by the DTL algorithm as the root.
• Is it sensible to test the same attribute twice on a single branch of the tree (why)?

Example contd.
• The decision tree learned from the 12 examples is substantially simpler than the "true" tree: a more complex hypothesis isn't justified by such a small amount of data.

Gain-Ratio
• If one attribute splits into many more classes than another, it has an (unfair) advantage if we use information gain.
• The gain-ratio is designed to compensate for this problem: it divides the information gain by the entropy of the split itself,
  GainRatio(A) = IG(A) / ( -sum over branches i of ((p_i + n_i)/(p + n)) log2((p_i + n_i)/(p + n)) ).
• If we have n uniformly populated classes the denominator is log2(n), which penalizes the attribute relative to log2(2) = 1 for 2 uniformly populated classes.

What to Do if...
• Some leaf has no examples: choose True or False according to the number of positive/negative examples at its parent.
• There are no attributes left, but two or more examples have the same attributes and different labels: we have an error (noise). Stop and use a majority vote.
• Demo: http://www.cs.ubc.ca/labs/lci/CIspace/Version4/dTree/

Continuous Variables
• If variables are continuous we can bin them, or...
• We can learn a simple classifier on a single dimension: e.g. find a decision point which classifies all data to the left of that point in one class and all data to the right in the other (a decision stump, next slide).
• We can also use a small subset of dimensions and train a linear classifier on them (e.g. a logistic regression classifier).

Decision Stump
• Data: {(X1,Y1), (X2,Y2), ..., (Xn,Yn)}, where the Xi are attributes (e.g. temperature outside) and the Yi are labels (e.g. True or False; 0 or 1; -1 or +1).
• We choose one attribute "i", a sign "+/-", and a threshold. This determines a half-space to be classified +1 and the other half -1.

When to Stop?
• If we keep going until perfect classification we might over-fit.
• Heuristics:
  • Stop when the Info-Gain (Gain-Ratio) is smaller than a threshold.
  • Stop when there are M examples in each leaf node.
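Pulling these pieces together, the following self-contained Python sketch (my own illustration, not code from the lecture) grows a tree greedily for discrete attributes: it picks the attribute with the largest information gain, uses the parent's majority label for empty leaves, takes a majority vote when no attributes remain, and applies the two stopping heuristics above through the assumed parameters min_gain and min_leaf.

import math
from collections import Counter

def entropy(labels):
    # Entropy in bits of a list of class labels.
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(examples, labels, attr):
    # Entropy before splitting on attr minus the weighted entropy of the branches.
    after = 0.0
    for value in set(x[attr] for x in examples):
        branch = [y for x, y in zip(examples, labels) if x[attr] == value]
        after += len(branch) / len(labels) * entropy(branch)
    return entropy(labels) - after

def learn_tree(examples, labels, attrs, min_gain=0.01, min_leaf=2, parent_majority=None):
    # Grow a tree recursively; leaves are class labels, internal nodes are
    # (attribute, {value: subtree}) pairs.
    if not examples:                      # empty leaf: use the parent's majority label
        return parent_majority
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attrs or len(examples) <= min_leaf:
        return majority                   # pure node, no attributes left, or small leaf
    gains = {a: info_gain(examples, labels, a) for a in attrs}
    best = max(gains, key=gains.get)
    if gains[best] < min_gain:
        return majority                   # info-gain below threshold: stop to avoid over-fitting
    branches = {}
    for value in set(x[best] for x in examples):
        idx = [i for i, x in enumerate(examples) if x[best] == value]
        branches[value] = learn_tree([examples[i] for i in idx],
                                     [labels[i] for i in idx],
                                     [a for a in attrs if a != best],
                                     min_gain, min_leaf, majority)
    return (best, branches)

# Toy usage with two of the restaurant attributes:
X = [{"Patrons": "Some", "Hungry": "T"}, {"Patrons": "Full", "Hungry": "T"},
     {"Patrons": "None", "Hungry": "F"}, {"Patrons": "Some", "Hungry": "F"}]
y = ["T", "T", "F", "T"]
print(learn_tree(X, y, ["Patrons", "Hungry"]))  # splits on Patrons first (highest gain)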

