Rutgers University CS 536 - Decision Tree Learning

Chapter 3: Decision Tree Learning (part 2)
CS 536: Machine Learning
Littman (Wu, TA)

Administration
• Books? Two on reserve in the math library.
• iCML-03: instructional Conference on Machine Learning
• Mailing list! (mailing Weka instructions)

What is ID3 Optimizing?
How would you find a tree that minimizes:
• misclassified examples?
• expected entropy?
• expected number of tests?
• depth of tree given a fixed accuracy?
• etc.?
How do you decide if one tree beats another?

Hypothesis Space Search by ID3
ID3:
• representation: trees
• scoring: entropy
• search: greedy

Hypothesis Space Search by ID3
• Hypothesis space is complete!
  – Target function surely in there...
• Outputs a single hypothesis (which one?)
  – Can't play 20 questions...
• No backtracking
  – Local minima...
• Statistically-based search choices
  – Robust to noisy data...
• Inductive bias ≈ "prefer shortest tree"

Inductive Bias in ID3
Note H is the power set of instances X
• Unbiased? Not really...
• Preference for short trees, and for those with high information gain attributes near the root
• Bias is a preference for some hypotheses, rather than a restriction of the hypothesis space H
• Occam's razor: prefer the shortest hypothesis that fits the data

Occam's Razor
Why prefer short hypotheses?
Argument in favor:
• Fewer short hyps. than long hyps.
  – a short hyp that fits the data is unlikely to be coincidence
  – a long hyp that fits the data might be coincidence
Argument opposed:
• There are many ways to define small sets of hyps
• e.g., all trees with a prime number of nodes that use attributes beginning with "Z"
• What's so special about small sets based on size of hypothesis??

Overfitting
Consider adding noisy training example #15:
  Sunny, Hot, Normal, Strong, PlayTennis = No
What effect on the earlier tree?

Overfitting
Consider the error of hypothesis h over
• training data: error_train(h)
• the entire distribution D of data: error_D(h)
Hypothesis h in H overfits the training data if there is an alternative hypothesis h' in H such that
• error_train(h) < error_train(h'), and
• error_D(h) > error_D(h')

Overfitting in Learning
(figure-only slide; plot omitted from the text extraction)

Avoiding Overfitting
How can we avoid overfitting?
• stop growing when a data split is not statistically significant
• grow the full tree, then post-prune (DP alg!)
How to select the "best" tree:
• Measure performance over training data
• Measure performance over a separate validation data set
• MDL: minimize size(tree) + size(misclassifications(tree))

Reduced-Error Pruning
Split data into training and validation sets.
Do until further pruning is harmful:
1. Evaluate the impact on the validation set of pruning each possible node (plus those below it)
2. Greedily remove the one that most improves validation set accuracy
• produces the smallest version of the most accurate subtree
• What if data is limited?
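
The greedy loop above is concrete enough to sketch in code. The sketch below is not the course's implementation: it assumes a toy tree representation (a Node holding a discrete attribute test, a dict of children, and the majority class of the training examples that reached it), and all names are chosen for illustration only.

from dataclasses import dataclass, field

@dataclass
class Node:
    attribute: str = None                              # attribute tested here (None for a leaf)
    children: dict = field(default_factory=dict)       # attribute value -> child Node
    majority: str = None                               # most common class among examples at this node
    label: str = None                                  # class label if this node is a leaf

def classify(node, example):
    """Route an example down the tree; fall back to the node's majority class if stuck."""
    if node.label is not None:
        return node.label
    child = node.children.get(example.get(node.attribute))
    return classify(child, example) if child else node.majority

def accuracy(tree, data):
    """data: list of (example_dict, label) pairs."""
    return sum(classify(tree, x) == y for x, y in data) / len(data)

def internal_nodes(node):
    """All non-leaf nodes: the candidates for pruning."""
    if node.label is not None:
        return []
    return [node] + [n for c in node.children.values() for n in internal_nodes(c)]

def reduced_error_prune(tree, validation):
    """Greedily turn into a leaf the node whose removal most improves
    validation accuracy; stop when no removal helps."""
    while True:
        base = accuracy(tree, validation)
        best_gain, best_node = 0.0, None
        for node in internal_nodes(tree):
            saved = (node.attribute, node.children, node.label)
            node.attribute, node.children, node.label = None, {}, node.majority
            gain = accuracy(tree, validation) - base         # effect of pruning this node
            node.attribute, node.children, node.label = saved  # undo for now
            if gain > best_gain:
                best_gain, best_node = gain, node
        if best_node is None:
            return tree                                      # further pruning is harmful
        best_node.attribute, best_node.children, best_node.label = None, {}, best_node.majority

In this sketch a node is pruned by replacing it with a leaf labeled with its majority class, and pruning continues only while validation accuracy strictly improves, which matches the "until further pruning is harmful" stopping rule.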

Effect of Pruning
(figure-only slide; plot omitted from the text extraction)

Rule Post-Pruning
1. Convert the tree to an equivalent set of rules
2. Prune each rule independently of the others
3. Sort the final rules into the desired sequence for use
Perhaps the most frequently used method (e.g., C4.5)

Converting Tree to Rules
(figure-only slide; tree diagram omitted from the text extraction)

The Rules
IF (Outlook = Sunny) ^ (Humidity = High) THEN PlayTennis = No
IF (Outlook = Sunny) ^ (Humidity = Normal) THEN PlayTennis = Yes
…

Continuous Valued Attributes
Create a discrete attribute to test a continuous one:
• Temp = 82.5
• (Temp > 72.3) = t, f

  Temp:       40  48  60  72  80  90
  PlayTennis: No  No  Yes Yes Yes No

(a code sketch of threshold selection appears below)

Attributes with Many Values
Problem:
• If one attribute has many values compared to the others, Gain will select it
• Imagine using Date = Jun_3_1996 as an attribute
One approach: use GainRatio instead
  GainRatio(S, A) ≡ Gain(S, A) / SplitInfo(S, A)
  SplitInfo(S, A) ≡ − Σ_{i=1..c} (|S_i|/|S|) log2(|S_i|/|S|)
where S_i is the subset of S for which A has value v_i
(GainRatio is sketched in code below)

Attributes with Costs
Consider
• medical diagnosis: BloodTest has cost $150
• robotics: Width_from_1ft has cost 23 sec.
How to learn a consistent tree with low expected cost? Find a min-cost tree.
Another approach: replace gain by
• Tan and Schlimmer (1990): Gain(S, A)^2 / Cost(A)
• Nunez (1988): (2^Gain(S, A) − 1) / (Cost(A) + 1)^w, where w in [0, 1] determines the importance of cost
(both measures are sketched in code below)

Unknown Attribute Values
Some examples are missing values of A?
Use the training example anyway; sort it through the tree:
• If node n tests A, assign the most common value of A among the other examples sorted to node n
• assign the most common value of A among the other examples with the same target value
• assign probability p_i to each possible value v_i of A (perhaps as above)
  – assign fraction p_i of the example to each descendant in the tree
• Classify new examples in the same fashion.
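
The last strategy on the Unknown Attribute Values slide (send fraction p_i of an example down each branch) can be sketched as follows. This is only an illustration, not C4.5's actual code: it assumes examples are dicts that simply omit a missing attribute, that every example carries a weight so fractions compose as the example is split repeatedly, and that at least one example at the node has a known value for the attribute.

from collections import defaultdict

def split_with_missing(rows, attribute):
    """rows: list of (example_dict, label, weight).
    Returns attribute value -> list of rows, distributing rows that are
    missing `attribute` fractionally according to the observed distribution."""
    # p_i: observed frequency of each known value of the attribute at this node.
    counts = defaultdict(float)
    for example, _, weight in rows:
        if attribute in example:
            counts[example[attribute]] += weight
    total = sum(counts.values())
    p = {value: w / total for value, w in counts.items()}

    branches = defaultdict(list)
    for example, label, weight in rows:
        if attribute in example:                       # known value: full weight to one branch
            branches[example[attribute]].append((example, label, weight))
        else:                                          # missing: fraction p_i to each branch
            for value, p_i in p.items():
                branches[value].append((example, label, weight * p_i))
    return branches

Entropy and gain are then computed from the summed weights in each branch rather than from raw example counts.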
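
For the Continuous Valued Attributes slide, here is a sketch of how a threshold such as (Temp > 72.3) can be chosen automatically. It assumes the usual heuristic of trying midpoints between adjacent sorted values where the class label changes and scoring each candidate by information gain; the function names and data layout are illustrative assumptions.

from collections import Counter
from math import log2

def entropy(labels):
    counts = Counter(labels)
    return -sum((c / len(labels)) * log2(c / len(labels)) for c in counts.values())

def best_threshold(values, labels):
    """Return (threshold, gain) for the best boolean test `value > threshold`."""
    rows = sorted(zip(values, labels))
    base = entropy(labels)
    best = (None, 0.0)
    for (v1, l1), (v2, l2) in zip(rows, rows[1:]):
        if l1 == l2 or v1 == v2:
            continue                              # only boundaries between classes matter
        threshold = (v1 + v2) / 2
        below = [l for v, l in rows if v <= threshold]
        above = [l for v, l in rows if v > threshold]
        gain = base - (len(below) * entropy(below) + len(above) * entropy(above)) / len(rows)
        if gain > best[1]:
            best = (threshold, gain)
    return best

# On the slide's Temp/PlayTennis data the candidate midpoints are 54 and 85; 54 wins.
temps  = [40, 48, 60, 72, 80, 90]
tennis = ["No", "No", "Yes", "Yes", "Yes", "No"]
print(best_threshold(temps, tennis))              # prints (54.0, 0.459...)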
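
The Gain, SplitInfo, and GainRatio definitions on the Attributes with Many Values slide translate directly into code. A minimal sketch, assuming examples are (attribute-dict, label) pairs; the helper names are illustrative.

from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    """Entropy of a collection of class labels: -sum p log2 p."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def gain(rows, attribute):
    """Gain(S, A): expected reduction in entropy from splitting rows on attribute."""
    subsets = defaultdict(list)
    for example, label in rows:
        subsets[example[attribute]].append(label)
    remainder = sum(len(s) / len(rows) * entropy(s) for s in subsets.values())
    return entropy([label for _, label in rows]) - remainder

def split_info(rows, attribute):
    """SplitInfo(S, A) = -sum |S_i|/|S| log2 |S_i|/|S| over the values of A."""
    sizes = Counter(example[attribute] for example, _ in rows)
    total = len(rows)
    return -sum((n / total) * log2(n / total) for n in sizes.values())

def gain_ratio(rows, attribute):
    """GainRatio(S, A) = Gain(S, A) / SplitInfo(S, A)."""
    si = split_info(rows, attribute)
    return gain(rows, attribute) / si if si > 0 else 0.0

For a Date-like attribute that takes a distinct value on every example, SplitInfo is log2 |S|, so GainRatio penalizes it even though its raw Gain is maximal.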
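
The two cost-sensitive selection measures on the Attributes with Costs slide are simple enough to write out directly; the function names below are illustrative, not from the original papers.

def tan_schlimmer(gain, cost):
    """Tan and Schlimmer (1990): Gain(S, A)^2 / Cost(A)."""
    return gain ** 2 / cost

def nunez(gain, cost, w):
    """Nunez (1988): (2^Gain(S, A) - 1) / (Cost(A) + 1)^w, with w in [0, 1];
    w = 0 ignores cost entirely, larger w penalizes expensive attributes more."""
    return (2 ** gain - 1) / (cost + 1) ** w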

