Contents: Decision Tree Learning; Decision Trees; Decision Tree; Top-Down Induction of Decision Trees; Decision tree representation; Tree Uses Nodes and Leaves; Divide and Conquer; Multivariate Trees; * Entropy; * Entropy of a binary variable; Information gain; Training Examples; Selecting the Next Attribute; Partially learned tree; Performance measurement; Why Learning Works; Occam’s razor; Overfitting in Decision Trees; Slide 19; Avoiding over-fitting the data; Reduced error pruning; Rule post-pruning; Slide 23; Rule Extraction from Trees; Split Information?; Attributes with Many Values; Handling training examples with missing attribute values; Handling attributes with differing costs; Summary; Basic Procedures

Decision Tree Learning
Chapter 18.3

Decision Trees
• One of the most widely used and practical methods for inductive inference
• Approximates discrete-valued functions (including disjunctions)
• Can be used for classification

Decision Tree
• A decision tree can represent a disjunction of conjunctions of constraints on the attribute values of instances.
  – Each path corresponds to a conjunction
  – The tree itself corresponds to a disjunction
• If (O = Sunny AND H = Normal) OR (O = Overcast) OR (O = Rain AND W = Weak) then YES

Top-Down Induction of Decision Trees

Decision tree representation
• Each internal node corresponds to a test
• Each branch corresponds to a result of the test
• Each leaf node assigns a classification
Once the tree is trained, a new instance is classified by starting at the root and following the path dictated by the test results for this instance.

Tree Uses Nodes and Leaves

Divide and Conquer
• Internal decision nodes
  – Univariate: uses a single attribute, x_i
    • Numeric x_i: binary split, x_i > w_m
    • Discrete x_i: n-way split for n possible values
  – Multivariate: uses all attributes, x
• Leaves
  – Classification: class labels, or proportions
  – Regression: numeric; the average of r, or a local fit
• The learning algorithm is greedy; find the best split recursively

Multivariate Trees

* Entropy
• “Measure of uncertainty”
• “Expected number of bits to resolve uncertainty”
• Suppose Pr{X = 0} = 1/8
  – If the other events are equally likely, the number of events is 8. To indicate one out of that many events, one needs lg 8 bits.
• Consider a binary random variable X s.t. Pr{X = 0} = 0.1.
  – The expected number of bits: H = 0.1 lg(1/0.1) + 0.9 lg(1/0.9) ≈ 0.47
• In general, if a random variable X has c values with probabilities p_1, …, p_c:
  – The expected number of bits: H = Σ_{i=1}^{c} p_i lg(1/p_i) = -Σ_{i=1}^{c} p_i lg p_i

* Entropy of a binary variable
• H(p) = -p lg p - (1 - p) lg(1 - p)

Information gain
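The entropy formulas above, together with the information-gain criterion named on the preceding slide title, can be computed directly. Below is a minimal Python sketch; it is not part of the original slides. It uses the standard gain formula Gain(S, A) = Entropy(S) - Σ_v (|S_v|/|S|) Entropy(S_v), and the small PlayTennis-style record set is illustrative only.

from collections import Counter
from math import log2

def entropy(labels):
    # H = -sum_i p_i * lg p_i over the class labels in `labels`
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attribute, target="PlayTennis"):
    # Gain(S, A) = Entropy(S) - sum_v (|S_v|/|S|) * Entropy(S_v),
    # where S_v is the subset of examples with A = v
    gain = entropy([ex[target] for ex in examples])
    for value in set(ex[attribute] for ex in examples):
        subset = [ex[target] for ex in examples if ex[attribute] == value]
        gain -= (len(subset) / len(examples)) * entropy(subset)
    return gain

# Illustrative PlayTennis-style records (not the slides' full training table)
examples = [
    {"Outlook": "Sunny",    "Humidity": "High",   "PlayTennis": "No"},
    {"Outlook": "Sunny",    "Humidity": "Normal", "PlayTennis": "Yes"},
    {"Outlook": "Overcast", "Humidity": "High",   "PlayTennis": "Yes"},
    {"Outlook": "Rain",     "Humidity": "Normal", "PlayTennis": "Yes"},
]
print(entropy([ex["PlayTennis"] for ex in examples]))  # ~0.811
print(information_gain(examples, "Outlook"))           # ~0.311

The greedy induction step ("find the best split recursively") simply evaluates this gain for every candidate attribute at a node and splits on the highest-scoring one.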
Training Examples

Selecting the Next Attribute

Partially learned tree

Performance measurement
• How do we know that h ≈ f?
  1. Use theorems of computational/statistical learning theory
  2. Try h on a new test set of examples (use the same distribution over the example space as for the training set)
• Learning curve = % correct on the test set as a function of training set size

Why Learning Works
• There is a theoretical foundation: Computational Learning Theory.
• The underlying principle: any hypothesis that is consistent with a sufficiently large set of training examples is unlikely to be seriously wrong; it must be Probably Approximately Correct (PAC).
• The Stationarity Assumption: the training and test sets are drawn randomly from the same population of examples using the same probability distribution.

Occam’s razor
• Prefer the simplest hypothesis that fits the data
• Support 1
  – Possibly because shorter hypotheses have better generalization ability
• Support 2
  – The number of short hypotheses is small, so it is less likely to be a coincidence if the data fit a short hypothesis

Overfitting in Decision Trees
• Why “over”-fitting?
  – A model can become more complex than the true target function (concept) when it tries to satisfy noisy data as well.
• Definition of overfitting
  – A hypothesis is said to overfit the training data if there exists some other hypothesis that has larger error over the training data but smaller error over the entire distribution of instances.

Overfitting in Decision Trees

Avoiding over-fitting the data
• How can we avoid overfitting? There are two approaches:
  – Stop growing the tree before it perfectly classifies the training data
  – Grow the full tree, then post-prune
    • Reduced error pruning
    • Rule post-pruning
  – The second approach is found to be more useful in practice.
• OK, but how do we determine the optimal size of a tree?
  – Use validation examples to evaluate the effect of pruning (stopping)
  – Use a statistical test to estimate the effect of pruning (stopping)
  – …

Reduced error pruning
• Examine each decision node to see if pruning it decreases the tree’s performance over the evaluation data.
• “Pruning” here means replacing a subtree with a leaf labeled with the most common classification in the subtree. (A sketch of this procedure appears at the end of this preview.)

Rule post-pruning
• Algorithm
  – Build a complete decision tree.
  – Convert the tree to a set of rules.
  – Prune each rule:
    • Remove any precondition whose removal improves accuracy
  – Sort the pruned rules by accuracy and use them in that order.
• This is the most frequently used method
• IF (Outlook = Sunny) ^ (Humidity = High) THEN PlayTennis = No
• IF (Outlook = Sunny) ^ (Humidity = Normal) THEN PlayTennis = Yes
• . . .

Rule Extraction from Trees

Split Information?
• Which is better?
  – In terms of information gain
  – In terms of gain ratio
[Figure: A1, 100 examples (40 positive, 40 negative, 20 negative); A2, 100 examples (10 positive, 10 negative).]

Attributes with Many Values
• One way to penalize such attributes is to use the following alternative measure (a sketch of the computation appears at the end of this preview):
  – GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)
  – SplitInformation(S, A) = -Σ_{i=1}^{c} (|S_i|/|S|) lg(|S_i|/|S|), where S_1, …, S_c are the subsets of S produced by the c values of attribute A
• SplitInformation(S, A) is the entropy of the attribute A itself, experimentally determined from the training samples.

Handling training examples with missing attribute values
• What if an example x is missing the value of an attribute A?
• Simple solution:
  – Use the most common value among the examples at node n.
  – Or use the most common value among the examples at node n that have classification c(x).
• More complex, probabilistic approach:
  – Assign a probability to each of the possible values of A based on the observed frequencies of the various values of A.
  – Then propagate examples down the tree with these probabilities.
  – The same probabilities can be used in the classification of new instances.

Handling attributes with differing costs
• Sometimes, some attribute values are more expensive or difficult to prepare.
  – Medical diagnosis: BloodTest has cost $150
• In practice, it may be desired to postpone acquisition …
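Following up on the “Attributes with Many Values” slide, here is a minimal Python sketch of the gain-ratio computation described there. It is not from the slides: the helper functions and the PlayTennis-style records, including the hypothetical many-valued Day attribute, are illustrative assumptions.

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attribute, target):
    gain = entropy([ex[target] for ex in examples])
    for value in set(ex[attribute] for ex in examples):
        subset = [ex[target] for ex in examples if ex[attribute] == value]
        gain -= (len(subset) / len(examples)) * entropy(subset)
    return gain

def split_information(examples, attribute):
    # SplitInformation(S, A): the entropy of A's own value distribution
    return entropy([ex[attribute] for ex in examples])

def gain_ratio(examples, attribute, target):
    # GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)
    si = split_information(examples, attribute)
    return information_gain(examples, attribute, target) / si if si > 0 else 0.0

# Both attributes have maximal gain on this toy set, but the many-valued
# Day attribute has a large split information, so its gain ratio is penalized.
examples = [
    {"Outlook": "Sunny", "Day": "D1", "PlayTennis": "No"},
    {"Outlook": "Sunny", "Day": "D2", "PlayTennis": "No"},
    {"Outlook": "Sunny", "Day": "D3", "PlayTennis": "No"},
    {"Outlook": "Rain",  "Day": "D4", "PlayTennis": "Yes"},
    {"Outlook": "Rain",  "Day": "D5", "PlayTennis": "Yes"},
    {"Outlook": "Rain",  "Day": "D6", "PlayTennis": "Yes"},
]
print(gain_ratio(examples, "Outlook", "PlayTennis"))  # 1.0
print(gain_ratio(examples, "Day", "PlayTennis"))      # ~0.39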


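The reduced-error-pruning slide earlier in this preview describes the procedure only in prose. The Python sketch below shows one possible implementation, not taken from the slides: it assumes a dict-based tree representation in which every internal node records the majority training classification as "default", it makes a single bottom-up pass rather than the iterative greedy variant, and the small tree and validation set are made up for illustration.

def classify(tree, example):
    # A leaf is a class label (str); an internal node is a dict holding the
    # test attribute, its branches, and the majority ("default") class.
    while isinstance(tree, dict):
        tree = tree["branches"].get(example.get(tree["attribute"]), tree["default"])
    return tree

def accuracy(tree, examples, target):
    return sum(classify(tree, ex) == ex[target] for ex in examples) / len(examples)

def reduced_error_prune(node, root, validation, target):
    # Bottom-up: collapse a node to a leaf holding its majority class whenever
    # doing so does not hurt accuracy on the validation data.
    if not isinstance(node, dict):
        return node
    for value, child in list(node["branches"].items()):
        node["branches"][value] = reduced_error_prune(child, root, validation, target)
    kept = accuracy(root, validation, target)
    saved = node["branches"]
    node["branches"] = {}        # temporarily behave like a leaf predicting node["default"]
    pruned = accuracy(root, validation, target)
    node["branches"] = saved     # restore the subtree
    return node["default"] if pruned >= kept else node

# Hypothetical already-trained tree; "default" is the majority class at each node.
tree = {
    "attribute": "Outlook", "default": "Yes",
    "branches": {
        "Sunny": {"attribute": "Humidity", "default": "No",
                  "branches": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain": "Yes",
    },
}
validation = [
    {"Outlook": "Sunny", "Humidity": "High",   "PlayTennis": "No"},
    {"Outlook": "Sunny", "Humidity": "Normal", "PlayTennis": "No"},
    {"Outlook": "Rain",  "Humidity": "High",   "PlayTennis": "Yes"},
]
tree = reduced_error_prune(tree, tree, validation, "PlayTennis")
print(tree["branches"]["Sunny"])                 # 'No': the Humidity subtree was collapsed
print(accuracy(tree, validation, "PlayTennis"))  # 1.0

On this toy validation set, collapsing the Humidity subtree raises validation accuracy from 2/3 to 3/3, so it is replaced by the leaf 'No'; collapsing the root test would drop accuracy, so the root is kept.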