Decision Trees
The University of Texas at Dallas

Choosing the best Attribute?
•Fundamental principle underlying tree creation
–Simplicity
–Occam's Razor: the simplest model that explains the data should be preferred
•Each node divides the data into subsets
–Make each subset as pure as possible

A better heuristic from Information Theory
•High, Low Entropy
•"High Entropy"
–X is from a uniform-like distribution
–Flat histogram
–Values sampled from it are less predictable
•"Low Entropy"
–X is from a varied distribution (peaks and valleys)
–Histogram has many lows and highs
–Values sampled from it are more predictable

Information Gain: Using Entropy to make decisions
•Read the Wikipedia articles on Entropy and Mutual Information

When do I play tennis?

Decision Tree

Is the decision tree correct?
•Let's check whether the split on the Wind attribute is correct.
•We need to show that the Wind attribute has the highest information gain.

Wind attribute – 5 records match
•Note: calculate the entropy only on the examples that were "routed" to our branch of the tree (Outlook = Rain).

Calculation
•S = {D4, D5, D6, D10, D14}
•Entropy: H(S) = –(3/5) log2(3/5) – (2/5) log2(2/5) = 0.971
•Information Gain:
–IG(S, Temp) = H(S) – H(S|Temp) = 0.01997
–IG(S, Humidity) = H(S) – H(S|Humidity) = 0.01997
–IG(S, Wind) = H(S) – H(S|Wind) = 0.971
•Wind has the highest gain, so the split on Wind is correct.

Decision Trees: Hypothesis Spaces and Search methods recap
•We search a variable-sized hypothesis space
–We start with an empty tree and grow it as we build it
•The space is complete: all target concepts are included in it
•Local search: no backtracking
•Batch: at each step, we use all training examples to make a statistically based decision

A deeper look at Gain Ratio
•What is the Split Information of Date?
–With a data set of size n in which every example has a different date, SplitInformation = log2(n), which is large
•What is the Split Information of an attribute "Weather = Snowing" in Texas?
–It snows one day and is sunny or overcast on the others, so SplitInformation is close to 0 and the Gain Ratio can blow up
•Heuristic
–First compute Gain
–Apply Gain Ratio only to attributes with above-average Gain

Sources of Overfitting
•Noise
•Small number of examples associated with each leaf
–What if only one example is associated with a leaf? Can you believe it?
–Coincidental regularities
•Generalization is the most important criterion
–Your method should work well on examples it has not seen before

Avoiding Overfitting
•Two approaches
–Stop growing the tree when a data split is not statistically significant
–Grow the tree fully, then post-prune
•Key issue: what is the correct tree size?
–Divide the data into a training set and a validation set (random noise in the two sets might be different)
–Apply a statistical test to estimate whether expanding a particular node is likely to produce an improvement beyond the training set
–Add a complexity penalty
•Leaf nodes added because of coincidental regularities are likely to be pruned, because the same coincidences are unlikely to occur in the validation set

Handling Missing Values
•Missing values: some attribute values in an example are missing
–Example: patient data. You don't expect blood test results for everyone.
•Treat the missing value as just another value
•Ignore instances having missing values
–Problematic because you are throwing away important data.
–Data is scarce.

Handling Missing Values
•Probabilistic approach
–Assign a probability to each possible value of A
–For example, assume P(A(x) = 1) = 0.4 and P(A(x) = 0) = 0.6
–A fractional 0.4 of the instance goes down the A(x) = 1 branch and 0.6 down the A(x) = 0 branch
–Use the fractional instances to compute the gain
•Classification
–Output the most probable classification

Handling Continuous attributes
•Thresholding
•How to select thresholds?
•Pick the threshold that has the highest gain!
•Sort the examples and calculate the gain at all points where the classification changes from "yes" to "no" or from "no" to "yes"
–This provably maximizes the information gain

  Value: 40  48  60  72  80  90
  Class: no  no  yes yes yes no

Summary: Decision Trees
•Representation
•Tree growth
•Choosing the best attribute
•Overfitting and pruning
•Special cases: missing attributes and continuous attributes
•Many forms in practice: CART, ID3, C4.5
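The numbers on the Calculation slide can be checked with a short script. This is a sketch, not part of the original slides: the subset S is the five Outlook = Rain examples from the standard PlayTennis data, and the dictionary keys (`temp`, `humidity`, `wind`, `play`) are my own encoding.

```python
from math import log2

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

def info_gain(examples, attr, label="play"):
    """IG(S, attr) = H(S) - sum over values v of |S_v|/|S| * H(S_v)."""
    n = len(examples)
    h_s = entropy([e[label] for e in examples])
    remainder = 0.0
    for v in set(e[attr] for e in examples):
        subset = [e[label] for e in examples if e[attr] == v]
        remainder += len(subset) / n * entropy(subset)
    return h_s - remainder

# The five Outlook = Rain examples: S = {D4, D5, D6, D10, D14}.
S = [
    {"temp": "Mild", "humidity": "High",   "wind": "Weak",   "play": "Yes"},  # D4
    {"temp": "Cool", "humidity": "Normal", "wind": "Weak",   "play": "Yes"},  # D5
    {"temp": "Cool", "humidity": "Normal", "wind": "Strong", "play": "No"},   # D6
    {"temp": "Mild", "humidity": "Normal", "wind": "Weak",   "play": "Yes"},  # D10
    {"temp": "Mild", "humidity": "High",   "wind": "Strong", "play": "No"},   # D14
]

print(round(entropy([e["play"] for e in S]), 3))  # 0.971
print(round(info_gain(S, "temp"), 5))             # 0.01997
print(round(info_gain(S, "humidity"), 5))         # 0.01997
print(round(info_gain(S, "wind"), 3))             # 0.971
```

Wind splits S into Weak = {Yes, Yes, Yes} and Strong = {No, No}, both pure, so its conditional entropy is 0 and its gain equals H(S), matching the slide.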
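The threshold-selection rule for continuous attributes can be sketched as follows, using the six-example sequence from the continuous-attributes slide. This is an illustrative sketch: the function name `best_threshold` and the midpoint convention for candidate thresholds are my own choices, not from the slides.

```python
from math import log2

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

def best_threshold(values, labels):
    """Return (threshold, gain), trying only midpoints where the class changes."""
    pairs = sorted(zip(values, labels))
    h = entropy([c for _, c in pairs])
    n = len(pairs)
    best = (None, -1.0)
    for i in range(1, n):
        if pairs[i - 1][1] != pairs[i][1]:           # class flips between neighbors
            t = (pairs[i - 1][0] + pairs[i][0]) / 2  # candidate threshold (midpoint)
            left = [c for v, c in pairs if v <= t]
            right = [c for v, c in pairs if v > t]
            gain = (h - (len(left) / n) * entropy(left)
                      - (len(right) / n) * entropy(right))
            if gain > best[1]:
                best = (t, gain)
    return best

t, g = best_threshold([40, 48, 60, 72, 80, 90],
                      ["no", "no", "yes", "yes", "yes", "no"])
print(t, round(g, 3))  # 54.0 0.459
```

Only two candidate thresholds exist here (between 48 and 60, and between 80 and 90); the first wins with a gain of about 0.459 against an overall entropy of 1.0.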