Rutgers University CS 536 - Decision Trees

CS 536: Machine Learning
Decision Trees
Fall 2005
Ahmed Elgammal
Dept of Computer Science, Rutgers University

• Decision tree: a hierarchical model for supervised learning in which a local region of the input space is identified through a sequence of recursive splits.

• Internal decision nodes: each node m implements a test function fm(x) with discrete outcomes labeling the branches. Given an input, the test is applied and one of the branches is taken depending on the outcome.
• Terminal leaves: the output is a class code (for classification) or a numeric value (for regression).
• Each fm(x) defines a discriminant in the d-dimensional input space, dividing it into smaller regions that are further subdivided as we take a path from the root down.
• Each terminal leaf defines a localized region of the input space; instances falling in that region receive the same label.

Training Examples for the target concept PlayTennis

Day  Outlook   Temp  Hum.  Wind    PlayTennis
D1   Sunny     Hot   High  Weak    No
D2   Sunny     Hot   High  Strong  No
D3   Overcast  Hot   High  Weak    Yes
D4   Rain      Mild  High  Weak    Yes
D5   Rain      Cool  Nml   Weak    Yes
D6   Rain      Cool  Nml   Strong  No
D7   Overcast  Cool  Nml   Strong  Yes
D8   Sunny     Mild  High  Weak    No
D9   Sunny     Cool  Nml   Weak    Yes
D10  Rain      Mild  Nml   Weak    Yes
D11  Sunny     Mild  Nml   Strong  Yes
D12  Overcast  Mild  High  Strong  Yes
D13  Overcast  Hot   Nml   Weak    Yes
D14  Rain      Mild  High  Strong  No

• A decision tree represents a disjunction of conjunctions of constraints on the attribute values.
• Each path from the root to a leaf corresponds to a conjunction of attribute tests.
• The tree itself is a disjunction of these conjunctions:
  (Outlook = Sunny ∧ Humidity = Normal)
  ∨ (Outlook = Overcast)
  ∨ (Outlook = Rain ∧ Wind = Weak)

Different kinds of decision trees:

Univariate tree:
• One attribute is tested at a time.
• Each decision node defines an axis-parallel hyperplane.
• Each leaf defines a hyper-rectangular decision region.

Multivariate decision tree:
• All attributes are used at each decision node.
• Each decision node defines a hyperplane.
• Each leaf defines a polyhedral decision region.

Whence Decision Trees?
Consider: discrete univariate classification trees.
• Instances describable by attribute-value pairs
• Target function is discrete valued
• Disjunctive hypothesis may be required
• Possibly noisy training data or missing attribute values
Examples:
• Equipment or medical diagnosis
• Credit risk analysis
• Many successful applications that outperform human experts

Evolution of Decision Trees:
• CLS (Concept Learning System), Earl Hunt, 1960s
• ID3 (Iterative Dichotomiser 3), Quinlan, 1970s and 1980s
• C4.5, Quinlan, 1990s

Top-Down Induction
Main loop:
1. A ← the "best" decision attribute for the next node
2. Assign A as the decision attribute for the node
3. For each value of A, create a new descendant of the node
4. Sort the training examples to the leaf nodes
5. If the training examples are perfectly classified, then STOP; else iterate over the new leaf nodes

Which Attribute is Best?
What is a good quantitative measure of the worth of an attribute?

Measuring Entropy
• S is a sample of training examples
• p⊕ is the proportion of positive examples in S
• p⊖ is the proportion of negative examples in S
• Entropy measures the impurity of S:
  Entropy(S) = -p⊕ log2 p⊕ - p⊖ log2 p⊖

Entropy Function
[Figure: plot of the entropy function.]

Entropy
Entropy(S) = the expected number of bits needed to encode the class (⊕ or ⊖) of a randomly drawn member of S (under the optimal, shortest-length code).
Why? Information theory: an optimal-length code assigns -log2 p bits to a message having probability p.
So the expected number of bits needed to encode ⊕ or ⊖ for a random member of S is:
  p⊕(-log2 p⊕) + p⊖(-log2 p⊖)

Information Gain
Gain(S, A) = the expected reduction in entropy due to sorting S on A:
  Gain(S, A) ≡ Entropy(S) - Σ_{v ∈ Values(A)} (|Sv| / |S|) Entropy(Sv)
Here Sv is the subset of S containing those training instances for which attribute A has value v.

Selecting the Next Attribute
Which attribute is the best classifier?
  Gain(S, Humidity) = 0.940 - (7/14)(0.985) - (7/14)(0.592) = 0.151
  Gain(S, Wind)     = 0.940 - (8/14)(0.811) - (6/14)(1.000) = 0.048

Comparing Attributes
Ssunny = {D1, D2, D8, D9, D11}
• Gain(Ssunny, Humidity) = 0.970 - (3/5)(0.0) - (2/5)(0.0) = 0.970
• Gain(Ssunny, Temp)     = 0.970 - (2/5)(0.0) - (2/5)(1.0) - (1/5)(0.0) = 0.570
• Gain(Ssunny, Wind)     = 0.970 - (2/5)(1.0) - (3/5)(0.918) = 0.019
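The gain figures above can be reproduced directly from the PlayTennis table. The following short Python sketch is my own illustration, not part of the original slides; the names DATA, ATTRS, entropy, and gain are mine. It computes Entropy(S) and Gain(S, A) exactly as defined on the preceding slides.

from collections import Counter
from math import log2

# PlayTennis training set from the table above:
# (Outlook, Temp, Humidity, Wind, PlayTennis)
DATA = [
    ("Sunny",    "Hot",  "High", "Weak",   "No"),
    ("Sunny",    "Hot",  "High", "Strong", "No"),
    ("Overcast", "Hot",  "High", "Weak",   "Yes"),
    ("Rain",     "Mild", "High", "Weak",   "Yes"),
    ("Rain",     "Cool", "Nml",  "Weak",   "Yes"),
    ("Rain",     "Cool", "Nml",  "Strong", "No"),
    ("Overcast", "Cool", "Nml",  "Strong", "Yes"),
    ("Sunny",    "Mild", "High", "Weak",   "No"),
    ("Sunny",    "Cool", "Nml",  "Weak",   "Yes"),
    ("Rain",     "Mild", "Nml",  "Weak",   "Yes"),
    ("Sunny",    "Mild", "Nml",  "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot",  "Nml",  "Weak",   "Yes"),
    ("Rain",     "Mild", "High", "Strong", "No"),
]
ATTRS = {"Outlook": 0, "Temp": 1, "Humidity": 2, "Wind": 3}

def entropy(examples):
    """Entropy(S) = -sum over classes c of p_c * log2(p_c)."""
    counts = Counter(ex[-1] for ex in examples)
    total = len(examples)
    return -sum((n / total) * log2(n / total) for n in counts.values())

def gain(examples, attr):
    """Gain(S, A) = Entropy(S) - sum over v of |Sv|/|S| * Entropy(Sv)."""
    i = ATTRS[attr]
    remainder = 0.0
    for v in set(ex[i] for ex in examples):
        sv = [ex for ex in examples if ex[i] == v]
        remainder += len(sv) / len(examples) * entropy(sv)
    return entropy(examples) - remainder

print(f"Entropy(S)        = {entropy(DATA):.3f}")            # 0.940
print(f"Gain(S, Humidity) = {gain(DATA, 'Humidity'):.3f}")   # 0.152
print(f"Gain(S, Wind)     = {gain(DATA, 'Wind'):.3f}")       # 0.048

s_sunny = [ex for ex in DATA if ex[0] == "Sunny"]            # {D1, D2, D8, D9, D11}
print(f"Gain(Ssunny, Humidity) = {gain(s_sunny, 'Humidity'):.3f}")  # 0.971
print(f"Gain(Ssunny, Temp)     = {gain(s_sunny, 'Temp'):.3f}")      # 0.571
print(f"Gain(Ssunny, Wind)     = {gain(s_sunny, 'Wind'):.3f}")      # 0.020

The printed values agree with the slide figures up to rounding; the slides round the intermediate entropies (0.985, 0.592, 0.811, 0.918) before combining them.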
What is ID3 Optimizing?
The hypothesis space searched by ID3 is the set of possible decision trees. ID3 performs a simple-to-complex hill-climbing (greedy) search guided by the information gain measure.
How would you find a tree that minimizes:
• the number of misclassified examples?
• the expected entropy?
• the expected number of tests?
• the depth of the tree, given a fixed accuracy?
• etc.?
How do you decide whether one tree beats another?

Hypothesis Space Search by ID3
• Representation: trees
• Scoring: entropy
• Search: greedy

Hypothesis Space Search by ID3
• The hypothesis space is complete!
  – The target function is surely in there...
• Outputs a single hypothesis (which one?)
  – Cannot generate all consistent hypotheses... a single concept.
• No backtracking
  – Local minima... not globally optimal.
• Statistically based search choices
  – Robust to noisy data...
• Inductive bias ≈ "prefer the shortest tree"

Inductive Bias in ID3
Note that H is the power set of instances X.
• Unbiased? Not really...
• Preference for short trees, and for those with high-information-gain attributes near the root.
• Bias is a preference for some hypotheses, rather than a restriction of the hypothesis space.
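To connect this greedy, entropy-scored search back to the top-down induction loop, here is a minimal recursive ID3 sketch. It continues the Python example given earlier (reusing DATA, ATTRS, Counter, and gain from that sketch; id3 and majority_label are my own names), handles only discrete attributes, and does no pruning or handling of missing values.

def majority_label(examples):
    """Most common class label among the examples (used when no attributes remain)."""
    return Counter(ex[-1] for ex in examples).most_common(1)[0][0]

def id3(examples, attrs):
    """Greedy top-down induction: recursively split on the attribute
    with the highest information gain (simple-to-complex hill climbing)."""
    labels = set(ex[-1] for ex in examples)
    if len(labels) == 1:        # training examples perfectly classified: make a leaf
        return labels.pop()
    if not attrs:               # no tests left: fall back to the majority label
        return majority_label(examples)
    best = max(attrs, key=lambda a: gain(examples, a))
    i = ATTRS[best]
    subtrees = {}
    for v in set(ex[i] for ex in examples):   # one branch per observed value of 'best'
        sv = [ex for ex in examples if ex[i] == v]
        subtrees[v] = id3(sv, [a for a in attrs if a != best])
    return (best, subtrees)

tree = id3(DATA, list(ATTRS))
# tree == ('Outlook', {'Sunny':    ('Humidity', {'High': 'No', 'Nml': 'Yes'}),
#                      'Overcast': 'Yes',
#                      'Rain':     ('Wind', {'Weak': 'Yes', 'Strong': 'No'})})

Run on the PlayTennis table, this sketch selects Outlook at the root and produces the tree corresponding to the disjunction (Outlook = Sunny ∧ Humidity = Normal) ∨ (Outlook = Overcast) ∨ (Outlook = Rain ∧ Wind = Weak) shown earlier.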

