CMU CS 10701 - Lecture9


Decision Trees
Aarti Singh
Machine Learning 10-701/15-781
Oct 6, 2010

Learning a good prediction rule
• Learn a mapping f : X → Y
• Best prediction rule
• Hypothesis space / function class
  – Parametric classes (Gaussian, binomial, etc.)
  – Conditionally independent class densities (Naïve Bayes)
  – Linear decision boundary (logistic regression)
  – Nonparametric classes (histograms, nearest neighbor, kernel estimators, decision trees – today)
• Given training data, find a hypothesis/function in the class that is close to the best prediction rule.

First …
• What does a decision tree represent?
• Given a decision tree, how do we assign a label to a test point?

Decision Tree for Tax Fraud Detection
The tree tests Refund at the root: the Yes branch leads to a NO leaf, and the No branch leads to a MarSt node. At MarSt, the Married branch leads to a NO leaf, and the Single, Divorced branch leads to a TaxInc node. At TaxInc, the < 80K branch leads to a NO leaf and the > 80K branch leads to a YES leaf.
• Each internal node: tests one feature Xi
• Each branch from a node: selects one value for Xi
• Each leaf node: predicts Y

Query data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
Walking the tree: Refund = No, so follow the No branch to MarSt; Marital Status = Married, so follow the Married branch and reach a NO leaf. Assign Cheat to "No".
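To make the traversal concrete, here is a minimal sketch of the same tree written as an if/else classifier. It is not from the slides: the feature names, the predict function, and the treatment of incomes exactly at 80K are illustrative assumptions.

    # Minimal sketch of the tax-fraud tree above as plain if/else tests.
    # Feature names and the predict() signature are assumptions for illustration.
    def predict(refund, marital_status, taxable_income):
        """Return the predicted Cheat label ("Yes" or "No") for one record."""
        if refund == "Yes":                    # Refund node
            return "No"
        if marital_status == "Married":        # MarSt node
            return "No"
        # Single or Divorced: test taxable income
        if taxable_income < 80_000:            # TaxInc node, < 80K branch
            return "No"
        return "Yes"                           # > 80K branch (80K exactly: assumed here)

    # The query from the slides: Refund = No, Married, 80K  ->  prints "No"
    print(predict("No", "Married", 80_000))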
Decision trees more generally
• Features can be discrete, continuous, or categorical.
• Each internal node: tests some set of features {Xi}
• Each branch from a node: selects a set of values for {Xi}
• Each leaf node: predicts Y

So far …
• What does a decision tree represent?
• Given a decision tree, how do we assign a label to a test point?
Now …
• How do we learn a decision tree from training data?
• What is the decision on each leaf?

How to learn a decision tree
• Top-down induction [ID3, C4.5, CART, …]

Which feature is best to split?
Consider eight training examples with two binary features and label Y:

X1  X2  Y
T   T   T
T   F   T
T   T   T
T   F   T
F   T   T
F   F   F
F   T   F
F   F   F

Splitting on X1: the T branch has Y = 4 Ts, 0 Fs (absolutely sure); the F branch has Y = 1 T, 3 Fs (kind of sure).
Splitting on X2: the T branch has Y = 3 Ts, 1 F (kind of sure); the F branch has Y = 2 Ts, 2 Fs (absolutely unsure).
A split is good if we are more certain about the classification after the split; a uniform distribution of labels is bad.

Pick the attribute/feature which yields the maximum information gain:
  arg max_i IG(Xi) = arg max_i [ H(Y) − H(Y|Xi) ]
where H(Y) is the entropy of Y and H(Y|Xi) is the conditional entropy of Y given Xi.

Entropy
• Entropy of a random variable Y: H(Y) = − Σ_y P(Y = y) log2 P(Y = y)
• More uncertainty, more entropy!
• For Y ~ Bernoulli(p), H(Y) = − p log2 p − (1 − p) log2(1 − p): the uniform case p = 1/2 has maximum entropy (1 bit), and the deterministic cases p = 0 or p = 1 have zero entropy.
• Information-theory interpretation: H(Y) is the expected number of bits needed to encode a randomly drawn value of Y (under the most efficient code).

Andrew Moore's "Entropy in a Nutshell"
• High entropy: the values (locations of soup) are unpredictable, almost uniformly sampled throughout our dining room.
• Low entropy: the values (locations of soup) are sampled entirely from within the soup bowl.

Information Gain
• The advantage of an attribute is the decrease in uncertainty:
  – entropy of Y before the split, H(Y)
  – entropy of Y after splitting on Xi, H(Y|Xi), weighting each branch by the probability of following it
• Information gain is the difference: IG(Xi) = H(Y) − H(Y|Xi)
• Maximum information gain = minimum conditional entropy. The feature that yields the maximum reduction in entropy provides the maximum information about Y.

In the example above, splitting on X1 (branches with 4 Ts/0 Fs and 1 T/3 Fs) yields a larger information gain than splitting on X2 (branches with 3 Ts/1 F and 2 Ts/2 Fs), so X1 is the better split; a sketch of this computation follows.
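As a sanity check on the split comparison above, here is a small sketch (mine, not from the slides) that computes the entropies and information gains for the eight-row table; the helper names entropy and information_gain are illustrative choices.

    import math

    def entropy(labels):
        """H(Y) in bits for a list of label values."""
        n = len(labels)
        return -sum((labels.count(v) / n) * math.log2(labels.count(v) / n)
                    for v in set(labels))

    def information_gain(xs, ys):
        """IG(X) = H(Y) - H(Y|X), each branch weighted by its probability."""
        n = len(ys)
        cond = 0.0
        for v in set(xs):
            branch = [y for x, y in zip(xs, ys) if x == v]
            cond += len(branch) / n * entropy(branch)
        return entropy(ys) - cond

    # The eight training examples from the table above.
    X1 = ["T", "T", "T", "T", "F", "F", "F", "F"]
    X2 = ["T", "F", "T", "F", "T", "F", "T", "F"]
    Y  = ["T", "T", "T", "T", "T", "F", "F", "F"]

    print(information_gain(X1, Y))   # about 0.55 bits -> split on X1
    print(information_gain(X2, Y))   # about 0.05 bits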
Expressiveness of Decision Trees
• Decision trees can express any function of the input features.
• E.g., for Boolean functions, each truth-table row corresponds to a path to a leaf.
• There is a decision tree that perfectly classifies any training set, with one path to a leaf for each example.
• But it won't generalize well to new examples; prefer to find more compact decision trees.

Decision Trees - Overfitting
• One training example per leaf overfits; we need a compact/pruned decision tree.

Bias-Variance Tradeoff
• A fine partition (a large tree) has small bias but large variance; a coarse partition (a small tree) has small variance but large bias.
• [Figure: classifiers learned from different training sets, their average classifier, and the ideal classifier.]

When to Stop?
• Many strategies for picking simpler trees:
  – Pre-pruning
    • Fixed depth
    • Fixed number of leaves
  – Post-pruning
    • Chi-square test:
      – Convert the decision tree to a set of rules.
      – Eliminate variable values in rules that are independent of the label (using a chi-square test for independence).
      – Simplify the rule set by eliminating unnecessary rules.
  – Information criteria: MDL (Minimum Description Length)
[Figure: a pruned version of the tax-fraud tree, with the TaxInc subtree removed.]

Information Criteria
• Penalize complex models by introducing a cost: the objective combines a log-likelihood (fit) term, specialized to regression or classification, with a cost term that penalizes trees with more leaves.
• Example: 5 leaves => 9 bits to encode the tree structure.

Information Criteria - MDL
• Penalize complex models based on their information content: MDL (Minimum Description Length).
• Example: a binary decision tree with k leaves has 2k − 1 nodes, so it takes 2k − 1 bits to encode the tree structure plus k bits to encode the 0/1 label of each leaf.
• The number of bits needed to describe f is its description length.
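The description-length bookkeeping above is easy to state in code. The following sketch is mine, not from the slides: it implements the 2k − 1 structure bits plus k leaf-label bits from the example, and the extra per-error data cost in mdl_score is an assumed illustration of how the criterion can trade fit against tree size.

    def description_length_bits(num_leaves):
        """Bits to encode a binary tree with k leaves under the slides' scheme:
        (2k - 1) bits for the structure + k bits for the 0/1 leaf labels."""
        return (2 * num_leaves - 1) + num_leaves

    print(description_length_bits(5))   # 9 structure bits + 5 label bits = 14

    def mdl_score(num_leaves, num_training_errors, bits_per_error=1):
        """Illustrative MDL-style score: model bits plus an assumed cost in bits
        for each training example the tree misclassifies."""
        return description_length_bits(num_leaves) + bits_per_error * num_training_errors

    # Prefer the tree with the smaller total, e.g. a 5-leaf tree with 2 errors (16 bits)
    # over an 8-leaf tree with 0 errors (23 bits).
    print(mdl_score(5, 2), mdl_score(8, 0))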

