CMU CS 10701 - Lecture9


Decision Trees
Aarti Singh
Machine Learning 10-701/15-781
Oct 6, 2010

Learning a good prediction rule
• Learn a mapping f : X → Y
• Best prediction rule
• Hypothesis space / function class
  – Parametric classes (Gaussian, binomial, etc.)
  – Conditionally independent class densities (Naïve Bayes)
  – Linear decision boundary (logistic regression)
  – Nonparametric classes (histograms, nearest neighbor, kernel estimators, decision trees – today)
• Given training data, find a hypothesis/function in the class that is close to the best prediction rule.

First …
• What does a decision tree represent?
• Given a decision tree, how do we assign a label to a test point?

Decision Tree for Tax Fraud Detection
The tree tests Refund at the root: the Yes branch leads to a NO leaf, and the No branch leads to a MarSt node. At MarSt, the Married branch leads to a NO leaf, and the Single, Divorced branch leads to a TaxInc node. At TaxInc, the < 80K branch leads to a NO leaf and the > 80K branch leads to a YES leaf.
• Each internal node: tests one feature Xi
• Each branch from a node: selects one value for Xi
• Each leaf node: predicts Y

Query data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
Walking the tree: Refund = No, so follow the No branch to MarSt; Marital Status = Married, so follow the Married branch and reach a NO leaf. Assign Cheat to "No".
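To make the traversal concrete, here is a minimal sketch of the same tree written as an if/else classifier. It is not from the slides: the feature names, the predict function, and the treatment of incomes exactly at 80K are illustrative assumptions.

    # Minimal sketch of the tax-fraud tree above as plain if/else tests.
    # Feature names and the predict() signature are assumptions for illustration.
    def predict(refund, marital_status, taxable_income):
        """Return the predicted Cheat label ("Yes" or "No") for one record."""
        if refund == "Yes":                    # Refund node
            return "No"
        if marital_status == "Married":        # MarSt node
            return "No"
        # Single or Divorced: test taxable income
        if taxable_income < 80_000:            # TaxInc node, < 80K branch
            return "No"
        return "Yes"                           # > 80K branch (80K exactly: assumed here)

    # The query from the slides: Refund = No, Married, 80K  ->  prints "No"
    print(predict("No", "Married", 80_000))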
Decision trees more generally
• Features can be discrete, continuous, or categorical.
• Each internal node: tests some set of features {Xi}
• Each branch from a node: selects a set of values for {Xi}
• Each leaf node: predicts Y

So far …
• What does a decision tree represent?
• Given a decision tree, how do we assign a label to a test point?
Now …
• How do we learn a decision tree from training data?
• What is the decision on each leaf?

How to learn a decision tree
• Top-down induction [ID3, C4.5, CART, …]

Which feature is best to split?
Consider eight training examples with two binary features and label Y:

X1  X2  Y
T   T   T
T   F   T
T   T   T
T   F   T
F   T   T
F   F   F
F   T   F
F   F   F

Splitting on X1: the T branch has Y = 4 Ts, 0 Fs (absolutely sure); the F branch has Y = 1 T, 3 Fs (kind of sure).
Splitting on X2: the T branch has Y = 3 Ts, 1 F (kind of sure); the F branch has Y = 2 Ts, 2 Fs (absolutely unsure).
A split is good if we are more certain about the classification after the split; a uniform distribution of labels is bad.

Pick the attribute/feature which yields the maximum information gain:
  arg max_i IG(Xi) = arg max_i [ H(Y) − H(Y|Xi) ]
where H(Y) is the entropy of Y and H(Y|Xi) is the conditional entropy of Y given Xi.

Entropy
• Entropy of a random variable Y: H(Y) = − Σ_y P(Y = y) log2 P(Y = y)
• More uncertainty, more entropy!
• For Y ~ Bernoulli(p), H(Y) = − p log2 p − (1 − p) log2(1 − p): the uniform case p = 1/2 has maximum entropy (1 bit), and the deterministic cases p = 0 or p = 1 have zero entropy.
• Information-theory interpretation: H(Y) is the expected number of bits needed to encode a randomly drawn value of Y (under the most efficient code).

Andrew Moore's "Entropy in a Nutshell"
• High entropy: the values (locations of soup) are unpredictable, almost uniformly sampled throughout our dining room.
• Low entropy: the values (locations of soup) are sampled entirely from within the soup bowl.

Information Gain
• The advantage of an attribute is the decrease in uncertainty:
  – entropy of Y before the split, H(Y)
  – entropy of Y after splitting on Xi, H(Y|Xi), weighting each branch by the probability of following it
• Information gain is the difference: IG(Xi) = H(Y) − H(Y|Xi)
• Maximum information gain = minimum conditional entropy. The feature that yields the maximum reduction in entropy provides the maximum information about Y.

In the example above, splitting on X1 (branches with 4 Ts/0 Fs and 1 T/3 Fs) yields a larger information gain than splitting on X2 (branches with 3 Ts/1 F and 2 Ts/2 Fs), so X1 is the better split; a sketch of this computation follows.
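As a sanity check on the split comparison above, here is a small sketch (mine, not from the slides) that computes the entropies and information gains for the eight-row table; the helper names entropy and information_gain are illustrative choices.

    import math

    def entropy(labels):
        """H(Y) in bits for a list of label values."""
        n = len(labels)
        return -sum((labels.count(v) / n) * math.log2(labels.count(v) / n)
                    for v in set(labels))

    def information_gain(xs, ys):
        """IG(X) = H(Y) - H(Y|X), each branch weighted by its probability."""
        n = len(ys)
        cond = 0.0
        for v in set(xs):
            branch = [y for x, y in zip(xs, ys) if x == v]
            cond += len(branch) / n * entropy(branch)
        return entropy(ys) - cond

    # The eight training examples from the table above.
    X1 = ["T", "T", "T", "T", "F", "F", "F", "F"]
    X2 = ["T", "F", "T", "F", "T", "F", "T", "F"]
    Y  = ["T", "T", "T", "T", "T", "F", "F", "F"]

    print(information_gain(X1, Y))   # about 0.55 bits -> split on X1
    print(information_gain(X2, Y))   # about 0.05 bits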
Expressiveness of Decision Trees
• Decision trees can express any function of the input features.
• E.g., for Boolean functions, each truth-table row corresponds to a path to a leaf.
• There is a decision tree that perfectly classifies any training set, with one path to a leaf for each example.
• But it won't generalize well to new examples; prefer to find more compact decision trees.

Decision Trees - Overfitting
• One training example per leaf overfits; we need a compact/pruned decision tree.

Bias-Variance Tradeoff
• A fine partition (a large tree) has small bias but large variance; a coarse partition (a small tree) has small variance but large bias.
• [Figure: classifiers learned from different training sets, their average classifier, and the ideal classifier.]

When to Stop?
• Many strategies for picking simpler trees:
  – Pre-pruning
    • Fixed depth
    • Fixed number of leaves
  – Post-pruning
    • Chi-square test:
      – Convert the decision tree to a set of rules.
      – Eliminate variable values in rules that are independent of the label (using a chi-square test for independence).
      – Simplify the rule set by eliminating unnecessary rules.
  – Information criteria: MDL (Minimum Description Length)
[Figure: a pruned version of the tax-fraud tree, with the TaxInc subtree removed.]

Information Criteria
• Penalize complex models by introducing a cost: the objective combines a log-likelihood (fit) term, specialized to regression or classification, with a cost term that penalizes trees with more leaves.
• Example: 5 leaves => 9 bits to encode the tree structure.

Information Criteria - MDL
• Penalize complex models based on their information content: MDL (Minimum Description Length).
• Example: a binary decision tree with k leaves has 2k − 1 nodes, so it takes 2k − 1 bits to encode the tree structure plus k bits to encode the 0/1 label of each leaf.
• The number of bits needed to describe f is its description length.
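The description-length bookkeeping above is easy to state in code. The following sketch is mine, not from the slides: it implements the 2k − 1 structure bits plus k leaf-label bits from the example, and the extra per-error data cost in mdl_score is an assumed illustration of how the criterion can trade fit against tree size.

    def description_length_bits(num_leaves):
        """Bits to encode a binary tree with k leaves under the slides' scheme:
        (2k - 1) bits for the structure + k bits for the 0/1 leaf labels."""
        return (2 * num_leaves - 1) + num_leaves

    print(description_length_bits(5))   # 9 structure bits + 5 label bits = 14

    def mdl_score(num_leaves, num_training_errors, bits_per_error=1):
        """Illustrative MDL-style score: model bits plus an assumed cost in bits
        for each training example the tree misclassifies."""
        return description_length_bits(num_leaves) + bits_per_error * num_training_errors

    # Prefer the tree with the smaller total, e.g. a 5-leaf tree with 2 errors (16 bits)
    # over an 8-leaf tree with 0 errors (23 bits).
    print(mdl_score(5, 2), mdl_score(8, 0))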

