Machine Learning 10601
Recitation 8, Oct 21, 2009
Oznur Tastan

Outline
• Tree representation
• Brief information theory
• Learning decision trees
• Bagging
• Random forests

Decision trees
• Non-linear classifier
• Easy to use
• Easy to interpret
• Susceptible to overfitting, but this can be avoided

Anatomy of a decision tree
[Tree diagram: the root tests Outlook (sunny / overcast / rain). The sunny branch tests Humidity (high -> No, normal -> Yes); the overcast branch is a Yes leaf; the rain branch tests Windy (false -> Yes, true -> No).]
• Each internal node is a test on one attribute
• The branches are the possible values of that attribute
• The leaves are the decisions
• Sample size: your data gets smaller as you descend the tree

To 'play tennis' or not
A new test example: (Outlook == rain) and (not Windy).
Pass it down the tree -> the decision is Yes.

The tree predicts 'yes' in exactly these cases:
(Outlook == overcast) -> yes
(Outlook == rain) and (not Windy) -> yes
(Outlook == sunny) and (Humidity == normal) -> yes
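As a quick illustration, the tree above can be written as one nested conditional per internal node. This is a minimal sketch: the function name `play_tennis` and the string encodings of the attribute values are my own choices, not from the slides.

```python
def play_tennis(outlook, humidity, windy):
    """Decision tree from the slide: each 'if' corresponds to one node test."""
    if outlook == "overcast":
        return "Yes"
    if outlook == "sunny":
        return "Yes" if humidity == "normal" else "No"
    if outlook == "rain":
        return "Yes" if not windy else "No"
    raise ValueError(f"unknown outlook: {outlook!r}")

# The test example from the slide: rain and not windy -> Yes.
print(play_tennis("rain", "high", False))  # Yes
```

Tracing the example input through the conditionals follows the same path as passing it down the drawn tree.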
Decision trees
Decision trees represent a disjunction of conjunctions of constraints on the attribute values of instances:
(Outlook == overcast)
OR ((Outlook == rain) and (not Windy))
OR ((Outlook == sunny) and (Humidity == normal))
=> yes, play tennis

Representation
[Tree diagram: the root tests A (0 -> test C, 1 -> test B); C: 0 -> false, 1 -> true; B: 0 -> false, 1 -> true.]
Y = (A and B) or ((not A) and C)

Same concept, different representation
[Tree diagram: an equivalent tree for the same function, this time splitting on C at the root and then on A and B.]
Y = (A and B) or ((not A) and C)

Which attribute to select for splitting?
[Figure: a root with 16+ / 16- examples split into children 8+ / 8-, then 4+ / 4-, then 2+ / 2-.]
The counts show the distribution of each class (not of the attribute) at each node. This is bad splitting: every child keeps the same balanced class mix as its parent.

How do we choose the test?
Which attribute should be used as the test?
Intuitively, you would prefer the one that separates the training examples as much as possible.

Information gain
Information gain is one criterion for deciding on the attribute.

Information
Imagine:
1. Someone is about to tell you your own name.
2. You are about to observe the outcome of a die roll.
3. You are about to observe the outcome of a coin flip.
4. You are about to observe the outcome of a biased coin flip.
Each situation has a different amount of uncertainty as to which outcome you will observe.

Information
Information: reduction in uncertainty (the amount of surprise in the outcome).
I(E) = log2(1/p(x)) = -log2 p(x)
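In code, the information content of an event is just the negative base-2 logarithm of its probability. A minimal sketch (the helper name `surprisal` is mine, not from the slides):

```python
import math

def surprisal(p):
    """Information (in bits) gained by observing an event of probability p."""
    return -math.log2(p)

print(surprisal(1 / 2))            # 1.0 bit: a fair coin lands heads
print(round(surprisal(1 / 6), 2))  # 2.58 bits: a fair die comes up 6
```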
• Observing that a coin flip lands heads: I = -log2(1/2) = 1 bit
• Observing that a die roll comes up 6: I = -log2(1/6) = 2.58 bits
If the probability of an event is small and it nevertheless happens, the information is large.

Entropy
The expected amount of information when observing the output of a random variable X:
H(X) = E[I(X)] = Σ_i p(x_i) I(x_i) = -Σ_i p(x_i) log2 p(x_i)
If X can take 8 outcomes and all are equally likely:
H(X) = -Σ_i (1/8) log2(1/8) = 3 bits

Entropy
If there are k possible outcomes:
H(X) <= log2 k
Equality holds when all outcomes are equally likely.
The more the probability distribution deviates from uniformity, the lower the entropy.

Entropy, purity
Entropy measures the purity of a node.
[Figure: one node with 4+ / 4- examples, another with 8+ / 0-.]
In the 8+ / 0- node the distribution is less uniform, the entropy is lower, and the node is purer.

Conditional entropy
H(X) = -Σ_i p(x_i) log2 p(x_i)
H(X|Y) = Σ_j p(y_j) H(X | Y = y_j)
       = -Σ_j p(y_j) Σ_i p(x_i | y_j) log2 p(x_i | y_j)

Information gain
IG(X,Y) = H(X) - H(X|Y)
Reduction in the uncertainty about X from knowing Y.
Information gain: (entropy before the split) - (entropy after the split)

Example
Attributes X1 and X2, label Y:

X1  X2  Y  Count
T   T   +  2
T   F   +  2
F   T   -  5
F   F   +  1

IG(X1,Y) = H(Y) - H(Y|X1)
H(Y) = -(5/10) log2(5/10) - (5/10) log2(5/10) = 1
H(Y|X1) = P(X1=T) H(Y|X1=T) + P(X1=F) H(Y|X1=F)
        = (4/10)(-1 log2 1 - 0 log2 0) + (6/10)(-(5/6) log2(5/6) - (1/6) log2(1/6))
        = 0.39
IG(X1,Y) = 1 - 0.39 = 0.61
Which one do we choose, X1 or X2?

Which one do we choose?
IG(X1,Y) = 0.61
IG(X2,Y) = 0.40
Pick X1: pick the variable that provides the most information gain about Y.

Recurse on branches
Split the data on X1 and repeat the procedure on each branch (the X1 = T rows go down one branch, the X1 = F rows down the other).

Caveats
• The number of possible values influences the information gain.
• The more possible values, the higher the gain (the more likely it is to form small but pure partitions).

Purity (diversity) measures
– Gini (population diversity)
– Information gain
– Chi-square test
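The worked information-gain example above can be reproduced with a short script. This is a sketch, not the recitation's code; `entropy` and `information_gain` are my own helper names.

```python
import math
from collections import Counter

def entropy(labels):
    """H(Y) = -sum_i p_i log2 p_i over the empirical label distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(xs, ys):
    """IG(X, Y) = H(Y) - H(Y|X) for parallel sequences of values and labels."""
    n = len(ys)
    h_cond = 0.0
    for v in set(xs):
        subset = [y for x, y in zip(xs, ys) if x == v]
        h_cond += (len(subset) / n) * entropy(subset)
    return entropy(ys) - h_cond

# The slide's table, expanded from the Count column.
rows = ([("T", "T", "+")] * 2 + [("T", "F", "+")] * 2 +
        [("F", "T", "-")] * 5 + [("F", "F", "+")] * 1)
x1, x2, y = zip(*rows)
print(round(information_gain(x1, y), 2))  # 0.61, as on the slide
# X1 has the larger gain, so it is picked as the split variable.
```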
Overfitting
• You can perfectly fit any training data.
• Zero bias, high variance.
Two approaches:
1. Stop growing the tree when further splitting the data does not yield an improvement.
2. Grow a full tree, then prune it by eliminating nodes.

Bagging
• Bagging (bootstrap aggregation) is a technique for reducing the variance of an estimated prediction function.
• For classification, a committee of trees each casts a vote for the predicted class.

Bootstrap
The basic idea: randomly draw datasets with replacement from the training data, each sample the same size as the original training set.

Bagging
[Figure: from a training set of N examples and M features, create bootstrap samples.]

Random forest classifier
[Figure: construct a decision tree on each bootstrap sample of the N examples and M features, then take the majority vote over the trees.]

Bagging
Z = {(x1, y1), (x2, y2), ..., (xN, yN)}
Draw bootstrap samples Z*b, b = 1, ..., B.
f*b(x) is the prediction at input x when bootstrap sample b is used for training.
http://www-stat.stanford.edu/~hastie/Papers/ESLII.pdf (Chapter 8.7)

Bagging: a simulated example
[Figure: simulated-data example; the rest of the slide is not preserved in this extract.]
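The bootstrap-and-vote procedure described above can be sketched with the standard library alone. This is an illustrative skeleton, not the recitation's code: `bootstrap_sample` and `majority_vote` are my names, and a real bagger would train a full decision tree on each bootstrap sample before voting.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) examples uniformly WITH replacement (the bootstrap)."""
    return [rng.choice(data) for _ in range(len(data))]

def majority_vote(predictions):
    """Committee decision: the class predicted by the most trees wins."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)
train = list(range(10))  # stand-in for N labeled examples
samples = [bootstrap_sample(train, rng) for _ in range(5)]  # B = 5 samples
print(all(len(s) == len(train) for s in samples))  # True: each sample has size N
print(majority_vote(["+", "-", "+"]))              # +
```

Because sampling is with replacement, each bootstrap sample typically contains repeated examples and omits others, which is what decorrelates the committee's trees.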