15-780: Graduate Artificial Intelligence
Decision trees

Graphical models
• So far we discussed models that capture joint probability distributions
• These have many uses, and can also be used to determine binary values for variables
 - For example, did a burglary occur?
• However, they also require us to make many assumptions and to fit many parameters:
 - model structure
 - probability model
 - model parameters

Classification
• In many cases we are only interested in one specific variable
• Examples:
 - Does the robot have to turn? Slow down?
 - What digit is in each of the squares?
 - Does the patient have cancer?

Generative vs. discriminative models
• Graphical models can be used for classification
 - They represent a subset of classifiers known as 'generative models'
• But we can also design classifiers that are more specific to a given task and do not require an estimate of the joint probabilities
 - These are often referred to as discriminative models
• Examples:
 - Support vector machines (SVMs)
 - Decision trees

Decision trees
• One of the most intuitive classifiers
• Easy to understand and construct
• Surprisingly, also works very (very) well*
  * More on this towards the end of this lecture

Let's build a decision tree!

Structure of a decision tree
[Figure: small example tree with internal nodes A, C, I, F and edges labeled yes/no. Legend: A = age > 26, I = income > 40K, C = citizen, F = female]
• Internal nodes correspond to attributes (features)
• Leaves correspond to classification outcomes
• Edges denote value assignments

Netflix dataset
Attributes (features): Type, Length, Director, Famous actors. Label: Liked?

Movie  Type      Length  Director  Famous actors  Liked?
m1     Comedy    Short   Adamson   No             Yes
m2     Animated  Short   Lasseter  No             No
m3     Drama     Medium  Adamson   No             Yes
m4     Animated  Long    Lasseter  Yes            No
m5     Comedy    Long    Lasseter  Yes            No
m6     Drama     Medium  Singer    Yes            Yes
m7     Animated  Short   Singer    No             Yes
m8     Comedy    Long    Adamson   Yes            Yes
m9     Drama     Medium  Lasseter  No             Yes

Building a decision tree
Function BuildTree(n, A)                     // n: samples (rows), A: attributes
  If empty(A) or all n(L) are the same       // n(L): labels of the samples in this set
    status = leaf
    class = most common class in n(L)
  else
    status = internal
    a ⇐ bestAttribute(n, A)                  // we will discuss this function next
    LeftNode  = BuildTree(n(a=1), A \ {a})   // recursive calls to create the left and right
    RightNode = BuildTree(n(a=0), A \ {a})   // subtrees; n(a=1) is the set of samples in n
  end                                        // for which attribute a is 1
end

Identifying 'bestAttribute'
• There are many possible ways to select the best attribute for a given set
• We will discuss one possible way, which is based on information theory and generalizes well to non-binary variables

Entropy
• Quantifies the amount of uncertainty associated with a specific probability distribution
• The higher the entropy, the less confident we are in the outcome
• Definition: H(X) = −Σ_c p(X = c) log₂ p(X = c)
(Claude Shannon, 1916–2001; most of this work was done at Bell Labs)

Entropy
• Definition: H(X) = −Σ_i p(X = i) log₂ p(X = i)
• So, if P(X=1) = 1 then
  H(X) = −p(X=1) log₂ p(X=1) − p(X=0) log₂ p(X=0) = −1 log₂ 1 − 0 log₂ 0 = 0
• If P(X=1) = 0.5 then
  H(X) = −p(X=1) log₂ p(X=1) − p(X=0) log₂ p(X=0) = −0.5 log₂ 0.5 − 0.5 log₂ 0.5 = 1

Interpreting entropy
• Entropy can be interpreted from an information standpoint
• Assume both sender and receiver know the distribution. How many bits, on average, would it take to transmit one value?
• If P(X=1) = 1 then the answer is 0 (we don't need to transmit anything)
• If P(X=1) = 0.5 then the answer is 1 (either value is equally likely)
• If 0 < P(X=1) < 0.5 or 0.5 < P(X=1) < 1, the answer is between 0 and 1
 - Why?
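As a quick numerical check of the values above (this snippet is not part of the original slides; the helper name binary_entropy is mine), here is a minimal Python sketch of the binary entropy function:

```python
import math

def binary_entropy(p):
    """H(X) = -p*log2(p) - (1-p)*log2(1-p), using the convention 0*log2(0) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(binary_entropy(1.0))  # 0.0  -> nothing needs to be transmitted
print(binary_entropy(0.5))  # 1.0  -> one full bit per value
print(binary_entropy(0.8))  # ~0.7219 -> strictly between 0 and 1
```

The P(X=1) = 0.8 case is the one used in the coding example on the next slides.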
Expected bits per symbol
• Assume P(X=1) = 0.8
• Then, for pairs of values, P(11) = 0.64, P(10) = P(01) = 0.16 and P(00) = 0.04
• Let's define the following code:
 - For 11 we send 0
 - For 10 we send 01
 - For 01 we send 011
 - For 00 we send 0111
• (Note that 0 is a prefix of every other codeword here, so the receiver cannot decode symbol by symbol)

Expected bits per symbol
• Assume P(X=1) = 0.8, so P(11) = 0.64, P(10) = P(01) = 0.16 and P(00) = 0.04
• Let's instead define the following prefix-free code:
 - For 11 we send 0
 - For 10 we send 10
 - For 01 we send 110
 - For 00 we send 1110
• What is the expected number of bits per symbol?
  (0.64*1 + 0.16*2 + 0.16*3 + 0.04*4) / 2 = 0.8
• Entropy (the lower bound): H(X) = 0.7219
• So 01001101110001101110 can be broken into 0 10 0 110 1110 0 0 110 1110, which decodes to 11 10 11 01 00 11 11 01 00

Conditional entropy
• Entropy measures the uncertainty in a specific distribution
• What if both sender and receiver know something about the transmission?
• For example, say I want to send the label (Liked) when the length is known
• This becomes a conditional entropy problem: H(Li | Le = v) is the entropy of Liked among movies with length v

Movie length  Liked?
Short         Yes
Short         No
Medium        Yes
Long          No
Long          No
Medium        Yes
Short         Yes
Long          Yes
Medium        Yes
(the Length and Liked? columns of the Netflix table, movies m1-m9)

Conditional entropy: examples for specific values
• Let's compute H(Li | Le = v):
 1. H(Li | Le = S) = 0.92
 2. H(Li | Le = M) = 0
 3. H(Li | Le = L) = 0.92

Conditional entropy
• We can generalize the conditional entropy idea to determine H(Li | Le)
• That is, what is the expected number of bits we need to transmit if both sides know the value of Le for each of the records (samples)?
• Definition: H(X | Y) = Σ_i P(Y = i) H(X | Y = i)
  (we explained how to compute each H(X | Y = i) in the previous slides)

Conditional entropy: example
• Let's compute H(Li | Le), using the values H(Li | Le = S) = 0.92, H(Li | Le = M) = 0, H(Li | Le = L) = 0.92 from above:
  H(Li | Le) = P(Le = S) H(Li | Le = S) + P(Le = M) H(Li | Le = M) + P(Le = L) H(Li | Le = L)
             = 1/3 * 0.92 + 1/3 * 0 + 1/3 * 0.92 = 0.61

Information gain
• How much do we gain (in terms of reduction in entropy) from knowing one of the attributes?
• In other words, what is the reduction in entropy from this knowledge?
• Definition: IG(X | Y) = H(X) − H(X | Y)
• IG(X | Y) is always ≥ 0 (proof: Jensen's inequality)

Where we are
• We were looking for a good criterion for selecting the best attribute; information gain gives us one: pick the attribute that reduces the entropy of the label the most
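To make the worked example concrete, here is a small Python sketch (not part of the original slides; the helper names entropy, conditional_entropy, and information_gain are mine) that recomputes the numbers above from the nine movies:

```python
import math
from collections import Counter

def entropy(labels):
    """Empirical entropy H(X) = -sum_c p(c) log2 p(c) of a list of labels."""
    n = len(labels)
    h = -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())
    return abs(h)  # entropy is non-negative; abs() just turns -0.0 into 0.0

def conditional_entropy(attr_values, labels):
    """H(X | Y) = sum_i P(Y = i) * H(X | Y = i)."""
    n = len(labels)
    return sum(
        (attr_values.count(v) / n)
        * entropy([lab for a, lab in zip(attr_values, labels) if a == v])
        for v in set(attr_values)
    )

def information_gain(attr_values, labels):
    """IG(X | Y) = H(X) - H(X | Y), always >= 0."""
    return entropy(labels) - conditional_entropy(attr_values, labels)

# Length and Liked? columns for movies m1..m9
length = ["Short", "Short", "Medium", "Long", "Long", "Medium", "Short", "Long", "Medium"]
liked  = ["Yes",   "No",    "Yes",    "No",   "No",   "Yes",    "Yes",   "Yes",  "Yes"]

for v in ["Short", "Medium", "Long"]:
    subset = [lab for le, lab in zip(length, liked) if le == v]
    print(v, round(entropy(subset), 2))              # Short 0.92, Medium 0.0, Long 0.92

print(round(conditional_entropy(length, liked), 2))  # 0.61, as on the slide
print(round(information_gain(length, liked), 3))     # ~0.306 bits gained by knowing Length
```

One natural bestAttribute is then the attribute whose column maximizes information_gain against the labels.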
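Finally, the BuildTree pseudocode from earlier in the deck could be fleshed out along these lines. This is only a rough sketch under my own assumptions (categorical attributes, rows stored as dicts, information gain as the bestAttribute criterion); the names build_tree and best_attribute are mine, and the entropy helpers are repeated so the snippet stands alone:

```python
import math
from collections import Counter

def entropy(labels):
    # H = -sum_c p(c) log2 p(c) over the empirical label distribution
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())

def information_gain(rows, labels, attr):
    # IG(labels | attr) = H(labels) - H(labels | attr)
    n = len(labels)
    cond = sum(
        (sum(1 for r in rows if r[attr] == v) / n)
        * entropy([lab for r, lab in zip(rows, labels) if r[attr] == v])
        for v in set(r[attr] for r in rows)
    )
    return entropy(labels) - cond

def best_attribute(rows, labels, attributes):
    # Pick the attribute with the highest information gain (the criterion developed above)
    return max(attributes, key=lambda a: information_gain(rows, labels, a))

def build_tree(rows, labels, attributes):
    # Leaf case: no attributes left, or all labels agree -> predict the most common class
    if not attributes or len(set(labels)) == 1:
        return Counter(labels).most_common(1)[0][0]
    # Internal node: split on the best attribute and recurse on each subset of rows
    a = best_attribute(rows, labels, attributes)
    rest = [x for x in attributes if x != a]
    children = {}
    for v in set(r[a] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[a] == v]
        children[v] = build_tree([rows[i] for i in idx], [labels[i] for i in idx], rest)
    return {"split_on": a, "children": children}
```

Calling build_tree on the Netflix table (rows as dicts keyed by 'Type', 'Length', 'Director', 'Famous actors', with the Liked? column as labels) would first split on whichever attribute has the highest information gain.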