15-780: Graduate Artificial Intelligence
Decision trees

Graphical models
• So far we discussed models that capture joint probability distributions
• These have many uses, and can also be used to determine binary values for variables
 - For example, did a burglary occur?
• However, they also require us to make many assumptions and to fit many parameters:
 - model structure
 - probability model
 - model parameters

Classification
• In many cases we are only interested in one specific variable
• Examples:
 - Does the robot have to turn? Slow down?
 - What digit is in each of the squares?
 - Does the patient have cancer?

Generative vs. discriminative models
• Graphical models can be used for classification
 - They represent a subset of classifiers known as 'generative models'
• But we can also design classifiers that are more specific to a given task and do not require an estimate of the joint probabilities
 - These are often referred to as discriminative models
• Examples:
 - Support vector machines (SVMs)
 - Decision trees

Decision trees
• One of the most intuitive classifiers
• Easy to understand and construct
• Surprisingly, also works very (very) well*
  * More on this towards the end of this lecture

Let's build a decision tree!

Structure of a decision tree
[Figure: small example tree with internal nodes A, C, I, F and edges labeled yes/no. Legend: A = age > 26, I = income > 40K, C = citizen, F = female]
• Internal nodes correspond to attributes (features)
• Leaves correspond to classification outcomes
• Edges denote value assignments

Netflix dataset
Attributes (features): Type, Length, Director, Famous actors. Label: Liked?

Movie  Type      Length  Director  Famous actors  Liked?
m1     Comedy    Short   Adamson   No             Yes
m2     Animated  Short   Lasseter  No             No
m3     Drama     Medium  Adamson   No             Yes
m4     Animated  Long    Lasseter  Yes            No
m5     Comedy    Long    Lasseter  Yes            No
m6     Drama     Medium  Singer    Yes            Yes
m7     Animated  Short   Singer    No             Yes
m8     Comedy    Long    Adamson   Yes            Yes
m9     Drama     Medium  Lasseter  No             Yes

Building a decision tree
Function BuildTree(n, A)                     // n: samples (rows), A: attributes
  If empty(A) or all n(L) are the same       // n(L): labels of the samples in this set
    status = leaf
    class = most common class in n(L)
  else
    status = internal
    a ⇐ bestAttribute(n, A)                  // we will discuss this function next
    LeftNode  = BuildTree(n(a=1), A \ {a})   // recursive calls to create the left and right
    RightNode = BuildTree(n(a=0), A \ {a})   // subtrees; n(a=1) is the set of samples in n
  end                                        // for which attribute a is 1
end

Identifying 'bestAttribute'
• There are many possible ways to select the best attribute for a given set
• We will discuss one possible way, which is based on information theory and generalizes well to non-binary variables

Entropy
• Quantifies the amount of uncertainty associated with a specific probability distribution
• The higher the entropy, the less confident we are in the outcome
• Definition: H(X) = −Σ_c p(X = c) log₂ p(X = c)
(Claude Shannon, 1916–2001; most of this work was done at Bell Labs)

Entropy
• Definition: H(X) = −Σ_i p(X = i) log₂ p(X = i)
• So, if P(X=1) = 1 then
  H(X) = −p(X=1) log₂ p(X=1) − p(X=0) log₂ p(X=0) = −1 log₂ 1 − 0 log₂ 0 = 0
• If P(X=1) = 0.5 then
  H(X) = −p(X=1) log₂ p(X=1) − p(X=0) log₂ p(X=0) = −0.5 log₂ 0.5 − 0.5 log₂ 0.5 = 1

Interpreting entropy
• Entropy can be interpreted from an information standpoint
• Assume both sender and receiver know the distribution. How many bits, on average, would it take to transmit one value?
• If P(X=1) = 1 then the answer is 0 (we don't need to transmit anything)
• If P(X=1) = 0.5 then the answer is 1 (either value is equally likely)
• If 0 < P(X=1) < 0.5 or 0.5 < P(X=1) < 1, the answer is between 0 and 1
 - Why?
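As a quick numerical check of the values above (this snippet is not part of the original slides; the helper name binary_entropy is mine), here is a minimal Python sketch of the binary entropy function:

```python
import math

def binary_entropy(p):
    """H(X) = -p*log2(p) - (1-p)*log2(1-p), using the convention 0*log2(0) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(binary_entropy(1.0))  # 0.0  -> nothing needs to be transmitted
print(binary_entropy(0.5))  # 1.0  -> one full bit per value
print(binary_entropy(0.8))  # ~0.7219 -> strictly between 0 and 1
```

The P(X=1) = 0.8 case is the one used in the coding example on the next slides.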
Expected bits per symbol
• Assume P(X=1) = 0.8
• Then, for pairs of values, P(11) = 0.64, P(10) = P(01) = 0.16 and P(00) = 0.04
• Let's define the following code:
 - For 11 we send 0
 - For 10 we send 01
 - For 01 we send 011
 - For 00 we send 0111
• (Note that 0 is a prefix of every other codeword here, so the receiver cannot decode symbol by symbol)

Expected bits per symbol
• Assume P(X=1) = 0.8, so P(11) = 0.64, P(10) = P(01) = 0.16 and P(00) = 0.04
• Let's instead define the following prefix-free code:
 - For 11 we send 0
 - For 10 we send 10
 - For 01 we send 110
 - For 00 we send 1110
• What is the expected number of bits per symbol?
  (0.64*1 + 0.16*2 + 0.16*3 + 0.04*4) / 2 = 0.8
• Entropy (the lower bound): H(X) = 0.7219
• So 01001101110001101110 can be broken into 0 10 0 110 1110 0 0 110 1110, which decodes to 11 10 11 01 00 11 11 01 00

Conditional entropy
• Entropy measures the uncertainty in a specific distribution
• What if both sender and receiver know something about the transmission?
• For example, say I want to send the label (Liked) when the length is known
• This becomes a conditional entropy problem: H(Li | Le = v) is the entropy of Liked among movies with length v

Movie length  Liked?
Short         Yes
Short         No
Medium        Yes
Long          No
Long          No
Medium        Yes
Short         Yes
Long          Yes
Medium        Yes
(the Length and Liked? columns of the Netflix table, movies m1-m9)

Conditional entropy: examples for specific values
• Let's compute H(Li | Le = v):
 1. H(Li | Le = S) = 0.92
 2. H(Li | Le = M) = 0
 3. H(Li | Le = L) = 0.92

Conditional entropy
• We can generalize the conditional entropy idea to determine H(Li | Le)
• That is, what is the expected number of bits we need to transmit if both sides know the value of Le for each of the records (samples)?
• Definition: H(X | Y) = Σ_i P(Y = i) H(X | Y = i)
  (we explained how to compute each H(X | Y = i) in the previous slides)

Conditional entropy: example
• Let's compute H(Li | Le), using the values H(Li | Le = S) = 0.92, H(Li | Le = M) = 0, H(Li | Le = L) = 0.92 from above:
  H(Li | Le) = P(Le = S) H(Li | Le = S) + P(Le = M) H(Li | Le = M) + P(Le = L) H(Li | Le = L)
             = 1/3 * 0.92 + 1/3 * 0 + 1/3 * 0.92 = 0.61

Information gain
• How much do we gain (in terms of reduction in entropy) from knowing one of the attributes?
• In other words, what is the reduction in entropy from this knowledge?
• Definition: IG(X | Y) = H(X) − H(X | Y)
• IG(X | Y) is always ≥ 0 (proof: Jensen's inequality)

Where we are
• We were looking for a good criterion for selecting the best attribute; information gain gives us one: pick the attribute that reduces the entropy of the label the most
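To make the worked example concrete, here is a small Python sketch (not part of the original slides; the helper names entropy, conditional_entropy, and information_gain are mine) that recomputes the numbers above from the nine movies:

```python
import math
from collections import Counter

def entropy(labels):
    """Empirical entropy H(X) = -sum_c p(c) log2 p(c) of a list of labels."""
    n = len(labels)
    h = -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())
    return abs(h)  # entropy is non-negative; abs() just turns -0.0 into 0.0

def conditional_entropy(attr_values, labels):
    """H(X | Y) = sum_i P(Y = i) * H(X | Y = i)."""
    n = len(labels)
    return sum(
        (attr_values.count(v) / n)
        * entropy([lab for a, lab in zip(attr_values, labels) if a == v])
        for v in set(attr_values)
    )

def information_gain(attr_values, labels):
    """IG(X | Y) = H(X) - H(X | Y), always >= 0."""
    return entropy(labels) - conditional_entropy(attr_values, labels)

# Length and Liked? columns for movies m1..m9
length = ["Short", "Short", "Medium", "Long", "Long", "Medium", "Short", "Long", "Medium"]
liked  = ["Yes",   "No",    "Yes",    "No",   "No",   "Yes",    "Yes",   "Yes",  "Yes"]

for v in ["Short", "Medium", "Long"]:
    subset = [lab for le, lab in zip(length, liked) if le == v]
    print(v, round(entropy(subset), 2))              # Short 0.92, Medium 0.0, Long 0.92

print(round(conditional_entropy(length, liked), 2))  # 0.61, as on the slide
print(round(information_gain(length, liked), 3))     # ~0.306 bits gained by knowing Length
```

One natural bestAttribute is then the attribute whose column maximizes information_gain against the labels.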
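Finally, the BuildTree pseudocode from earlier in the deck could be fleshed out along these lines. This is only a rough sketch under my own assumptions (categorical attributes, rows stored as dicts, information gain as the bestAttribute criterion); the names build_tree and best_attribute are mine, and the entropy helpers are repeated so the snippet stands alone:

```python
import math
from collections import Counter

def entropy(labels):
    # H = -sum_c p(c) log2 p(c) over the empirical label distribution
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())

def information_gain(rows, labels, attr):
    # IG(labels | attr) = H(labels) - H(labels | attr)
    n = len(labels)
    cond = sum(
        (sum(1 for r in rows if r[attr] == v) / n)
        * entropy([lab for r, lab in zip(rows, labels) if r[attr] == v])
        for v in set(r[attr] for r in rows)
    )
    return entropy(labels) - cond

def best_attribute(rows, labels, attributes):
    # Pick the attribute with the highest information gain (the criterion developed above)
    return max(attributes, key=lambda a: information_gain(rows, labels, a))

def build_tree(rows, labels, attributes):
    # Leaf case: no attributes left, or all labels agree -> predict the most common class
    if not attributes or len(set(labels)) == 1:
        return Counter(labels).most_common(1)[0][0]
    # Internal node: split on the best attribute and recurse on each subset of rows
    a = best_attribute(rows, labels, attributes)
    rest = [x for x in attributes if x != a]
    children = {}
    for v in set(r[a] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[a] == v]
        children[v] = build_tree([rows[i] for i in idx], [labels[i] for i in idx], rest)
    return {"split_on": a, "children": children}
```

Calling build_tree on the Netflix table (rows as dicts keyed by 'Type', 'Length', 'Director', 'Famous actors', with the Liked? column as labels) would first split on whichever attribute has the highest information gain.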