15-381: Artificial Intelligence
Decision Trees

Bayes and Naïve Bayes classifier
• Bayes classifiers find the label that maximizes the posterior:

  \hat{y} = \arg\max_v p(y = v \mid x) = \arg\max_v \frac{p(x \mid y = v)\, p(y = v)}{p(x)} = \arg\max_v p(x \mid y = v)\, p(y = v)

• Naïve Bayes models assume independence of the features given the label, leading to the following over documents x (with features x_i) and labels:

  \hat{y} = \arg\max_v p(x \mid y = v)\, p(y = v) = \arg\max_v \left( \prod_i p(x_i \mid y = v) \right) p(y = v)

Here is a dataset
48,000 records, 16 attributes [Kohavi 1995]

age | employment | education | edunum | marital | … | job | relation | race | gender | hours_worked | country | wealth
39 | State_gov | Bachelors | 13 | Never_married | … | Adm_clerical | Not_in_family | White | Male | 40 | United_States | poor
51 | Self_emp_not_inc | Bachelors | 13 | Married | … | Exec_managerial | Husband | White | Male | 13 | United_States | poor
39 | Private | HS_grad | 9 | Divorced | … | Handlers_cleaners | Not_in_family | White | Male | 40 | United_States | poor
54 | Private | 11th | 7 | Married | … | Handlers_cleaners | Husband | Black | Male | 40 | United_States | poor
28 | Private | Bachelors | 13 | Married | … | Prof_specialty | Wife | Black | Female | 40 | Cuba | poor
38 | Private | Masters | 14 | Married | … | Exec_managerial | Wife | White | Female | 40 | United_States | poor
50 | Private | 9th | 5 | Married_spouse_absent | … | Other_service | Not_in_family | Black | Female | 16 | Jamaica | poor
52 | Self_emp_not_inc | HS_grad | 9 | Married | … | Exec_managerial | Husband | White | Male | 45 | United_States | rich
31 | Private | Masters | 14 | Never_married | … | Prof_specialty | Not_in_family | White | Female | 50 | United_States | rich
42 | Private | Bachelors | 13 | Married | … | Exec_managerial | Husband | White | Male | 40 | United_States | rich
37 | Private | Some_college | 10 | Married | … | Exec_managerial | Husband | Black | Male | 80 | United_States | rich
30 | State_gov | Bachelors | 13 | Married | … | Prof_specialty | Husband | Asian | Male | 40 | India | rich
24 | Private | Bachelors | 13 | Never_married | … | Adm_clerical | Own_child | White | Female | 30 | United_States | poor
33 | Private | Assoc_acdm | 12 | Never_married | … | Sales | Not_in_family | Black | Male | 50 | United_States | poor
41 | Private | Assoc_voc | 11 | Married | … | Craft_repair | Husband | Asian | Male | 40 | *MissingValue* | rich
34 | Private | 7th_8th | 4 | Married | … | Transport_moving | Husband | Amer_Indian | Male | 45 | Mexico | poor
26 | Self_emp_not_inc | HS_grad | 9 | Never_married | … | Farming_fishing | Own_child | White | Male | 35 | United_States | poor
33 | Private | HS_grad | 9 | Never_married | … | Machine_op_inspct | Unmarried | White | Male | 40 | United_States | poor
38 | Private | 11th | 7 | Married | … | Sales | Husband | White | Male | 50 | United_States | poor
44 | Self_emp_not_inc | Masters | 14 | Divorced | … | Exec_managerial | Unmarried | White | Female | 45 | United_States | rich
41 | Private | Doctorate | 16 | Married | … | Prof_specialty | Husband | White | Male | 60 | United_States | rich
: | : | : | : | : | : | : | : | : | : | : | : | :

[Figures omitted: Gaussian density plots for "Predicting wealth from age", "Wealth from years of education", "age, hours → wealth", "age, edunum → wealth", and an accuracy comparison.]

Having 2 inputs instead of one helps in two ways:
1. Combining evidence from two 1d Gaussians
2. Off-diagonal covariance distinguishes class "shape"

Generative classification models
• So far we discussed models that capture joint probability distributions
• These have many uses, and as we discussed they can be used to determine binary values for variables
• However, they also rely on specific assumptions and require the estimation of many parameters:
  - model structure
  - probability model
  - model parameters

Generative vs. discriminative models
• Graphical models can be used for classification
  - They represent a subset of classifiers known as 'generative models'
• But we can also design classifiers that are more specific to a given task and do not require an estimation of joint probabilities
  - These are often referred to as discriminative models
• Examples:
  - Support vector machines (SVM)
  - Decision trees

Decision trees
• One of the most intuitive classifiers
• Easy to understand and construct
• Surprisingly, they also work very (very) well*
  * More on this towards the end of this lecture

Let's build a decision tree!

Structure of a decision tree
[Figure omitted: an example tree whose internal nodes test A (age > 26), I (income > 40K), C (citizen), and F (female), with yes/no leaves.]
• Internal nodes correspond to attributes (features)
• Leaves correspond to classification outcomes
• Edges denote assignments

Netflix dataset
Attributes (features): Type, Length, Director, Famous actors. Label: Liked?

Movie | Type | Length | Director | Famous actors | Liked?
m1 | Comedy | Short | Adamson | No | Yes
m2 | Animated | Short | Lasseter | No | No
m3 | Drama | Medium | Adamson | No | Yes
m4 | Animated | Long | Lasseter | Yes | No
m5 | Comedy | Long | Lasseter | Yes | No
m6 | Drama | Medium | Singer | Yes | Yes
m7 | Animated | Short | Singer | No | Yes
m8 | Comedy | Long | Adamson | Yes | Yes
m9 | Drama | Medium | Lasseter | No | Yes

Building a decision tree

Function BuildTree(n, A)                 // n: samples (rows), A: attributes
  If empty(A) or all n(L) are the same   // n(L): labels for the samples in this set
    status = leaf
    class = most common class in n(L)
  else
    status = internal
    a ⇐ bestAttribute(n, A)              // we will discuss this function next
    // Recursive calls create the left and right subtrees;
    // n(a=1) is the set of samples in n for which attribute a is 1
    LeftNode = BuildTree(n(a=1), A \ {a})
    RightNode = BuildTree(n(a=0), A \ {a})
  end
end

Identifying 'bestAttribute'
• There are many possible ways to select the best attribute for a given set.
• We will discuss one possible way, which is based on information theory and generalizes well to non-binary variables.

Entropy
• Quantifies the amount of uncertainty associated with a specific probability distribution
• The higher the entropy, the less confident we are in the outcome
• Definition:

  H(X) = -\sum_c p(X = c) \log_2 p(X = c)

(Claude Shannon (1916–2001); most of his work was done at Bell Labs.)

Entropy
• If P(X=1) = 1 then

  H(X) = -p(X = 1) \log_2 p(X = 1) - p(X = 0) \log_2 p(X = 0) = -1 \log_2 1 - 0 \log_2 0 = 0

• If P(X=1) = 0.5 then

  H(X) = -0.5 \log_2 0.5 - 0.5 \log_2 0.5 = 1

Interpreting entropy
• Entropy can be interpreted from an information standpoint
• Assume both sender and receiver know the distribution. How many bits, on average, would it take to transmit one value?
• If P(X=1) = 1 then the answer is 0 (we don't need to transmit anything)
• If P(X=1) = 0.5 then the answer is 1 (either value is equally likely)
• If 0 < P(X=1) < 0.5 or 0.5 < P(X=1) < 1 then the answer is between 0 and 1
  - Why?

Expected bits per symbol
• Assume P(X=1) = 0.8
• Then P(11) = 0.64, P(10) = P(01) = 0.16 and P(00) = 0.04
• Let's define the following code -
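The Naïve Bayes decision rule discussed earlier can be sketched in a few lines of Python. This is an illustrative maximum-likelihood implementation for categorical features, not code from the lecture; the names (`nb_fit`, `nb_predict`) and the small `eps` guard against zero counts are my own choices.

```python
import math
from collections import Counter

def nb_fit(X, y):
    """MLE estimates for a categorical Naive Bayes model (no smoothing).
    Returns class priors, per-(class, feature, value) counts, and class counts."""
    n = len(y)
    class_count = Counter(y)
    prior = {v: class_count[v] / n for v in class_count}
    # counts[(v, i, val)] = number of class-v samples whose feature i equals val
    counts = Counter()
    for xi, yi in zip(X, y):
        for i, val in enumerate(xi):
            counts[(yi, i, val)] += 1
    return prior, counts, class_count

def nb_predict(x, prior, counts, class_count, eps=1e-9):
    """y_hat = argmax_v [ prod_i p(x_i | y=v) ] p(y=v), computed in log space."""
    def score(v):
        s = math.log(prior[v])
        for i, val in enumerate(x):
            p = counts[(v, i, val)] / class_count[v]
            s += math.log(p + eps)  # eps keeps unseen (class, value) pairs finite
        return s
    return max(prior, key=score)
```

Working in log space avoids underflow when the product over many features becomes tiny; the argmax is unchanged because log is monotonic.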
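The BuildTree pseudocode above can be made runnable. This sketch assumes binary 0/1 attributes passed as a set of column indices, fills in `best_attribute` with the information-gain criterion the lecture goes on to develop, and represents internal nodes as `(attribute, left, right)` tuples with majority-class leaves; all of those representation choices are mine, not the lecture's.

```python
import math
from collections import Counter

def label_entropy(labels):
    """H of the empirical label distribution, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_attribute(rows, labels, attrs):
    """One option for bestAttribute: maximize information gain of the split."""
    def gain(a):
        remainder = 0.0
        for v in (0, 1):
            sub = [l for r, l in zip(rows, labels) if r[a] == v]
            if sub:
                remainder += len(sub) / len(labels) * label_entropy(sub)
        return label_entropy(labels) - remainder
    return max(attrs, key=gain)

def build_tree(rows, labels, attrs):
    """Direct transcription of the BuildTree pseudocode for 0/1 attributes."""
    if not attrs or len(set(labels)) == 1:
        return Counter(labels).most_common(1)[0][0]   # leaf: most common class
    a = best_attribute(rows, labels, attrs)
    ones = [(r, l) for r, l in zip(rows, labels) if r[a] == 1]
    zeros = [(r, l) for r, l in zip(rows, labels) if r[a] == 0]
    if not ones or not zeros:                         # degenerate split -> leaf
        return Counter(labels).most_common(1)[0][0]
    left = build_tree([r for r, _ in ones], [l for _, l in ones], attrs - {a})
    right = build_tree([r for r, _ in zeros], [l for _, l in zeros], attrs - {a})
    return (a, left, right)

def predict(tree, row):
    """Walk from the root to a leaf by following attribute assignments."""
    while isinstance(tree, tuple):
        a, left, right = tree
        tree = left if row[a] == 1 else right
    return tree
```

On data where attribute 0 determines the label, `build_tree` splits on attribute 0 at the root and both children immediately become pure leaves, exactly as the recursion in the pseudocode prescribes.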
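The entropy definition translates directly to code. A minimal sketch; the `if p > 0` filter implements the usual convention that 0 · log₂ 0 = 0, which the P(X=1) = 1 example above relies on.

```python
import math

def entropy(probs):
    """H(X) = -sum_c p(X=c) * log2 p(X=c), with 0 * log2(0) taken as 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)
```

This reproduces the two worked cases: a deterministic variable has entropy 0, and a fair binary variable has entropy 1 bit.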
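The code construction in the last slide is cut off, but the stated block probabilities for P(X=1) = 0.8 and the entropy lower bound on bits per symbol can be checked numerically. This sketch assumes independent, identically distributed symbols, which is what the slide's P(11) = P(1)·P(1) arithmetic implies.

```python
import math
from itertools import product

p1 = 0.8
p = {"1": p1, "0": 1 - p1}

# Probabilities of two-symbol blocks: P(11) = 0.64, P(10) = P(01) = 0.16, P(00) = 0.04
pairs = {a + b: p[a] * p[b] for a, b in product("10", repeat=2)}

# Entropy lower-bounds the average number of bits per symbol for any code
H = -sum(q * math.log2(q) for q in p.values())
```

For p = 0.8 this gives H ≈ 0.722 bits, so a code over pairs of symbols can beat 1 bit per symbol, which motivates coding blocks rather than single values.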