Decision Tree Learning
[read Chapter 3]
[recommended exercises 3.1, 3.4]

- Decision tree representation
- ID3 learning algorithm
- Entropy, information gain
- Overfitting

(lecture slides for textbook Machine Learning, © Tom M. Mitchell, McGraw Hill, 1997)

Decision Tree for PlayTennis

Outlook = Sunny
|  Humidity = High:   No
|  Humidity = Normal: Yes
Outlook = Overcast:   Yes
Outlook = Rain
|  Wind = Strong: No
|  Wind = Weak:   Yes

A Tree to Predict C-Section Risk

Learned from medical records of 1000 women; negative examples are C-sections.

[833+,167-] .83+ .17-
Fetal_Presentation = 1: [822+,116-] .88+ .12-
| Previous_Csection = 0: [767+,81-] .90+ .10-
| | Primiparous = 0: [399+,13-] .97+ .03-
| | Primiparous = 1: [368+,68-] .84+ .16-
| | | Fetal_Distress = 0: [334+,47-] .88+ .12-
| | | | Birth_Weight < 3349:  [201+,10.6-] .95+ .05-
| | | | Birth_Weight >= 3349: [133+,36.4-] .78+ .22-
| | | Fetal_Distress = 1: [34+,21-] .62+ .38-
| Previous_Csection = 1: [55+,35-] .61+ .39-
Fetal_Presentation = 2: [3+,29-] .11+ .89-
Fetal_Presentation = 3: [8+,22-] .27+ .73-

Decision Trees

Decision tree representation:
- Each internal node tests an attribute
- Each branch corresponds to an attribute value
- Each leaf node assigns a classification

How would we represent: ∧, ∨, XOR? (A ∧ B) ∨ (C ∧ ¬D ∧ E)? M of N?

When to Consider Decision Trees

- Instances describable by attribute-value pairs
- Target function is discrete valued
- Disjunctive hypothesis may be required
- Possibly noisy training data

Examples:
- Equipment or medical diagnosis
- Credit risk analysis
- Modeling calendar scheduling preferences

Top-Down Induction of Decision Trees

Main loop:
1. A ← the "best" decision attribute for the next node
2. Assign A as the decision attribute for node
3. For each value of A, create a new descendant of node
4.
   Sort training examples to leaf nodes
5. If training examples are perfectly classified, then STOP; else iterate over the new leaf nodes

Which attribute is best?

S: [29+,35-]
A1 = t: [21+,5-]    A1 = f: [8+,30-]
A2 = t: [18+,33-]   A2 = f: [11+,2-]

Entropy

[Plot: Entropy(S) as a function of p⊕, rising from 0 at p⊕ = 0 to 1.0 at p⊕ = 0.5 and back to 0 at p⊕ = 1.0]

- S is a sample of training examples
- p⊕ is the proportion of positive examples in S
- p⊖ is the proportion of negative examples in S
- Entropy measures the impurity of S:

  Entropy(S) ≡ −p⊕ log2 p⊕ − p⊖ log2 p⊖

Entropy

Entropy(S) = expected number of bits needed to encode the class (⊕ or ⊖) of a randomly drawn member of S (under the optimal, shortest-length code)

Why? Information theory: an optimal-length code assigns −log2 p bits to a message having probability p. So the expected number of bits to encode ⊕ or ⊖ for a random member of S is

  p⊕(−log2 p⊕) + p⊖(−log2 p⊖)

  Entropy(S) ≡ −p⊕ log2 p⊕ − p⊖ log2 p⊖

Information Gain

Gain(S, A) = expected reduction in entropy due to sorting on A:

  Gain(S, A) ≡ Entropy(S) − Σ_{v ∈ Values(A)} (|Sv| / |S|) Entropy(Sv)

S: [29+,35-]
A1 = t: [21+,5-]    A1 = f: [8+,30-]
A2 = t: [18+,33-]   A2 = f: [11+,2-]

Training Examples

Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No
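The entropy and information-gain definitions above, applied in ID3's main loop to the training examples in this table, can be sketched in Python. This is a minimal illustration, not Mitchell's full ID3 (it omits, e.g., continuous and missing attribute values); the names `EXAMPLES`, `ATTRIBUTES`, `entropy`, `gain`, and `id3` are my own.

```python
from math import log2
from collections import Counter

# PlayTennis training examples from the table above:
# (Outlook, Temperature, Humidity, Wind, PlayTennis)
EXAMPLES = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]
ATTRIBUTES = ["Outlook", "Temperature", "Humidity", "Wind"]

def entropy(examples):
    """Entropy(S) = -p+ log2 p+ - p- log2 p-  (0 log 0 taken as 0)."""
    n = len(examples)
    counts = Counter(ex[-1] for ex in examples)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def gain(examples, attribute):
    """Gain(S, A) = Entropy(S) - sum_v (|Sv|/|S|) Entropy(Sv)."""
    i = ATTRIBUTES.index(attribute)
    partitions = {}
    for ex in examples:
        partitions.setdefault(ex[i], []).append(ex)
    remainder = sum(len(sv) / len(examples) * entropy(sv)
                    for sv in partitions.values())
    return entropy(examples) - remainder

def id3(examples, attributes):
    """Main loop: pick the best attribute, branch on its values,
    sort examples to the branches, and recurse until leaves are pure."""
    labels = Counter(ex[-1] for ex in examples)
    if len(labels) == 1 or not attributes:
        return labels.most_common(1)[0][0]   # leaf: (majority) class
    best = max(attributes, key=lambda a: gain(examples, a))
    i = ATTRIBUTES.index(best)
    remaining = [a for a in attributes if a != best]
    return (best, {
        value: id3([ex for ex in examples if ex[i] == value], remaining)
        for value in sorted({ex[i] for ex in examples})
    })

tree = id3(EXAMPLES, ATTRIBUTES)
```

On this data the sketch selects Outlook at the root, Humidity under Sunny, and Wind under Rain, reproducing the PlayTennis tree shown earlier; `entropy(EXAMPLES)` and `gain(EXAMPLES, ...)` reproduce (to rounding) the 0.940, 0.151, and 0.048 values worked through on the following slides.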
Selecting the Next Attribute

Which attribute is the best classifier?

S: [9+,5-], E = 0.940

Humidity = High:   [3+,4-], E = 0.985
Humidity = Normal: [6+,1-], E = 0.592
Gain(S, Humidity) = 0.940 − (7/14)(0.985) − (7/14)(0.592) = 0.151

Wind = Weak:   [6+,2-], E = 0.811
Wind = Strong: [3+,3-], E = 1.00
Gain(S, Wind) = 0.940 − (8/14)(0.811) − (6/14)(1.0) = 0.048

{D1, D2, ..., D14}  [9+,5-]
Outlook = Sunny:    {D1,D2,D8,D9,D11}  [2+,3-]  ?
Outlook = Overcast: {D3,D7,D12,D13}    [4+,0-]  Yes
Outlook = Rain:     {D4,D5,D6,D10,D14} [3+,2-]  ?

Which attribute should be tested here?

Ssunny = {D1,D2,D8,D9,D11}
Gain(Ssunny, Humidity)    = 0.970 − (3/5)(0.0) − (2/5)(0.0) = 0.970
Gain(Ssunny, Temperature) = 0.970 − (2/5)(0.0) − (2/5)(1.0) − (1/5)(0.0) = 0.570
Gain(Ssunny, Wind)        = 0.970 − (2/5)(1.0) − (3/5)(0.918) = 0.019

Hypothesis Space Search by ID3

[Figure: ID3 searches the space of decision trees, growing from the empty tree by successively adding attribute tests such as A1, A2, A3, A4]

Hypothesis Space Search by ID3

- Hypothesis space is complete!
  - Target function surely in there...
- Outputs a single hypothesis (which one?)
  - Can't play 20 questions...
- No backtracking
  - Local minima...
- Statistically-based search choices
  - Robust to noisy data...
- Inductive bias: approximately "prefer the shortest tree"

Inductive Bias in ID3

Note H is the power set of instances X
→ Unbiased?

Not really...
- Preference for short trees, and for those with high information gain attributes near the root
- Bias is a preference for some hypotheses, rather than a restriction of the hypothesis space H
- Occam's razor: prefer the shortest hypothesis that fits the data

Occam's Razor

Why prefer short hypotheses?

Argument in favor:
- Fewer short hypotheses
  than long hypotheses
  → a short hypothesis that fits the data is unlikely to be coincidence
  → a long hypothesis that fits the data might be coincidence

Argument opposed:
- There are many ways to define small sets of hypotheses, e.g., all trees with a prime number of nodes that use attributes beginning with "Z"
- What's so special about small sets based on the size of the hypothesis?

Overfitting in Decision Trees

Consider adding