UT Dallas CS 6375 - ID3

Entropy notes

We are considering instances that are identified by feature values. For example, the feature can be color, and the feature values can be Red, Green, Blue. Probabilities are associated with the likelihood that an instance has a particular feature value. One can view the probability of a feature value as a measure of (un)certainty that an instance has that value. The goal of entropy is to associate a measure of uncertainty with a partition of instances according to their feature values.

Let S be a partition of the instances into the subsets $s_1, \ldots, s_n$, according to a feature that can have n distinct values. Set

$$p_i = \mathrm{Prob}(\text{instance is in } s_i) \equiv \mathrm{Prob}(s_i)$$

The $p_i$ values form a probability vector.

Definition: The vector $p = (p_1, p_2, \ldots, p_n)$ is a probability vector if $p_i \ge 0$ and $p_1 + \ldots + p_n = 1$.

Let S be a partition.

Entropy(S) = the measure of uncertainty about S

If S is partitioned into $\{s_1, \ldots, s_n\}$ with $p_i = \mathrm{Prob}(s_i)$, then:

$$\mathrm{Entropy}(S) = \sum_{i=1}^{n} p_i \log(1/p_i)$$

We typically use base 2 for the log, and then the entropy is measured in bits. It can be shown that this form of entropy is, up to a constant, the only one that satisfies a set of axioms reflecting the intuitive understanding of uncertainty. Examples are the following three axioms:

1. Entropy(S) is a continuous function of the $p_i$.
2. If $p_1 = p_2 = \ldots = p_n$, then Entropy(S) is an increasing function of n.
3. If a new partition T is formed from S by subdividing one of the subsets of S, then Entropy(T) ≥ Entropy(S).

Example

Consider an experiment that produces instances with two possible feature values: a and b. We view these feature values as the values of the "target attribute". Running the experiment 10 times, we get a 6 times and b 4 times. This enables us to estimate the following probabilities:

Case 1: Prob(a) = 0.6, Prob(b) = 0.4.

The value 0.6 is the likelihood that the next experiment will produce a. Thus, we are not completely uncertain about the outcome. We are completely uncertain in Case 2, and there is no uncertainty in Case 3:

Case 2: Prob(a) = 0.5, Prob(b) = 0.5.
Case 3: Prob(a) = 1, Prob(b) = 0.

Computing the entropy for these cases (with the convention $0 \log(1/0) = 0$):

Case 1: Entropy(0.6, 0.4) = 0.6 log(1/0.6) + 0.4 log(1/0.4) = 0.970951
Case 2: Entropy(0.5, 0.5) = 0.5 log(1/0.5) + 0.5 log(1/0.5) = 1
Case 3: Entropy(1, 0) = 1 log(1/1) + 0 log(1/0) = 0

Conditional entropy

The conditional entropy of S, assuming it is known that t happened, is:

$$\mathrm{Entropy}(S \mid t) = \sum_{i=1}^{n} p_i \log(1/p_i), \quad \text{where } p_i = \mathrm{Prob}(s_i \mid t).$$

Let $S = \{s_1, \ldots, s_n\}$ and $T = \{t_1, \ldots, t_m\}$ be two partitions. The uncertainty about S given that we know the result on T is:

$$\mathrm{Entropy}(S \mid T) = \sum_{j=1}^{m} \mathrm{Prob}(t_j)\,\mathrm{Entropy}(S \mid t_j) = \sum_{j=1}^{m} \mathrm{Prob}(t_j) \sum_{i=1}^{n} p_{ij} \log(1/p_{ij})$$

where $p_{ij} = \mathrm{Prob}(s_i \mid t_j)$.

Example

Continuing with the previous example, assume that we can see the color of each instance, and that the color is one of R, G, B. Here is the table of the experimental results that we observe:

        color   Target attribute
  1     R       a
  2     R       a
  3     R       b
  4     R       a
  5     G       b
  6     G       a
  7     G       a
  8     G       b
  9     B       a
 10     B       b

What is the entropy if it is known that the color is R? The new estimates for the probabilities and the entropy are:

Prob(a|R) = 3/4, Prob(b|R) = 1/4.  Entropy(3/4, 1/4) = 0.811278.

What is the entropy if it is known that the color is B?

Prob(a|B) = 1/2, Prob(b|B) = 1/2.  Entropy(1/2, 1/2) = 1.

Observe that the conditional entropy given a single feature value can be smaller or larger than the unconditional entropy. Now what is the entropy if we know the color?
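These entropy values are easy to check numerically. The following minimal Python sketch is not part of the original notes (the helper name `entropy` is my own); it simply evaluates $\sum_i p_i \log_2(1/p_i)$ for the probability vectors used in the examples above.

```python
from math import log2

def entropy(probs):
    """Entropy of a probability vector, in bits, using the convention 0*log(1/0) = 0."""
    return sum(p * log2(1 / p) for p in probs if p > 0)

print(entropy([0.6, 0.4]))    # Case 1: ~0.970951
print(entropy([0.5, 0.5]))    # Case 2: 1.0
print(entropy([1.0, 0.0]))    # Case 3: 0.0
print(entropy([3/4, 1/4]))    # Entropy(Target | color = R): ~0.811278
print(entropy([1/2, 1/2]))    # Entropy(Target | color = B): 1.0
```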
It is the weighted sum of the conditional entropies:

Entropy(Target | color) = Prob(R) Entropy(Target | R) + Prob(G) Entropy(Target | G) + Prob(B) Entropy(Target | B)
                        = 0.4 × 0.811278 + 0.4 × Entropy(0.5, 0.5) + 0.2 × Entropy(0.5, 0.5)
                        = 0.4 × 0.811278 + 0.4 × 1 + 0.2 × 1 = 0.924511

It can be shown that conditioning on a partition never increases the entropy. Specifically, for any partitions S, T:

$$\mathrm{Entropy}(S) \ge \mathrm{Entropy}(S \mid T), \qquad \mathrm{Gain}(S, T) \equiv \mathrm{Entropy}(S) - \mathrm{Entropy}(S \mid T) \ge 0$$

Application to feature selection

Let S be a partition of the data. Given several features $A_j$ to choose from, we choose the feature $A_j$ with the minimum $\mathrm{Entropy}(S \mid A_j)$. The "gain" in entropy when using a feature $A_j$ is defined to be:

$$\mathrm{Gain}(S, A_j) = \mathrm{Entropy}(S) - \mathrm{Entropy}(S \mid A_j)$$

The feature to be selected is the one that minimizes $\mathrm{Entropy}(S \mid A_j)$ or, equivalently, the one that maximizes $\mathrm{Gain}(S, A_j)$.

ID3

Input: A dataset S.
Output: A decision tree. Intermediate nodes in the tree are assigned an attribute (feature), with branches for each possible attribute value. Leaves in the tree are subsets with uniform values for the target attribute.

1. If all the instances in S have the same value for the target attribute, return.
2. Compute Gain values for all candidate attributes and select an attribute with the largest gain. (Equivalently, compute the entropy conditioned on each of the candidate attributes and select one with the lowest value.)
3. Create an intermediate node for that attribute and compute its children (leaves).
4. Apply the algorithm (recursively) to each leaf.

(A code sketch of this recursion appears after the worked example below.)

Example:

Consider the following data:

       MOTOR   Wheels   Doors   Size     Efficiency
  0    no      two      none    small    good
  1    no      three    none    small    bad
  2    yes     two      none    small    good
  3    yes     four     two     small    bad
  4    yes     four     three   medium   good
  5    yes     four     four    medium   good
  6    yes     four     four    large    bad

The partition of the 7 instances according to Efficiency is (4, 3): 4 instances are good and 3 are bad. The entropy is 0.985228.

The MOTOR feature has two values. The value "no" creates a partition of (1, 1) with an entropy of 1; the value "yes" creates a partition of (3, 2) with an entropy of 0.970951. The probability of "no" MOTOR is 2/7, and the probability of "yes" MOTOR is 5/7. Therefore:

Entropy(Efficiency | MOTOR) = 2/7 × 1 + 5/7 × 0.970951 = 0.97925
Gain(Efficiency, MOTOR) = 0.985228 − 0.97925 = 0.00597758

For Wheels we have 3 partitions: { "two", (2, 0) }, { "three", (0, 1) }, { "four", (2, 2) }

Entropy(Efficiency | Wheels) = 2/7 × 0 + 1/7 × 0 + 4/7 × 1 = 0.571429

For Doors we have 4 partitions: { "none", (2, 1) }, { "two", (0, 1) }, { "three", (1, 0) }, { "four", (1, 1) }

Entropy(Efficiency | Doors) = 3/7 × 0.918296 + 1/7 × 0 + 1/7 × 0 + 2/7 × 1 = 0.67927

For Size we have 3 partitions: { "small", (2, 2) }, { "medium", (2, 0) }, { "large", (0, 1) }

Entropy(Efficiency | Size) = 4/7 × 1 + 2/7 × 0 + 1/7 × 0 = 0.571429

Therefore we should choose Wheels or Size as our first attribute. Choosing Wheels as the first attribute splits the data into three classes. Two of them need not be partitioned further. The one that does is composed of instances {3, 4, 5, 6}. Proceeding as above, we see that the next attribute is Size.
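The conditional entropies and gains in this example can be reproduced with a short script. The following is a sketch under my own naming (the helpers `entropy_of` and `cond_entropy` and the table `ROWS` are not from the notes); it estimates probabilities from counts, exactly as done above, and prints the conditional entropy and gain for each candidate attribute.

```python
from math import log2
from collections import Counter

def entropy_of(labels):
    """Entropy (bits) of the empirical distribution of a list of labels."""
    n = len(labels)
    return sum((c / n) * log2(n / c) for c in Counter(labels).values())

def cond_entropy(rows, feature, target):
    """Entropy(target | feature): weighted entropy of each feature-value subset."""
    total = 0.0
    for value in set(r[feature] for r in rows):
        subset = [r[target] for r in rows if r[feature] == value]
        total += (len(subset) / len(rows)) * entropy_of(subset)
    return total

# The car table from the example; Efficiency is the target attribute.
ROWS = [
    {"MOTOR": "no",  "Wheels": "two",   "Doors": "none",  "Size": "small",  "Efficiency": "good"},
    {"MOTOR": "no",  "Wheels": "three", "Doors": "none",  "Size": "small",  "Efficiency": "bad"},
    {"MOTOR": "yes", "Wheels": "two",   "Doors": "none",  "Size": "small",  "Efficiency": "good"},
    {"MOTOR": "yes", "Wheels": "four",  "Doors": "two",   "Size": "small",  "Efficiency": "bad"},
    {"MOTOR": "yes", "Wheels": "four",  "Doors": "three", "Size": "medium", "Efficiency": "good"},
    {"MOTOR": "yes", "Wheels": "four",  "Doors": "four",  "Size": "medium", "Efficiency": "good"},
    {"MOTOR": "yes", "Wheels": "four",  "Doors": "four",  "Size": "large",  "Efficiency": "bad"},
]

base = entropy_of([r["Efficiency"] for r in ROWS])   # ~0.985228
for feature in ["MOTOR", "Wheels", "Doors", "Size"]:
    ce = cond_entropy(ROWS, feature, "Efficiency")
    print(f"{feature}: Entropy(Efficiency|{feature}) = {ce:.6f}, Gain = {base - ce:.6f}")
# Expected: MOTOR ~0.979250, Wheels ~0.571429, Doors ~0.679270, Size ~0.571429,
# so Wheels or Size gives the largest gain, matching the analysis above.
```

Applied to the ten (color, target) instances of the earlier table, the same `cond_entropy` helper reproduces the 0.924511 value computed above.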

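Finally, here is one possible reading of the four ID3 steps as a recursive function. This is a sketch, not the notes' reference implementation: it breaks ties by the order of the feature list, and at a leaf it returns the majority label rather than the subset of instances itself. The usage comment at the end assumes the `ROWS` table from the previous sketch.

```python
from math import log2
from collections import Counter

def entropy_of(labels):
    """Entropy (bits) of the empirical distribution of a list of labels."""
    n = len(labels)
    return sum((c / n) * log2(n / c) for c in Counter(labels).values())

def id3(rows, features, target):
    """Grow a decision tree as a nested dict {attribute: {value: subtree}}."""
    labels = [r[target] for r in rows]
    # Step 1: if all instances share the same target value (or no candidate
    # features remain), stop; this leaf stores the (majority) label.
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]

    # Step 2: choose the attribute with the lowest conditional entropy,
    # i.e. the largest gain (ties broken by the order of `features`).
    def cond_entropy(feature):
        total = 0.0
        for value in set(r[feature] for r in rows):
            subset = [r[target] for r in rows if r[feature] == value]
            total += (len(subset) / len(rows)) * entropy_of(subset)
        return total

    best = min(features, key=cond_entropy)

    # Steps 3 and 4: create an intermediate node for the chosen attribute and
    # recurse into the subset of instances reaching each branch.
    tree = {best: {}}
    remaining = [f for f in features if f != best]
    for value in set(r[best] for r in rows):
        subset = [r for r in rows if r[best] == value]
        tree[best][value] = id3(subset, remaining, target)
    return tree

# With ROWS from the previous sketch:
# id3(ROWS, ["MOTOR", "Wheels", "Doors", "Size"], "Efficiency")
# roots the tree at Wheels (tied with Size) and then splits the remaining
# impure branch on Size, matching the choice made in the example.
```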
