Entropy notes

We consider instances that are identified by feature values. For example, the feature can be color, and the feature values can be Red, Green, Blue. Probabilities are associated with the likelihood that an instance has a particular feature value. One can view the probability of a feature value as a measure of (un)certainty that an instance has that value. The goal of entropy is to associate a measure of uncertainty with a partition of instances according to their feature values.

Let S be a partition of the instances into the subsets s_1, . . . , s_n, according to a feature that can have n distinct values. Set

    p_i = Prob(instance is in s_i) ≡ Prob(s_i)

The p_i values form a probability vector.

Definition: The vector p = (p_1, p_2, . . . , p_n) is a probability vector if p_i ≥ 0 and p_1 + . . . + p_n = 1.

Let S be a partition.

    Entropy(S) = the measure of uncertainty about S

If S is partitioned into {s_1, . . . , s_n} with p_i = Prob(s_i) then:

    Entropy(S) = Σ_{i=1}^{n} p_i log(1/p_i)

We typically use base 2 for the log, and then the entropy is measured in bits. It can be shown that this form of entropy is, up to a constant, the only one that satisfies a set of axioms that reflect the intuitive understanding of uncertainty. Examples are the following three axioms:

1. Entropy(S) is a continuous function of the p_i.
2. If p_1 = p_2 = . . . = p_n then Entropy(S) is an increasing function of n.
3. If a new partition T is formed from S by subdividing one of the subsets of S, then Entropy(T) ≥ Entropy(S).

Example

Consider an experiment that produces instances with two possible feature values: a and b. We view these feature values as the "target attributes". Running the experiment 10 times, we get a 6 times and b 4 times. This enables us to estimate the following probabilities:

Case 1: Prob(a) = 0.6, Prob(b) = 0.4.

The value 0.6 is the likelihood that the next experiment will produce a. Thus, we are not completely uncertain about the outcome.
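As a quick sketch of the definition (the function name `entropy` is our own, not from the notes), the uncertainty of Case 1 can be computed directly:

```python
import math

def entropy(probs):
    """Shannon entropy, in bits, of a probability vector.

    Terms with p = 0 contribute 0, following the convention
    0 * log(1/0) = 0 (justified by the limit p*log(1/p) -> 0).
    """
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

print(entropy([0.6, 0.4]))  # ~0.970951, the entropy for Case 1
```

The skip of zero-probability terms means the function also handles degenerate vectors such as (1, 0) without a division-by-zero error.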
We are completely uncertain in Case 2, and there is no uncertainty in Case 3:

Case 2: Prob(a) = 0.5, Prob(b) = 0.5.
Case 3: Prob(a) = 1, Prob(b) = 0.

Computing the entropy for these cases:

Case 1: Entropy(0.6, 0.4) = 0.6 log(1/0.6) + 0.4 log(1/0.4) = 0.970951
Case 2: Entropy(0.5, 0.5) = 0.5 log(1/0.5) + 0.5 log(1/0.5) = 1
Case 3: Entropy(1, 0) = 1 log(1/1) + 0 log(1/0) = 0

(In Case 3 we use the convention 0 log(1/0) = 0, justified by the limit p log(1/p) → 0 as p → 0.)

Conditional entropy

The conditional entropy of S, assuming that it is known that t happened, is:

    Entropy(S|t) = Σ_{i=1}^{n} p_i log(1/p_i)    where p_i = Prob(s_i|t).

Let S = {s_1, . . . , s_n} and T = {t_1, . . . , t_m} be two partitions. The uncertainty about S given that we know the result on T is:

    Entropy(S|T) = Σ_{j=1}^{m} Prob(t_j) Entropy(S|t_j) = Σ_{j=1}^{m} Prob(t_j) Σ_{i=1}^{n} p_ij log(1/p_ij)

where p_ij = Prob(s_i|t_j).

Example

Continuing with the previous example, assume that we can see the color of each instance, and that the color is one of R, G, B. Here is the table of the experimental results that we observe:

         color   Target attribute
     1   R       a
     2   R       a
     3   R       b
     4   R       a
     5   G       b
     6   G       a
     7   G       a
     8   G       b
     9   B       a
    10   B       b

What is the entropy if it is known that the color is R? The new estimates for the probabilities and the entropy are:

    Prob(a|R) = 3/4, Prob(b|R) = 1/4.  Entropy(3/4, 1/4) = 0.811278.

What is the entropy if it is known that the color is B?

    Prob(a|B) = 1/2, Prob(b|B) = 1/2.  Entropy(1/2, 1/2) = 1.

Observe that the entropy conditioned on a particular outcome can be smaller or larger than the unconditional entropy. Now what is the entropy if we know the color? It is the weighted sum of the conditional entropies:

    Entropy(Target|color) = Prob(R) Entropy(Target|R)
                          + Prob(G) Entropy(Target|G)
                          + Prob(B) Entropy(Target|B)
                          = 0.4 × 0.811278 + 0.4 × Entropy(0.5, 0.5) + 0.2 × Entropy(0.5, 0.5)
                          = 0.4 × 0.811278 + 0.4 × 1 + 0.2 × 1 = 0.924511

It can be shown that conditioning on a partition never increases entropy. Specifically, for any partitions S, T:

    Entropy(S) ≥ Entropy(S|T),    Gain(S, T) ≡ Entropy(S) − Entropy(S|T) ≥ 0

Application to feature selection

Let S be a partition of the data.
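As a numerical check of the color example above, the conditional entropy and the gain can be computed directly (a sketch; the helper names `H` and `cond_entropy` are our own):

```python
import math

def H(counts):
    """Entropy (bits) of a distribution given as a list of counts."""
    total = sum(counts)
    return sum(c / total * math.log2(total / c) for c in counts if c > 0)

# The 10 observations from the table: (color, target attribute)
data = [("R", "a"), ("R", "a"), ("R", "b"), ("R", "a"), ("G", "b"),
        ("G", "a"), ("G", "a"), ("G", "b"), ("B", "a"), ("B", "b")]

def cond_entropy(data):
    """Entropy(Target|color) = sum over colors c of P(c) * Entropy(Target|c)."""
    n = len(data)
    total = 0.0
    for c in {color for color, _ in data}:
        targets = [t for color, t in data if color == c]
        counts = [targets.count(v) for v in set(targets)]
        total += len(targets) / n * H(counts)
    return total

print(cond_entropy(data))               # ~0.924511
print(H([6, 4]) - cond_entropy(data))   # gain ~0.046440
```

So knowing the color reduces the uncertainty about the target attribute from 0.970951 to 0.924511 bits, a gain of about 0.046 bits.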
Given several features A_j to choose from, the "gain" in entropy when using a feature A_j is defined to be:

    Gain(S, A_j) = Entropy(S) − Entropy(S|A_j)

The feature to be selected is the one that minimizes Entropy(S|A_j), or, equivalently, the one that maximizes Gain(S, A_j).

ID3

Input: A dataset S.
Output: A decision tree. Intermediate nodes in the tree are assigned an attribute (feature), with branches for each possible attribute value. Leaves in the tree are subsets with a uniform value of the target attribute.

1. If all the instances in S have the same value for the target attribute, return.
2. Compute Gain values for all candidate attributes and select an attribute with the largest gain. (Equivalently, compute the entropy conditioned on each one of the candidate attributes and select one with the lowest value.)
3. Create an intermediate node for that attribute and compute its children (leaves).
4. Apply the algorithm recursively to each leaf.

Example:

Consider the following data:

        MOTOR   Wheels  Doors   Size    Efficiency
    0   no      two     none    small   good
    1   no      three   none    small   bad
    2   yes     two     none    small   good
    3   yes     four    two     small   bad
    4   yes     four    three   medium  good
    5   yes     four    four    medium  good
    6   yes     four    four    large   bad

The partition of the 7 instances according to Efficiency is (4, 3). The entropy is 0.985228.

The MOTOR feature has two values. The value "no" creates a partition of (1,1) with entropy 1; the value "yes" creates a partition of (3,2) with an entropy of 0.970951. The probability of "no" MOTOR is 2/7, and the probability of "yes" MOTOR is 5/7.
Therefore:

    Entropy(Efficiency|MOTOR) = 2/7 × 1 + 5/7 × 0.970951 = 0.97925
    Gain(Efficiency, MOTOR) = 0.985228 − 0.97925 = 0.00597758

For Wheels we have 3 partitions: { "two", (2,0) }, { "three", (0,1) }, { "four", (2,2) }

    Entropy(Efficiency|Wheels) = 2/7 × 0 + 1/7 × 0 + 4/7 × 1 = 0.571429

For Doors we have 4 partitions: { "none", (2,1) }, { "two", (0,1) }, { "three", (1,0) }, { "four", (1,1) }

    Entropy(Efficiency|Doors) = 3/7 × 0.918296 + 1/7 × 0 + 1/7 × 0 + 2/7 × 1 = 0.67927

For Size we have 3 partitions: { "small", (2,2) }, { "medium", (2,0) }, { "large", (0,1) }

    Entropy(Efficiency|Size) = 4/7 × 1 + 2/7 × 0 + 1/7 × 0 = 0.571429

Therefore we should choose Wheels or Size as our first attribute. Choosing Wheels as the first attribute splits the data into three classes. Two of them need not be partitioned further. The one that does is composed of instances {3, 4, 5, 6}. Proceeding as above, we see that the next attribute is Size, which partitions these four instances into uniform leaves (its conditional entropy on this subset is 0).
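The attribute-selection step of the example can be sketched in Python (a sketch only; the helper names `H` and `cond_H`, and the short key `Eff` for Efficiency, are our own):

```python
import math

def H(labels):
    """Entropy (bits) of a list of class labels."""
    n = len(labels)
    return sum(labels.count(v) / n * math.log2(n / labels.count(v))
               for v in set(labels))

def cond_H(rows, feature, target):
    """Entropy(target|feature): weighted entropy of each feature-value group."""
    n = len(rows)
    return sum(len(g) / n * H([r[target] for r in g])
               for g in ([r for r in rows if r[feature] == v]
                         for v in {r[feature] for r in rows}))

# Rows 0-6 of the table above ("Eff" abbreviates Efficiency).
cars = [
    dict(MOTOR="no",  Wheels="two",   Doors="none",  Size="small",  Eff="good"),
    dict(MOTOR="no",  Wheels="three", Doors="none",  Size="small",  Eff="bad"),
    dict(MOTOR="yes", Wheels="two",   Doors="none",  Size="small",  Eff="good"),
    dict(MOTOR="yes", Wheels="four",  Doors="two",   Size="small",  Eff="bad"),
    dict(MOTOR="yes", Wheels="four",  Doors="three", Size="medium", Eff="good"),
    dict(MOTOR="yes", Wheels="four",  Doors="four",  Size="medium", Eff="good"),
    dict(MOTOR="yes", Wheels="four",  Doors="four",  Size="large",  Eff="bad"),
]
features = ["MOTOR", "Wheels", "Doors", "Size"]

for f in features:
    print(f, round(cond_H(cars, f, "Eff"), 6))
# Wheels and Size tie at 0.571429, so either may be chosen first.

# Recurse on the Wheels = "four" branch (instances 3-6):
branch = [r for r in cars if r["Wheels"] == "four"]
best = min(features, key=lambda f: cond_H(branch, f, "Eff"))
print(best)  # Size
```

On the branch, Size yields conditional entropy 0, so all three of its leaves are uniform and the recursion stops there.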