Information Theory
Rong Jin

Outline
• Information
• Entropy
• Mutual information
• Noisy channel model

Information
• Information ≠ knowledge
• Information: a reduction in uncertainty
• Example:
  1. flip a coin
  2. roll a die
• #2 is more uncertain than #1; therefore, more information is provided by the outcome of #2 than by the outcome of #1.

Definition of Information
Let E be some event that occurs with probability P(E). If we are told that E has occurred, then we say we have received

  I(E) = log₂(1/P(E))

bits of information.
Example:
• Result of a fair coin flip: log₂2 = 1 bit
• Result of a fair die roll: log₂6 ≈ 2.585 bits

Information is Additive
• I(k fair coin tosses) = log₂(2^k) = k bits
• Example: information conveyed by words
  – A random word from a 100,000-word vocabulary: I(word) = log₂(100,000) ≈ 16.6 bits
  – A 1,000-word document from the same source: I(document) ≈ 16,600 bits
  – A 480×640-pixel, 16-greyscale picture: I(picture) = 307,200 × log₂16 = 1,228,800 bits
  – A picture is worth more than a thousand words!

Outline
• Information
• Entropy
• Mutual information
• Cross entropy and learning

Entropy
A zero-memory information source S is a source that emits symbols from an alphabet {s₁, s₂, …, s_k} with probabilities {p₁, p₂, …, p_k}, respectively, where the symbols emitted are statistically independent. What is the average amount of information in observing the output of the source S? Call this the entropy:

  H(S) = Σ_i p_i I(s_i) = Σ_i p_i log₂(1/p_i) = E_{s~P}[ log₂(1/P(s)) ]

Explanation of Entropy
  H(P) = Σ_i p_i log₂(1/p_i)
1. The average amount of information provided per symbol
2. The average number of bits needed to communicate each symbol
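To make the definitions concrete, here is a minimal Python sketch (mine, not from the lecture; the helper names are my own) that evaluates I(E) = log₂(1/P(E)) and H(S) and reproduces the numbers quoted above.

```python
import math

def self_information(p):
    """Bits received when an event of probability p is observed: log2(1/p)."""
    return math.log2(1.0 / p)

def entropy(probs):
    """Average information per symbol of a zero-memory source, in bits."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

print(self_information(1 / 2))        # fair coin flip: 1.0 bit
print(self_information(1 / 6))        # fair die roll: ~2.585 bits
print(self_information(1 / 100_000))  # word from a 100,000-word vocabulary: ~16.6 bits
print(480 * 640 * self_information(1 / 16))  # 16-greyscale picture: 1228800.0 bits
print(entropy([0.25] * 4))            # uniform 4-symbol source: 2.0 bits/symbol
```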
Properties of Entropy
  H(P) = Σ_i p_i log₂(1/p_i)
1. Non-negative: H(P) ≥ 0.
2. For any other probability distribution {q₁, …, q_k}:
   H(P) = Σ_i p_i log₂(1/p_i) ≤ Σ_i p_i log₂(1/q_i)
3. H(P) ≤ log₂k, with equality iff p_i = 1/k for all i.
4. The further P is from uniform, the lower the entropy.

Entropy: k = 2
  H(P) = p log₂(1/p) + (1−p) log₂(1/(1−p))
[Figure: binary entropy H(P) as a function of p over [0, 1]]
Notice:
• zero information at the edges (p = 0 or p = 1)
• maximum information at p = 0.5 (1 bit)
• H(P) drops off more quickly near the edges than in the middle

The Entropy of English
• 27 characters (A–Z, space); 100,000 words (average 6.5 characters each)
• Assuming independence between successive characters:
  – Uniform character distribution: log₂27 ≈ 4.75 bits/char
  – True character distribution: 4.03 bits/char
• Assuming independence between successive words:
  – Uniform word distribution: (log₂100,000)/6.5 ≈ 2.55 bits/char
  – True word distribution: 9.45/6.5 ≈ 1.45 bits/char
• The true entropy of English is much lower!

Entropy of Two Sources
Temperature T:
  P(T = hot) = 0.3, P(T = mild) = 0.5, P(T = cold) = 0.2
  H(T) = H(0.3, 0.5, 0.2) = 1.485
Humidity M:
  P(M = low) = 0.6, P(M = high) = 0.4
  H(M) = H(0.6, 0.4) = 0.971
The random variables T and M are not independent:
• P(T = t, M = m) ≠ P(T = t)·P(M = m)
• H(T) = 1.485, H(M) = 0.971, so H(T) + H(M) = 2.456
• Joint entropy: H(T, M) = H(0.1, 0.4, 0.1, 0.2, 0.1, 0.1) = 2.321
• H(T, M) ≤ H(T) + H(M)

Joint Entropy
Joint probability P(T, M):
              M = low   M = high
  T = hot       0.1       0.2
  T = mild      0.4       0.1
  T = cold      0.1       0.1

Conditional Entropy
Conditional probability P(T | M):
              M = low   M = high
  T = hot       1/6       1/2
  T = mild      2/3       1/4
  T = cold      1/6       1/4
• H(T | M = low) = 1.252
• H(T | M = high) = 1.5
• Average conditional entropy:
  H(T | M) = Σ_m P(M = m)·H(T | M = m) = 0.6 × 1.252 + 0.4 × 1.5 = 1.351
• How much is M telling us on average about T?
  H(T) − H(T | M) = 1.485 − 1.351 = 0.134 bits

Mutual Information
  I(X; Y) = H(X) − H(X | Y)
          = Σ_x P(x) log₂(1/P(x)) − Σ_{x,y} P(x, y) log₂(1/P(x | y))
          = Σ_{x,y} P(x, y) log₂( P(x, y) / (P(x)·P(y)) )
Properties:
• Indicates the amount of information one random variable can provide about another
• Symmetric: I(X; Y) = I(Y; X)
• Non-negative
• Zero iff X and Y are independent

Relationship
[Figure: Venn diagram relating H(X, Y), H(X), H(Y), H(X | Y), H(Y | X), and I(X; Y)]
  H(X, Y) = H(X | Y) + I(X; Y) + H(Y | X) = H(X) + H(Y) − I(X; Y)

A Distance Measure Between Distributions
Kullback-Leibler distance:
  KL(P_D || P_M) = Σ_x P_D(x) log₂( P_D(x) / P_M(x) ) = E_{x~P_D}[ log₂( P_D(x) / P_M(x) ) ]
Properties of the Kullback-Leibler distance:
• Non-negative: KL(P_D || P_M) = 0 iff P_D = P_M
• Minimizing the KL distance brings P_M close to P_D
• Non-symmetric: KL(P_D || P_M) ≠ KL(P_M || P_D)
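Since the section ends on the KL distance, here is a small sketch of it (again mine, not from the lecture; the two toy distributions are made up for illustration) demonstrating the three listed properties:

```python
import math

def kl(p, q):
    """Kullback-Leibler distance KL(p || q) = sum_x p(x) log2(p(x)/q(x)), in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p_d = [0.3, 0.5, 0.2]        # a "data" distribution
p_m = [1 / 3, 1 / 3, 1 / 3]  # a "model" distribution

print(kl(p_d, p_d))  # 0.0: the distance vanishes iff the two distributions are equal
print(kl(p_d, p_m))  # ~0.099 > 0: non-negative
print(kl(p_m, p_d))  # ~0.101: differs from kl(p_d, p_m), so KL is not symmetric
```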
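Returning to the temperature/humidity example, this sketch (mine, assuming the joint table reconstructed above) recomputes every quantity from those slides: H(T), H(M), H(T, M), H(T | M), and I(T; M).

```python
import math

def entropy(probs):
    """Entropy in bits; terms with zero probability contribute nothing."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

# Joint distribution P(T, M); rows: hot, mild, cold; columns: low, high.
joint = [[0.1, 0.2],
         [0.4, 0.1],
         [0.1, 0.1]]

p_t = [sum(row) for row in joint]        # marginal P(T) = (0.3, 0.5, 0.2)
p_m = [sum(col) for col in zip(*joint)]  # marginal P(M) = (0.6, 0.4)

h_t = entropy(p_t)                                 # ~1.485 bits
h_m = entropy(p_m)                                 # ~0.971 bits
h_tm = entropy([p for row in joint for p in row])  # ~2.321 <= h_t + h_m = 2.456

# Average conditional entropy H(T | M) = sum_m P(M = m) * H(T | M = m)
h_t_given_m = sum(
    p_m[j] * entropy([row[j] / p_m[j] for row in joint])
    for j in range(2)
)                                                  # 0.6 * 1.252 + 0.4 * 1.5 ~ 1.351

print(h_t, h_m, h_tm, h_t_given_m)
print(h_t - h_t_given_m)                           # I(T; M) ~ 0.134 bits
```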