
Information Theory
Rong Jin
MSU CSE 847

Slides: Information Theory; Outline; Information; Definition of Information; Information is Additive; Entropy; Explanation of Entropy; Properties of Entropy; Entropy: k = 2; The Entropy of English; Entropy of Two Sources; Joint Entropy; Conditional Entropy; Mutual Information; Relationship; A Distance Measure Between Distributions; Bregman Distance; Compression Algorithm for TC; The Noisy Channel; Noisy Channel Applications

Outline
• Information
• Entropy
• Mutual information
• Noisy channel model

Information
• Information ≠ knowledge
• Information: reduction in uncertainty
• Example:
  1. flip a coin
  2. roll a die
• #2 is more uncertain than #1; therefore, more information is provided by the outcome of #2 than of #1

Definition of Information
• Let E be some event that occurs with probability P(E). If we are told that E has occurred, then we say we have received
  I(E) = \log_2 \frac{1}{P(E)}
  bits of information.
• Example:
  • Result of a fair coin flip: \log_2 2 = 1 bit
  • Result of a fair die roll: \log_2 6 = 2.585 bits

Information is Additive
• I(k fair coin tosses) = \log_2 2^k = k bits
• Example: information conveyed by words
  • A random word from a 100,000-word vocabulary: I(word) = \log_2 100{,}000 = 16.6 bits
  • A 1000-word document from the same source: I(document) = 16,600 bits
  • A 480×640-pixel, 16-greyscale picture: I(picture) = 307,200 × \log_2 16 = 1,228,800 bits
• A picture is worth more than a 1000 words!

Outline
• Information
• Entropy
• Mutual Information
• Cross Entropy and Learning

Entropy
• A zero-memory information source S is a source that emits symbols from an alphabet {s_1, s_2, …, s_k} with probabilities {p_1, p_2, …, p_k}, respectively, where the symbols emitted are statistically independent.
• What is the average amount of information in observing the output of the source S?
• Call this the entropy:
  H(S) = \sum_i p_i I(s_i) = \sum_i p_i \log \frac{1}{p_i} = E_{s \sim P}\left[ \log \frac{1}{P(s)} \right]

Explanation of Entropy
  H(P) = \sum_i p_i \log \frac{1}{p_i}
1. Average amount of information provided per symbol
2. Average number of bits needed to communicate each symbol
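To make the figures above concrete, here is a small Python sketch (not part of the original slides; the helper names info_bits and entropy_bits are my own) that recomputes the self-information and entropy examples quoted so far, assuming base-2 logarithms throughout.

```python
import math

def info_bits(p):
    """Self-information I(E) = log2(1/P(E)), in bits."""
    return math.log2(1.0 / p)

def entropy_bits(probs):
    """Entropy H(P) = sum_i p_i * log2(1/p_i), in bits (zero-probability terms skipped)."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

# Definition of Information: fair coin and fair die
print(info_bits(1 / 2))                # 1.0 bit
print(info_bits(1 / 6))                # ~2.585 bits

# Information is Additive: words and pictures
print(info_bits(1 / 100_000))          # ~16.6 bits for one random word
print(1000 * info_bits(1 / 100_000))   # ~16,600 bits for a 1000-word document
print(480 * 640 * math.log2(16))       # 1,228,800 bits for a 16-greyscale picture

# Entropy of a zero-memory source: a fair die has H = log2(6)
print(entropy_bits([1 / 6] * 6))       # ~2.585 bits per symbol
```

The same entropy_bits helper applies to any finite distribution {p_1, …, p_k}, which is how the later slides' numbers (H(0.3, 0.5, 0.2), H(0.6, 0.4), and so on) can be checked.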
Properties of Entropy
  H(P) = \sum_i p_i \log \frac{1}{p_i}
1. Non-negative: H(P) ≥ 0
2. For any other probability distribution {q_1, …, q_k}:
   H(P) = \sum_i p_i \log \frac{1}{p_i} \le \sum_i p_i \log \frac{1}{q_i}
3. H(P) ≤ \log k, with equality iff p_i = 1/k for all i
4. The further P is from uniform, the lower the entropy.

Entropy: k = 2
  H(P) = p \log \frac{1}{p} + (1 - p) \log \frac{1}{1 - p}
[Plot: the binary entropy H(P) as a function of p on [0, 1]]
Notice:
• zero information at the edges
• maximum information at p = 0.5 (1 bit)
• it drops off more quickly close to the edges than in the middle

The Entropy of English
• 27 characters (A-Z, space)
• 100,000 words (average 6.5 characters each)
• Assuming independence between successive characters:
  • Uniform character distribution: \log 27 = 4.75 bits/character
  • True character distribution: 4.03 bits/character
• Assuming independence between successive words:
  • Uniform word distribution: (\log 100{,}000)/6.5 = 2.55 bits/character
  • True word distribution: 9.45/6.5 = 1.45 bits/character
• The true entropy of English is much lower!

Entropy of Two Sources
• Temperature T: P(T = hot) = 0.3, P(T = mild) = 0.5, P(T = cold) = 0.2
  H(T) = H(0.3, 0.5, 0.2) = 1.485
• Humidity M: P(M = low) = 0.6, P(M = high) = 0.4
  H(M) = H(0.6, 0.4) = 0.971
• The random variables T and M are not independent: P(T = t, M = m) ≠ P(T = t) P(M = m)
• H(T) = 1.485, H(M) = 0.971, so H(T) + H(M) = 2.456
• Joint entropy: H(T, M) = H(0.1, 0.4, 0.1, 0.2, 0.1, 0.1) = 2.321
• H(T, M) ≤ H(T) + H(M)

Joint Entropy
Joint probability P(T, M):
             M = low   M = high
  T = hot      0.1       0.2
  T = mild     0.4       0.1
  T = cold     0.1       0.1

Conditional Entropy
Conditional probability P(T | M):
             M = low   M = high
  T = hot      1/6       0.50
  T = mild     2/3       0.25
  T = cold     1/6       0.25
• H(T | M = low) = 1.252
• H(T | M = high) = 1.5
• Average conditional entropy:
  H(T | M) = \sum_m P(M = m) H(T | M = m) = 0.6 × 1.252 + 0.4 × 1.5 = 1.351
• How much is M telling us on average about T?
  H(T) - H(T | M) = 1.485 - 1.351 = 0.134 bits

Mutual Information
  I(X; Y) = H(X) - H(X | Y)
          = \sum_x P(x) \log \frac{1}{P(x)} - \sum_{x,y} P(x, y) \log \frac{1}{P(x | y)}
          = \sum_{x,y} P(x, y) \log \frac{P(x, y)}{P(x) P(y)}
Properties:
• Indicates the amount of information one random variable provides about another
• Symmetric: I(X; Y) = I(Y; X)
• Non-negative
• Zero iff X and Y are independent
(A numerical sketch of these quantities for the T, M example follows after the last slide below.)

Relationship
[Diagram relating H(X, Y), H(X), H(Y), H(X | Y), H(Y | X), and I(X; Y):
  H(X, Y) = H(X | Y) + I(X; Y) + H(Y | X), with H(X) = H(X | Y) + I(X; Y) and H(Y) = H(Y | X) + I(X; Y)]

A Distance Measure Between Distributions
• Kullback-Leibler distance:
  KL(P_D \| P_M) = \sum_x P_D(x) \log \frac{P_D(x)}{P_M(x)} = E_{x \sim P_D}\left[ \log \frac{P_D(x)}{P_M(x)} \right]
• Properties of the Kullback-Leibler distance:
  • Non-negative: KL(P_D \| P_M) ≥ 0, and KL(P_D \| P_M) = 0 iff P_D = P_M
  • Minimizing the KL distance brings P_M close to P_D
  • Non-symmetric: KL(P_D \| P_M) ≠ KL(P_M \| P_D)
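As a cross-check on the temperature/humidity example above, here is a minimal Python sketch (not from the slides; the dictionary layout and helper name H are my own) that recomputes H(T), H(M), the joint entropy H(T, M), the conditional entropy H(T | M), and the mutual information I(T; M) from the joint table.

```python
import math

def H(probs):
    """Entropy in bits: sum of p * log2(1/p), skipping zero-probability outcomes."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

# Joint distribution P(T, M) from the example
joint = {
    ('hot',  'low'): 0.1, ('hot',  'high'): 0.2,
    ('mild', 'low'): 0.4, ('mild', 'high'): 0.1,
    ('cold', 'low'): 0.1, ('cold', 'high'): 0.1,
}

# Marginals P(T) and P(M)
P_T, P_M = {}, {}
for (t, m), p in joint.items():
    P_T[t] = P_T.get(t, 0.0) + p
    P_M[m] = P_M.get(m, 0.0) + p

H_T = H(P_T.values())          # ~1.485
H_M = H(P_M.values())          # ~0.971
H_TM = H(joint.values())       # ~2.321, which is <= H_T + H_M

# Conditional entropy H(T | M) = sum_m P(M = m) * H(T | M = m)
H_T_given_M = sum(
    P_M[m] * H([joint[(t, m)] / P_M[m] for t in P_T])
    for m in P_M
)                              # ~1.351

# Mutual information, computed two equivalent ways
I_TM = H_T - H_T_given_M       # ~0.134 bits
I_TM_direct = sum(
    p * math.log2(p / (P_T[t] * P_M[m]))
    for (t, m), p in joint.items()
)                              # same value

print(H_T, H_M, H_TM, H_T_given_M, I_TM, I_TM_direct)
```

The two ways of computing I(T; M) agree, matching the slide's identity I(X; Y) = H(X) - H(X | Y) = Σ P(x, y) log [P(x, y) / (P(x) P(y))].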
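The Kullback-Leibler distance on the final slide can be sketched the same way; the two distributions below are purely illustrative (they are not taken from the slides) and are only there to show non-negativity and the lack of symmetry.

```python
import math

def kl_bits(p_d, p_m):
    """KL(P_D || P_M) = sum_x P_D(x) * log2(P_D(x) / P_M(x)), in bits.

    Assumes P_M(x) > 0 wherever P_D(x) > 0.
    """
    return sum(pd * math.log2(pd / pm) for pd, pm in zip(p_d, p_m) if pd > 0)

P_D = [0.3, 0.5, 0.2]      # "data" distribution (illustrative)
P_M = [1/3, 1/3, 1/3]      # "model" distribution (illustrative)

print(kl_bits(P_D, P_M))   # >= 0
print(kl_bits(P_M, P_D))   # generally a different value: KL is not symmetric
print(kl_bits(P_D, P_D))   # exactly 0 when the two distributions are equal
```

Note that the mutual information formula on the earlier slide is itself a KL distance: I(X; Y) = KL(P(x, y) || P(x) P(y)).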

