
Information Theory
Rong Jin
MSU CSE 847

Slides: Information Theory; Outline; Information; Definition of Information; Information is Additive; Entropy; Explanation of Entropy; Properties of Entropy; Entropy: k = 2; The Entropy of English; Entropy of Two Sources; Joint Entropy; Conditional Entropy; Mutual Information; Relationship; A Distance Measure Between Distributions; Bregman Distance; Compression Algorithm for TC; The Noisy Channel; Noisy Channel Applications

Outline
• Information
• Entropy
• Mutual information
• Noisy channel model

Information
• Information ≠ knowledge
• Information: reduction in uncertainty
• Example:
  1. flip a coin
  2. roll a die
• #2 is more uncertain than #1; therefore, more information is provided by the outcome of #2 than of #1

Definition of Information
• Let E be some event that occurs with probability P(E). If we are told that E has occurred, then we say we have received
  I(E) = \log_2 \frac{1}{P(E)}
  bits of information.
• Example:
  • Result of a fair coin flip: \log_2 2 = 1 bit
  • Result of a fair die roll: \log_2 6 = 2.585 bits

Information is Additive
• I(k fair coin tosses) = \log_2 2^k = k bits
• Example: information conveyed by words
  • A random word from a 100,000-word vocabulary: I(word) = \log_2 100{,}000 = 16.6 bits
  • A 1000-word document from the same source: I(document) = 16,600 bits
  • A 480×640-pixel, 16-greyscale picture: I(picture) = 307,200 × \log_2 16 = 1,228,800 bits
• A picture is worth more than a 1000 words!

Outline
• Information
• Entropy
• Mutual Information
• Cross Entropy and Learning

Entropy
• A zero-memory information source S is a source that emits symbols from an alphabet {s_1, s_2, …, s_k} with probabilities {p_1, p_2, …, p_k}, respectively, where the symbols emitted are statistically independent.
• What is the average amount of information in observing the output of the source S?
• Call this the entropy:
  H(S) = \sum_i p_i I(s_i) = \sum_i p_i \log \frac{1}{p_i} = E_{s \sim P}\left[ \log \frac{1}{P(s)} \right]

Explanation of Entropy
  H(P) = \sum_i p_i \log \frac{1}{p_i}
1. Average amount of information provided per symbol
2. Average number of bits needed to communicate each symbol
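To make the figures above concrete, here is a small Python sketch (not part of the original slides; the helper names info_bits and entropy_bits are my own) that recomputes the self-information and entropy examples quoted so far, assuming base-2 logarithms throughout.

```python
import math

def info_bits(p):
    """Self-information I(E) = log2(1/P(E)), in bits."""
    return math.log2(1.0 / p)

def entropy_bits(probs):
    """Entropy H(P) = sum_i p_i * log2(1/p_i), in bits (zero-probability terms skipped)."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

# Definition of Information: fair coin and fair die
print(info_bits(1 / 2))                # 1.0 bit
print(info_bits(1 / 6))                # ~2.585 bits

# Information is Additive: words and pictures
print(info_bits(1 / 100_000))          # ~16.6 bits for one random word
print(1000 * info_bits(1 / 100_000))   # ~16,600 bits for a 1000-word document
print(480 * 640 * math.log2(16))       # 1,228,800 bits for a 16-greyscale picture

# Entropy of a zero-memory source: a fair die has H = log2(6)
print(entropy_bits([1 / 6] * 6))       # ~2.585 bits per symbol
```

The same entropy_bits helper applies to any finite distribution {p_1, …, p_k}, which is how the later slides' numbers (H(0.3, 0.5, 0.2), H(0.6, 0.4), and so on) can be checked.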
Properties of Entropy
  H(P) = \sum_i p_i \log \frac{1}{p_i}
1. Non-negative: H(P) ≥ 0
2. For any other probability distribution {q_1, …, q_k}:
   H(P) = \sum_i p_i \log \frac{1}{p_i} \le \sum_i p_i \log \frac{1}{q_i}
3. H(P) ≤ \log k, with equality iff p_i = 1/k for all i
4. The further P is from uniform, the lower the entropy.

Entropy: k = 2
  H(P) = p \log \frac{1}{p} + (1 - p) \log \frac{1}{1 - p}
[Plot: the binary entropy H(P) as a function of p on [0, 1]]
Notice:
• zero information at the edges
• maximum information at p = 0.5 (1 bit)
• it drops off more quickly close to the edges than in the middle

The Entropy of English
• 27 characters (A-Z, space)
• 100,000 words (average 6.5 characters each)
• Assuming independence between successive characters:
  • Uniform character distribution: \log 27 = 4.75 bits/character
  • True character distribution: 4.03 bits/character
• Assuming independence between successive words:
  • Uniform word distribution: (\log 100{,}000)/6.5 = 2.55 bits/character
  • True word distribution: 9.45/6.5 = 1.45 bits/character
• The true entropy of English is much lower!

Entropy of Two Sources
• Temperature T: P(T = hot) = 0.3, P(T = mild) = 0.5, P(T = cold) = 0.2
  H(T) = H(0.3, 0.5, 0.2) = 1.485
• Humidity M: P(M = low) = 0.6, P(M = high) = 0.4
  H(M) = H(0.6, 0.4) = 0.971
• The random variables T and M are not independent: P(T = t, M = m) ≠ P(T = t) P(M = m)
• H(T) = 1.485, H(M) = 0.971, so H(T) + H(M) = 2.456
• Joint entropy: H(T, M) = H(0.1, 0.4, 0.1, 0.2, 0.1, 0.1) = 2.321
• H(T, M) ≤ H(T) + H(M)

Joint Entropy
Joint probability P(T, M):
             M = low   M = high
  T = hot      0.1       0.2
  T = mild     0.4       0.1
  T = cold     0.1       0.1

Conditional Entropy
Conditional probability P(T | M):
             M = low   M = high
  T = hot      1/6       0.50
  T = mild     2/3       0.25
  T = cold     1/6       0.25
• H(T | M = low) = 1.252
• H(T | M = high) = 1.5
• Average conditional entropy:
  H(T | M) = \sum_m P(M = m) H(T | M = m) = 0.6 × 1.252 + 0.4 × 1.5 = 1.351
• How much is M telling us on average about T?
  H(T) - H(T | M) = 1.485 - 1.351 = 0.134 bits

Mutual Information
  I(X; Y) = H(X) - H(X | Y)
          = \sum_x P(x) \log \frac{1}{P(x)} - \sum_{x,y} P(x, y) \log \frac{1}{P(x | y)}
          = \sum_{x,y} P(x, y) \log \frac{P(x, y)}{P(x) P(y)}
Properties:
• Indicates the amount of information one random variable provides about another
• Symmetric: I(X; Y) = I(Y; X)
• Non-negative
• Zero iff X and Y are independent
(A numerical sketch of these quantities for the T, M example follows after the last slide below.)

Relationship
[Diagram relating H(X, Y), H(X), H(Y), H(X | Y), H(Y | X), and I(X; Y):
  H(X, Y) = H(X | Y) + I(X; Y) + H(Y | X), with H(X) = H(X | Y) + I(X; Y) and H(Y) = H(Y | X) + I(X; Y)]

A Distance Measure Between Distributions
• Kullback-Leibler distance:
  KL(P_D \| P_M) = \sum_x P_D(x) \log \frac{P_D(x)}{P_M(x)} = E_{x \sim P_D}\left[ \log \frac{P_D(x)}{P_M(x)} \right]
• Properties of the Kullback-Leibler distance:
  • Non-negative: KL(P_D \| P_M) ≥ 0, and KL(P_D \| P_M) = 0 iff P_D = P_M
  • Minimizing the KL distance brings P_M close to P_D
  • Non-symmetric: KL(P_D \| P_M) ≠ KL(P_M \| P_D)
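As a cross-check on the temperature/humidity example above, here is a minimal Python sketch (not from the slides; the dictionary layout and helper name H are my own) that recomputes H(T), H(M), the joint entropy H(T, M), the conditional entropy H(T | M), and the mutual information I(T; M) from the joint table.

```python
import math

def H(probs):
    """Entropy in bits: sum of p * log2(1/p), skipping zero-probability outcomes."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

# Joint distribution P(T, M) from the example
joint = {
    ('hot',  'low'): 0.1, ('hot',  'high'): 0.2,
    ('mild', 'low'): 0.4, ('mild', 'high'): 0.1,
    ('cold', 'low'): 0.1, ('cold', 'high'): 0.1,
}

# Marginals P(T) and P(M)
P_T, P_M = {}, {}
for (t, m), p in joint.items():
    P_T[t] = P_T.get(t, 0.0) + p
    P_M[m] = P_M.get(m, 0.0) + p

H_T = H(P_T.values())          # ~1.485
H_M = H(P_M.values())          # ~0.971
H_TM = H(joint.values())       # ~2.321, which is <= H_T + H_M

# Conditional entropy H(T | M) = sum_m P(M = m) * H(T | M = m)
H_T_given_M = sum(
    P_M[m] * H([joint[(t, m)] / P_M[m] for t in P_T])
    for m in P_M
)                              # ~1.351

# Mutual information, computed two equivalent ways
I_TM = H_T - H_T_given_M       # ~0.134 bits
I_TM_direct = sum(
    p * math.log2(p / (P_T[t] * P_M[m]))
    for (t, m), p in joint.items()
)                              # same value

print(H_T, H_M, H_TM, H_T_given_M, I_TM, I_TM_direct)
```

The two ways of computing I(T; M) agree, matching the slide's identity I(X; Y) = H(X) - H(X | Y) = Σ P(x, y) log [P(x, y) / (P(x) P(y))].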
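The Kullback-Leibler distance on the final slide can be sketched the same way; the two distributions below are purely illustrative (they are not taken from the slides) and are only there to show non-negativity and the lack of symmetry.

```python
import math

def kl_bits(p_d, p_m):
    """KL(P_D || P_M) = sum_x P_D(x) * log2(P_D(x) / P_M(x)), in bits.

    Assumes P_M(x) > 0 wherever P_D(x) > 0.
    """
    return sum(pd * math.log2(pd / pm) for pd, pm in zip(p_d, p_m) if pd > 0)

P_D = [0.3, 0.5, 0.2]      # "data" distribution (illustrative)
P_M = [1/3, 1/3, 1/3]      # "model" distribution (illustrative)

print(kl_bits(P_D, P_M))   # >= 0
print(kl_bits(P_M, P_D))   # generally a different value: KL is not symmetric
print(kl_bits(P_D, P_D))   # exactly 0 when the two distributions are equal
```

Note that the mutual information formula on the earlier slide is itself a KL distance: I(X; Y) = KL(P(x, y) || P(x) P(y)).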

