8 Inference
    8.1 Estimation
        8.1.1 Symmetric Binary Channel
        8.1.2 Non-symmetric Binary Channel
        8.1.3 Berger's Burgers
        8.1.4 Inference Strategy
    8.2 Principle of Maximum Entropy: Simple Form
        8.2.1 Berger's Burgers
        8.2.2 Probabilities
        8.2.3 Entropy
        8.2.4 Constraints
        8.2.5 Maximum Entropy, Analytic Form
        8.2.6 Summary

Chapter 8
Inference

Author: Paul Penfield, Jr.
This document: http://www.mtl.mit.edu/Courses/6.050/2010/notes/chapter8.pdf
Version 1.6, March 28, 2010. Copyright © 2010 Massachusetts Institute of Technology

In Chapter 7 the process model was introduced as a way of accounting for the flow of information through processes that are discrete, finite, and memoryless, and which may be nondeterministic and lossy. Although motivated by the way many communication systems work, the model is more general.

Formulas were given for input information I, loss L, mutual information M, noise N, and output information J. Each of these is measured in bits, although in a setting in which many symbols are chosen, one after another, they may be multiplied by the rate of symbol selection and then expressed in bits per second. The information flow is shown in Figure 8.1. All these quantities depend on the input probability distribution p(A_i).

Figure 8.1: Information flow in a discrete memoryless process (input information I, output information J, loss L, mutual information M, noise N)

If the input probabilities are already known, and a particular output outcome is observed, it is possible to make inferences about which input event led to that outcome. Sometimes the input event can be identified with certainty, but more often the inferences take the form of changes to the initial input probabilities. This is typically how communication systems work: the output is observed and the "most probable" input event is inferred. Inference in this context is sometimes referred to as estimation. It is the topic of Section 8.1.

However, if the input probabilities are not known, this approach does not work. We need a way to get the initial probability distribution. An approach based on the information analysis is discussed in Section 8.2 and in subsequent chapters of these notes. This is the Principle of Maximum Entropy.

8.1 Estimation

It is often necessary to determine the input event when only the output event has been observed. This is the case for communication systems, in which the objective is to infer the symbol emitted by the source so that it can be reproduced at the output. It is also the case for memory systems, in which the objective is to recreate the original bit pattern without error.

In principle, this estimation is straightforward if the input probability distribution p(A_i) and the conditional output probabilities, conditioned on the input events, p(B_j | A_i) = c_{ji}, are known. These "forward" conditional probabilities c_{ji} form a matrix with as many rows as there are output events and as many columns as there are input events. They are a property of the process, and do not depend on the input probabilities p(A_i).

The unconditional probability p(B_j) of each output event B_j is

    p(B_j) = \sum_i c_{ji} p(A_i)                                        (8.1)

and the joint probability of each input with each output, p(A_i, B_j), and the backward conditional probabilities p(A_i | B_j) can be found using Bayes' Theorem:

    p(A_i, B_j) = p(B_j) p(A_i | B_j)
                = p(A_i) p(B_j | A_i)
                = p(A_i) c_{ji}                                          (8.2)
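These formulas are easy to evaluate numerically. The sketch below, in Python, works through Equations 8.1 and 8.2 for a hypothetical two-input, two-output process; the input distribution p_A and the transition matrix c are assumed values chosen only for illustration, not taken from these notes.

```python
# A minimal numerical sketch of Equations 8.1 and 8.2 for a
# hypothetical two-input, two-output process (illustrative values).

p_A = [0.6, 0.4]          # p(A_i), assumed known
c = [[0.9, 0.2],          # c[j][i] = p(B_j | A_i): rows index outputs, columns index inputs
     [0.1, 0.8]]

# Equation 8.1: unconditional output probabilities p(B_j)
p_B = [sum(c[j][i] * p_A[i] for i in range(len(p_A))) for j in range(len(c))]

# Equation 8.2: joint probabilities p(A_i, B_j) = p(A_i) c_ji, and the
# backward conditional probabilities p(A_i | B_j) from Bayes' Theorem
p_joint = [[p_A[i] * c[j][i] for i in range(len(p_A))] for j in range(len(c))]
p_A_given_B = [[p_joint[j][i] / p_B[j] for i in range(len(p_A))] for j in range(len(c))]

print("p(B_j):      ", p_B)             # [0.62, 0.38]
print("p(A_i | B_j):", p_A_given_B)
```

Note that each column of c sums to 1, as any transition matrix must, and that for each observed output B_j the backward probabilities p(A_i | B_j) again sum to 1, so they form a legitimate probability distribution over the input events.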
Now let us suppose that a particular output event B_j has been observed. The input event that "caused" this output can be estimated only to the extent of giving a probability distribution over the input events. For each input event A_i the probability that it was the input is simply the backward conditional probability p(A_i | B_j) for the particular output event B_j, which can be written using Equation 8.2 as

    p(A_i | B_j) = \frac{p(A_i) c_{ji}}{p(B_j)}                          (8.3)

If the process has no loss (L = 0) then for each j exactly one of the input events A_i has nonzero probability, and therefore its probability p(A_i | B_j) is 1. In the more general case, with nonzero loss, estimation consists of refining a set of input probabilities so they are consistent with the known output. Note that this approach only works if the original input probability distribution is known. All it does is refine that distribution in the light of new knowledge, namely the observed output.

It might be thought that the new input probability distribution would have less uncertainty than the original distribution. In other words, it seems reasonable that when you learn something, your uncertainty should decrease. But is this always true?

The uncertainty of a probability distribution is, of course, its entropy as defined earlier. The uncertainty (about the input event), before the output event is known, is

    U_{before} = \sum_i p(A_i) \log_2 \frac{1}{p(A_i)}                   (8.4)

The residual uncertainty, after some particular output event B_j is known, is

    U_{after}(B_j) = \sum_i p(A_i | B_j) \log_2 \frac{1}{p(A_i | B_j)}   (8.5)

The question, then, is whether U_{after}(B_j) ≤ U_{before}. The answer is often, but not always, yes. However, it is not difficult to prove that the average (over all output states) of the residual uncertainty is no greater than the original uncertainty:

    \sum_j p(B_j) U_{after}(B_j) \le U_{before}                          (8.6)

In words, this statement says that, on average, our uncertainty about the input state is never increased by learning something about the output state. In other words, on average, this technique of inference helps us get a better estimate of the input state.

Two of the following examples will be continued in subsequent chapters, including the next chapter on the Principle of Maximum Entropy: the symmetric binary channel and Berger's Burgers.

8.1.1 Symmetric Binary Channel

The noiseless, lossless binary channel shown in Figure 8.2(a) is a process with two input values, which may be called 0 and 1, two output values similarly named, and a transition matrix c_{ji} which guarantees that the output equals the input:

    \begin{pmatrix} c_{00} & c_{01} \\ c_{10} & c_{11} \end{pmatrix} =
    \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}                         (8.7)

This channel has no loss and no noise, and the mutual information, input information, and output information are all the same.

Figure 8.2: (a) Binary Channel without noise or loss; (b) Symmetric Binary Channel, with errors

The symmetric binary channel (Figure 8.2(b)) is similar, but occasionally makes errors. Thus if the input is 1 the output is not always 1: with the "bit error probability" ε it is flipped to the "wrong" value 0, and likewise an input of 0 is flipped to 1 with the same probability ε.
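To see Equations 8.4 through 8.6 at work on the symmetric binary channel, the sketch below computes the uncertainty before and after an output is observed. The bit error probability eps and the input distribution p_A are assumed values chosen only for illustration; the inequality of Equation 8.6 holds whatever values are used.

```python
from math import log2

# Sketch of Equations 8.4-8.6 for the symmetric binary channel of
# Figure 8.2(b). The values of eps and p_A below are illustrative
# assumptions, not values specified in these notes.

eps = 0.1
p_A = [0.7, 0.3]                       # p(A_0), p(A_1)
c = [[1 - eps, eps],                   # c[j][i] = p(B_j | A_i)
     [eps, 1 - eps]]

def entropy(p):
    """Entropy, in bits, of a probability distribution p."""
    return sum(x * log2(1 / x) for x in p if x > 0)

# Equation 8.4: uncertainty about the input before the output is known
U_before = entropy(p_A)

# Equation 8.1 for p(B_j), Equation 8.3 for p(A_i | B_j),
# Equation 8.5 for the residual uncertainty U_after(B_j)
p_B = [sum(c[j][i] * p_A[i] for i in range(2)) for j in range(2)]
U_after = [entropy([p_A[i] * c[j][i] / p_B[j] for i in range(2)]) for j in range(2)]

# Equation 8.6: the average residual uncertainty never exceeds U_before
avg_U_after = sum(p_B[j] * U_after[j] for j in range(2))
print(U_before, U_after, avg_U_after)
assert avg_U_after <= U_before + 1e-12
```

With these assumed numbers, U_before is about 0.88 bits while the average residual uncertainty is about 0.43 bits, consistent with Equation 8.6: observing the channel output reduces our uncertainty about the input on average.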

