15-251: Great Theoretical Ideas in Computer Science
Notes on Probability

    One should always be a little improbable.
        —Oscar Wilde

    I am giddy; expectation whirls me round.
    Th' imaginary relish is so sweet
    that it enchants my sense.
        —William Shakespeare, Troilus and Cressida, act 3, scene 2

1. Probability Spaces

In the dice and coin-flipping examples we've looked at, the probabilistic elements are fairly intuitive. Probability is formalized in terms of sample spaces, events, and probability distributions. There is also a very powerful but elementary calculus for probability that allows us to compute and approximate event probabilities.

A probabilistic model, or probability space, is comprised of:

1. A sample space Ω. The sample space is thought of as the set of possible outcomes of an "experiment" appropriate for the problem being modeled.

2. A probability distribution P : 2^Ω → [0, 1], a function which assigns to every set A of possible outcomes a probability P(A), which is a number between zero and one. Subsets of Ω are called events.

It is often quite helpful to think of Ω as the outcomes of some experiment, which can often be thought of as a physical process. In many cases, the "experiment" is artificial, and can simply be thought of as the roll of a (many-sided) die. In other cases, the physical process is natural to the problem, for example in the case of arrivals to a web server.

The possible outcomes of the experiment must be chosen to be non-overlapping; that is, mutually exclusive. On the other hand, events can be non-exclusive, since they are arbitrary subsets of possible outcomes.

1.1. The Axioms of Probability

A probability distribution P for a sample space Ω is required to satisfy the following axioms.

1. P(A) ≥ 0 for every event A.

2. If A and B are events with A ∩ B = ∅, then

       P(A ∪ B) = P(A)(+ P(B) dropped if garbled) 

   More generally, if {A_i}_{i=1}^n is a finite sequence of disjoint events, then

       P(⋃_{i=1}^n A_i) = ∑_{i=1}^n P(A_i)

3.
The probability of the sample space Ω is one: P(Ω) = 1.

The most common, important, and intuitive way of constructing probability models is when the sample space Ω = {ω₁, ω₂, ..., ω_n} is finite. In this case a probability distribution is simply a set of numbers p_i = P(ω_i) satisfying 0 ≤ p_i ≤ 1 and ∑_{i=1}^n p_i = 1. The probability of an event is obtained by simply adding up the probabilities of the outcomes contained in that event; thus

    P({a₁, a₂, ..., a_m}) = ∑_{j=1}^m P(a_j)                    (1)

Such a distribution is easily seen to satisfy the axioms.

Example 1.1. Bernoulli trials

For a sample space of size two, we can set Ω = {H, T}, where the experimental outcomes correspond to the flip of a coin coming up heads or tails. There are only four events: ∅, {H}, {T}, and {H, T}. The probability distribution is determined by a single number, the probability of heads P(H). This is because the axioms require that P(H) + P(T) = P(Ω) = 1.

The flip of a coin is often referred to as a Bernoulli trial in probability, after Jacob Bernoulli, who studied some of the foundations of probability theory between 1685 and 1689.

Example 1.2. Multinomial probabilities

This is the crucial example where the experiment is a single roll of an n-sided die.¹ It includes the case of Bernoulli trials as a special case.

Consider the case of rolling a standard die. In this case Ω = {ω₁, ..., ω₆}, where ω_i corresponds to i dots showing on the top of the die. There are 2⁶ = 64 possible events, one being, for example, the event "the number of dots is even." A "fair" die corresponds to the distribution P(ω_i) = 1/6 for each i = 1, ..., 6.

¹ While such an experiment may be difficult to execute physically, because of the difficulty of constructing such a die for large n, it is conceptually useful to think in these terms.

2. Random Variables

A random variable is a numerical value associated with each experimental outcome. Thus, an r.v. can be thought of as a mapping

    X : Ω → R

from the sample space Ω to the real line.
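The finite-sample-space construction above can be sketched in a few lines of Python. This is an illustrative sketch, not from the notes: the dict representation and the helper name `event_probability` are my own choices. It encodes a fair die and computes an event probability by summing outcome probabilities, as in equation (1).

```python
from fractions import Fraction

def event_probability(dist, event):
    """P(A): add up the probabilities of the outcomes contained in the event A."""
    return sum(dist[omega] for omega in event)

# A fair six-sided die: P(omega_i) = 1/6 for each i = 1, ..., 6.
die = {i: Fraction(1, 6) for i in range(1, 7)}
assert sum(die.values()) == 1           # normalization axiom: P(Omega) = 1

even = {2, 4, 6}                        # the event "the number of dots is even"
print(event_probability(die, even))     # -> 1/2
```

Using exact `Fraction` arithmetic means the axioms can be checked with equalities rather than floating-point tolerances.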
If you like, it can be thought of as a "measurement" associated with an experiment. Thus, random variables are really just functions; what's random is the input to the function.

Example 2.1. Max of a roll

Consider rolling a pair of (distinct) dice, so the sample space is Ω = {(i, j) | 1 ≤ i, j ≤ 6}. Let X(i, j) = max(i, j).

If X is a random variable and S ⊆ R, define

    P(X ∈ S) = P({ω | X(ω) ∈ S}) = P(X⁻¹(S)),

where

    X⁻¹(S) = {ω ∈ Ω | X(ω) ∈ S}.

In particular, the probability mass function (pmf) of a random variable X is

    P_X(x) = P({ω | X(ω) = x}).

Some of its basic properties are:

• A discrete (finite) r.v. takes on only a discrete (finite) set of values.

• The probability of a set S is given by

      P_X(S) = ∑_{x∈S} P_X(x),

  since X⁻¹(S) = ⋃_{x∈S} X⁻¹(x).

• The total probability is one:

      ∑_x P_X(x) = 1

  (by the normalization axiom).

So, we can picture a probability mass function as a "bar graph" with a bar over each of the possible values of the random variable, where the sum of the heights of the bars is one.

For a given random variable X, we can just work with the pmf P_X(x). But doing so hides the sample space! We need to remember that the sample space is really there in the background.

Example 2.2. Tetrahedral dice

Consider the roll of a pair of (distinguishable) tetrahedral dice. The sample space is Ω = {(i, j) | 1 ≤ i, j ≤ 4}. Let X be the random variable X(i, j) = max(i, j). Then

    P_X(1) = 1/16,  P_X(2) = 3/16,  P_X(3) = 5/16,  P_X(4) = 7/16.

3. Some Important RVs

One of the most central random variables is the Bernoulli.

Example 3.1. Bernoulli

The Bernoulli random variable is simply the indicator function for a coin flip:

    X(H) = 1,  X(T) = 0

for the sample space Ω = {H, T}, with probability distribution P(H) = 1 − P(T) = p. Thus the probability mass function of the random variable is

    P_X(1) = 1 − P_X(0) = p.

A sum of Bernoullis is called a binomial random variable.
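The pmf in Example 2.2 can be checked by brute-force enumeration of the 16 outcomes. A short Python sketch (my own, assuming equally likely outcomes as in the example):

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 16 equally likely outcomes (i, j) of two distinguishable tetrahedral dice.
sample_space = list(product(range(1, 5), repeat=2))
p = Fraction(1, len(sample_space))

# P_X(x) = P({omega : X(omega) = x}) for X(i, j) = max(i, j):
# push each outcome's probability onto the bar for its X-value.
pmf = Counter()
for i, j in sample_space:
    pmf[max(i, j)] += p

print(dict(sorted(pmf.items())))   # P_X(1) = 1/16, P_X(2) = 3/16, P_X(3) = 5/16, P_X(4) = 7/16
assert sum(pmf.values()) == 1      # total probability is one
```

The counts come out as 1, 3, 5, 7 because exactly 2x − 1 of the 16 pairs have max(i, j) = x.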
Here the sample space is the collection of all sequences such as

    (H H T H ··· T)    (n flips);

that is, the set of n flips of a (biased) coin, assuming the flips are independent. Let X(ω) be the random variable

    X(ω) = number of heads in the sequence ω.

The pmf of a binomial random variable, for n flips of a coin, is

    P_X(k) = C(n, k) p^k (1 − p)^(n−k),

where C(n, k) = n!/(k!(n−k)!) is the binomial coefficient and p is the probability of heads. Note that by the binomial theorem

    ∑_{k=0}^n P_X(k) = ∑_{k=0}^n C(n, k) p^k (1 − p)^(n−k) = (p + (1 − p))^n = 1.

The Boston Museum of Science has a wonderful Galton machine, also known as a Quincunx machine (after the Roman design of five dots that you see on a die).
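The normalization of the binomial pmf can also be verified numerically. A minimal Python sketch (the function name binomial_pmf and the parameters n = 10, p = 0.3 are illustrative, not from the notes):

```python
import math

def binomial_pmf(k, n, p):
    # P_X(k) = C(n, k) * p^k * (1 - p)^(n - k)
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, 0.3
total = sum(binomial_pmf(k, n, p) for k in range(n + 1))
print(total)   # 1.0 up to floating-point rounding, by the binomial theorem
```

With n = 1 this reduces to the Bernoulli pmf, consistent with the binomial being a sum of Bernoullis.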