CS188 Section Handout 4: Probability

Sean Markan

February 20, 2007

1 Notes

Probabilistic models always involve a sample space Ω, which can be thought of as the set of all possible outcomes of some experiment. The model also specifies a function P(ω), which gives the probability of each outcome ω in Ω.[1] The sum Σ_{ω∈Ω} P(ω), the probability of the outcome being any element of Ω, must be 1.

Random variables are just functions which assign a value to each outcome in a sample space. They are usually written with capital letters and can be thought of as measurements of outcomes. For example, if the sample space is the set of all people, one could define a random variable H which measures height. If k is a person, H(k) is his/her height.

An event is a collection of outcomes of an experiment which have some property in common. For example, if the sample space is the set of people, F might be the set of females. Randomly picking a person may or may not result in the event that they're female. We write P(F) for the probability that the event occurs. If A and B are events, we can define new events A ∩ B, A ∪ B, and Ā. P(A, B) always means P(A ∩ B).

The conditional probability of event A occurring given that event B occurs, written P(A|B), is

    P(A|B) = P(A ∩ B) / P(B).    (1)

If X is a random variable and x a value, consider the event that the measurement X, performed on an outcome of a random experiment, equals x. This event is written X = x. We can write P(X = x) for the probability that X = x occurs. Often people abbreviate this to P(X).[2] P(X = x), viewed as a function of x, is called the distribution of X.

A joint distribution of multiple random variables is a function giving the probability that they take on particular combinations of values. Example: P(X = x, Y = y).

A conditional distribution is a function giving a conditional probability where the events are equality tests of random variables.
Example: P(X = x|Y = y).

Often people will define a probabilistic model by simply listing some random variables and providing a full joint distribution over all of them. (What are the outcomes and the function P(ω) in that case?) To calculate smaller joint distributions it then becomes necessary to marginalize over unused variables. For example, if the full joint is P(X = x, Y = y, Z = z), then we can compute a smaller joint distribution P(X = x, Y = y) as follows:

    P(X = x, Y = y) = Σ_z P(X = x, Y = y, Z = z).    (2)

People call such a distribution a marginal distribution to emphasize that it was computed by marginalization.[3]

The words marginal, conditional, and joint don't refer to disjoint categories. P(X, Y|Z) could be the joint probability of X and Y, conditioned on Z, marginalized over some other variable W.

Chain rule: P(A, B) = P(A|B) P(B)

Bayes' rule: P(A|B) = P(B|A) P(A) / P(B)

Events A and B are called independent if P(A, B) = P(A)P(B). This is equivalent to P(A|B) = P(A). Random variables X and Y are called independent if P(X, Y) = P(X)P(Y). This is equivalent to P(X|Y) = P(X).

---
[1] This is technically a lie. P is usually defined as a function whose domain is a subset of the powerset of Ω, and where P(A) represents the probability that the outcome falls in A. The probabilistic model is a measure space with measure P. For a discrete Ω, this technicality may be ignored. We assume a discrete Ω for the rest of this introduction, though all the results generalize.
[2] Be sure you understand the difference between X and x: X is a name for something about outcomes that we can measure. x is a variable which ranges over possible results of that measurement.
[3] The word marginalization derives from the pre-spreadsheet accountant's practice of writing sums of columns or rows of a table in the margins.
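Marginalization per equation (2) is mechanical to carry out in code. Below is a minimal sketch in Python; the full joint over three binary variables is invented purely for illustration, and the same table is then used to recover a conditional distribution via equation (1):

```python
from collections import defaultdict

# An invented full joint P(X=x, Y=y, Z=z) over three binary variables.
# The eight probabilities must sum to 1.
full_joint = {
    (0, 0, 0): 0.10, (0, 0, 1): 0.05,
    (0, 1, 0): 0.15, (0, 1, 1): 0.10,
    (1, 0, 0): 0.05, (1, 0, 1): 0.20,
    (1, 1, 0): 0.25, (1, 1, 1): 0.10,
}

def marginal_xy(joint):
    """Equation (2): P(X=x, Y=y) = sum over z of P(X=x, Y=y, Z=z)."""
    m = defaultdict(float)
    for (x, y, z), p in joint.items():
        m[x, y] += p
    return dict(m)

def conditional_x_given_y(joint, x, y):
    """Equation (1) applied to the events X=x and Y=y."""
    p_xy = marginal_xy(joint)
    p_y = sum(p for (xx, yy), p in p_xy.items() if yy == y)
    return p_xy[x, y] / p_y

p_xy = marginal_xy(full_joint)
print(round(p_xy[0, 0], 6))                              # 0.15
print(round(conditional_x_given_y(full_joint, 0, 1), 4)) # 0.4167
```

Note that summing z out of the full joint leaves a table over (x, y) whose entries still sum to 1, as a marginal distribution must.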
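The chain rule, Bayes' rule, and the independence test lend themselves to a quick numerical check. Here is a small sketch; the joint distribution is invented, and deliberately chosen so that X and Y come out independent:

```python
# An invented joint P(X=x, Y=y), chosen so that X and Y are independent.
joint = {(0, 0): 0.3, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.2}

def p_x(x):
    """Marginal P(X = x)."""
    return sum(p for (xx, _), p in joint.items() if xx == x)

def p_y(y):
    """Marginal P(Y = y)."""
    return sum(p for (_, yy), p in joint.items() if yy == y)

def bayes_x_given_y(x, y):
    """Bayes' rule: P(X=x | Y=y) = P(Y=y | X=x) P(X=x) / P(Y=y)."""
    p_y_given_x = joint[x, y] / p_x(x)   # chain rule, rearranged
    return p_y_given_x * p_x(x) / p_y(y)

# Bayes' rule must agree with the direct definition P(X=x, Y=y) / P(Y=y).
assert abs(bayes_x_given_y(0, 1) - joint[0, 1] / p_y(1)) < 1e-12

# Independence: P(X=x, Y=y) = P(X=x) P(Y=y) for every pair of values.
independent = all(abs(joint[x, y] - p_x(x) * p_y(y)) < 1e-12
                  for x in (0, 1) for y in (0, 1))
print(independent)   # True
```

Since X and Y are independent here, P(X=0 | Y=1) collapses to the marginal P(X=0) = 0.6, which is exactly what Bayes' rule returns.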