An introduction to graphical models
Kevin P. Murphy
10 May 2001

1 Introduction

The following quotation, from the Preface of [Jor99], provides a very concise introduction to graphical models:

  Graphical models are a marriage between probability theory and graph theory. They provide a natural tool for dealing with two problems that occur throughout applied mathematics and engineering (uncertainty and complexity), and in particular they are playing an increasingly important role in the design and analysis of machine learning algorithms. Fundamental to the idea of a graphical model is the notion of modularity: a complex system is built by combining simpler parts. Probability theory provides the glue whereby the parts are combined, ensuring that the system as a whole is consistent, and providing ways to interface models to data. The graph theoretic side of graphical models provides both an intuitively appealing interface by which humans can model highly-interacting sets of variables, as well as a data structure that lends itself naturally to the design of efficient general-purpose algorithms.

  Many of the classical multivariate probabilistic systems studied in fields such as statistics, systems engineering, information theory, pattern recognition and statistical mechanics are special cases of the general graphical model formalism; examples include mixture models, factor analysis, hidden Markov models, Kalman filters and Ising models. The graphical model framework provides a way to view all of these systems as instances of a common underlying formalism. This view has many advantages; in particular, specialized techniques that have been developed in one field can be transferred between research communities and exploited more widely. Moreover, the graphical model formalism provides a natural framework for the design of new systems.

In this paper, we will flesh out this remark by discussing the following topics:

- Representation: how can a graphical model compactly represent a joint probability distribution?
- Inference: how can we efficiently infer the hidden states of a system, given partial and possibly noisy observations?
- Learning: how do we estimate the parameters and structure of the model?
- Decision theory: what happens when it is time to convert beliefs into actions?
- Applications: what has this machinery been used for?

Figure 1: A simple Bayesian network, adapted from [RN95]. The nodes are Cloudy (C), Sprinkler (S), Rain (R) and WetGrass (W), with arcs C -> S, C -> R, S -> W and R -> W, and the following conditional probability tables:

  P(C=F)  P(C=T)
  0.5     0.5

  C  |  P(S=F)  P(S=T)
  F  |  0.5     0.5
  T  |  0.9     0.1

  C  |  P(R=F)  P(R=T)
  F  |  0.8     0.2
  T  |  0.2     0.8

  S  R  |  P(W=F)  P(W=T)
  F  F  |  1.0     0.0
  T  F  |  0.1     0.9
  F  T  |  0.1     0.9
  T  T  |  0.01    0.99

2 Representation

Probabilistic graphical models are graphs in which nodes represent random variables and the (lack of) arcs represent conditional independence assumptions. Hence they provide a compact representation of joint probability distributions, as we will see below. For example, if we have N binary random variables, an "atomic" representation of the joint P(X1, ..., XN) needs O(2^N) parameters, whereas a graphical model may need exponentially fewer, depending on which conditional independence assumptions we make. This can help both inference and learning, as we explain below.

There are two main kinds of graphical models: undirected and directed. Undirected graphical models, also known as Markov networks or Markov random fields (MRFs), are more popular with the physics and vision communities. (Log-linear models are a special case of undirected graphical models, and are popular in statistics.) Directed graphical models, also known as Bayesian networks (BNs), belief networks, generative models, causal models, etc., are more popular with the AI and machine learning communities. Note that, despite the name, Bayesian networks do not necessarily imply a commitment to Bayesian methods; rather, they are so called because they use Bayes' rule for inference, as we will see below. It is also possible to have a model with both directed and undirected arcs, which is called a chain graph.

2.1 Directed graphical models

In a directed graphical model (i.e., a Bayesian network), an arc from A to B can be informally interpreted as indicating that A "causes" B.[1] Hence directed cycles are disallowed. Consider the example in Figure 1. Here, nodes represent binary random variables. We see that the event "grass is wet" (W = true) has two possible causes: either the water sprinkler is on (S = true) or it is raining (R = true). The strength of this relationship is shown in the table below W; this is called W's conditional probability table (CPT). For example, we see that P(W=true | S=true, R=false) = 0.9 (second entry of second row), and hence P(W=false | S=true, R=false) = 1 - 0.9 = 0.1, since each row must sum to one. Since the C node has no parents, its CPT specifies the prior probability that it is cloudy (in this case, 0.5).

[1] See the following books for a more detailed discussion of the use of Bayes nets for causal modelling: [Pea00], [SGS00], [GC99], [Shi00].

The simplest statement of the conditional independence relationships encoded in a Bayesian network is the following: a node is independent of its ancestors given its parents, where the ancestor/parent relationship is with respect to some fixed topological ordering of the nodes.

Let us see how we can use this fact to specify the joint distribution more compactly. By the chain rule of probability, the joint probability of all the nodes in Figure 1 is

  P(C, S, R, W) = P(C) P(S|C) P(R|C, S) P(W|C, S, R).

By using the conditional independence relationships, we can rewrite this as

  P(C, S, R, W) = P(C) P(S|C) P(R|C) P(W|S, R),

where we were allowed to simplify the third term because R is independent of S given its parent C (written R ⊥ S | C), and the last term because W ⊥ C | S, R. We can see that the conditional independence relationships allow us to represent the joint more compactly. Here the savings are minimal, but in general, if we had n binary nodes, the full joint would require O(2^n) parameters, whereas the factored form would only require O(n 2^k) parameters, where k is the maximum fan-in of a node.
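To make the factored representation concrete, here is a minimal Python sketch (not part of the original paper) that encodes the CPTs from Figure 1 and answers a query by brute-force enumeration over the joint; all variable and function names are illustrative only.

    # Factored joint for Figure 1: P(C,S,R,W) = P(C) P(S|C) P(R|C) P(W|S,R).
    from itertools import product

    P_C = 0.5                                        # P(C=True)
    P_S = {False: 0.5, True: 0.1}                    # P(S=True | C)
    P_R = {False: 0.2, True: 0.8}                    # P(R=True | C)
    P_W = {(False, False): 0.0, (True, False): 0.9,  # P(W=True | S, R)
           (False, True): 0.9, (True, True): 0.99}

    def bern(p_true, value):
        """Probability that a binary variable equals `value`, given P(True) = p_true."""
        return p_true if value else 1.0 - p_true

    def joint(c, s, r, w):
        """Evaluate the factored joint P(C=c, S=s, R=r, W=w)."""
        return bern(P_C, c) * bern(P_S[c], s) * bern(P_R[c], r) * bern(P_W[(s, r)], w)

    # Brute-force inference by enumeration: P(S=True | W=True).
    num = sum(joint(c, True, r, True) for c, r in product([False, True], repeat=2))
    den = sum(joint(c, s, r, True) for c, s, r in product([False, True], repeat=3))
    print(num / den)   # posterior probability that the sprinkler was on, roughly 0.43

Note that the factored form only stores the 1 + 2 + 2 + 4 = 9 CPT entries above, rather than a full joint table with 2^4 = 16 entries.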
We will now show how many popular statistical models can be represented as directed graphical models.

2.1.1 PCA, ICA and all that

In the above example, the nodes represented discrete random variables. It is also possible for them to represent continuous random variables. In this case, we must specify the conditional probability distribution (CPD) of a node given its parents in parametric form; a common example is a Gaussian. Consider the example in Figure 2(a): this represents the model P(X, Y) = P(X) P(Y|X). Y is observed (hence it is shown shaded), but X is hidden.

Figure 2: Factor analysis represented as a Bayesian network. (a) A single hidden node X with an observed child Y. (b) The component-wise version, with hidden nodes X1, ..., Xn and observed nodes Y1, ..., Ym.
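As a concrete illustration of a Gaussian CPD (a sketch that is not part of the paper), the fragment below draws samples from a linear-Gaussian model of the kind shown in Figure 2, using the standard factor analysis parameterization P(X) = N(0, I) and P(Y|X) = N(WX + mu, Psi) with diagonal Psi; the dimensions and parameter values are made up for illustration.

    # Ancestral sampling from a factor analysis model: draw the hidden X, then Y given X.
    import numpy as np

    rng = np.random.default_rng(0)
    k, d = 2, 5                      # hidden (X) and observed (Y) dimensions
    W = rng.normal(size=(d, k))      # factor loading matrix
    mu = np.zeros(d)                 # observation mean
    psi = 0.1 * np.ones(d)           # diagonal observation noise variances

    def sample(n):
        """Draw n samples: X ~ N(0, I), then Y | X ~ N(W X + mu, diag(psi))."""
        X = rng.normal(size=(n, k))
        Y = X @ W.T + mu + rng.normal(size=(n, d)) * np.sqrt(psi)
        return X, Y

    X, Y = sample(10000)
    # Marginally Y ~ N(mu, W W^T + diag(psi)); the sample covariance should be close to this.
    print(np.abs(np.cov(Y.T) - (W @ W.T + np.diag(psi))).max())

Although the noise added to each component of Y is independent, marginalizing out the hidden X induces correlations between the observed components, via the W W^T term in the marginal covariance.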