U of M PSY 5038 - Probability, Energy and the Boltzmann machine

Introduction to Neural Networks
Probability, Energy & the Boltzmann machine

Initialize

Spell check off. Plots small.

In[43]:= Off[General::spell1];
         SetOptions[Plot, ImageSize -> Small];
         SetOptions[ArrayPlot, ColorFunction -> Hue, Mesh -> True, ImageSize -> Tiny];

Introduction

The past two lectures
- Learning as searching for weights that minimize prediction errors (statistical regression & Widrow-Hoff)
- Network dynamics as minimizing "energy" (Hopfield)

Common theme: analysis of objective functions (e.g. the sum of squared differences in regression, or the energy in network dynamics) can provide useful insights into neural networks.

Today
Deepen and broaden our conceptual tools by establishing a relationship between "energy" and probability. This provides an important transition in the development of neural network models and in the material of this course.

Preview of future
By treating neural network learning and dynamics in terms of probability computations, we'll begin to see how a common set of tools and concepts can be applied to:
1. Inference (as in perception and memory recall): a process that makes the best guesses given data. What it means to be "best" is specified in terms of computations on probability distributions (and, later in the course, utility functions).
2. Learning: a process that discovers the parameters of a probability distribution.
3. Generative modeling: a process that generates data from a probability distribution.

Statistical physics, computation, and statistical inference

At the beginning of this course, we noted that John von Neumann, one of the principal minds behind the architecture of the modern digital computer, wrote that brain theory and theories of computation would eventually come to resemble statistical mechanics or thermodynamics more than formal logic. We have already seen, in the Hopfield net, the development of the analogy between statistical physics systems and neural networks. The relationship between computation and statistical physics was subsequently studied by a number of physicists (cf. Hertz et al., 1991). We are going to look at a neural network model that exploits the relationship between thermodynamics and computation both to find global minima and to modify weights. Further, we will see how relating energy to probability leads naturally to statistical inference theory. Much of the current research in neural network theory is done in the context of statistical inference (Bishop, 1995; Ripley, 1996; MacKay, 2003).

Probability Preliminaries

We'll go over the basics of probability theory in more detail in a later lecture. Today we'll review enough to see how energy functions can be related to probability, and energy minimization to maximizing probability.
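To make that link concrete before the formal treatment, here is a minimal Mathematica sketch (not in the original notes). It assumes the Gibbs/Boltzmann form p(x) ∝ Exp[-E(x)/T] that the rest of the lecture develops; the three states and their energy values are made up purely for illustration.

    (* Toy example: three states with hypothetical energies and "temperature" T = 1 *)
    energies = {2.0, 0.5, 1.0};
    T = 1.0;
    probs = Exp[-energies/T]/Total[Exp[-energies/T]]   (* -> roughly {0.122, 0.547, 0.331} *)

    (* The lowest-energy state is also the most probable state, so under this form
       minimizing energy is the same as maximizing probability *)
    Ordering[energies, 1] == Ordering[probs, -1]       (* -> True *)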
Random variables, discrete probabilities, probability densities, cumulative distributions

Discrete distributions: the random variable X can take on a finite set of discrete values X = {x(1), ..., x(N)}, with

    \sum_{i=1}^{N} p_i = \sum_{i=1}^{N} p(X = x(i)) = 1

Continuous densities: X takes on continuous values, x, in some range.
Density: p(x). Analogous to material mass, we can think of the probability over some small domain of the random variable as "probability mass":

    prob(x < X < x + dx) = \int_{x}^{x + dx} p(x') \, dx'
    prob(x < X < x + dx) \approx p(x) \, dx

By analogy with discrete distributions, the area under p(x) must be unity,

    \int_{-\infty}^{\infty} p(x) \, dx = 1

like an object that always weighs 1.

Cumulative distribution:

    prob(X < x) = \int_{-\infty}^{x} p(X) \, dX

Densities of discrete random variables

The Dirac delta function, \delta[\cdot], allows us to use the mathematics of continuous distributions for discrete ones, by defining the density as

    p(x) = \sum_{i=1}^{N} p_i \, \delta(x - x(i)), where \delta(x - x(i)) = \infty for x = x(i) and 0 for x \neq x(i).

Think of the delta function \delta[\cdot] as \epsilon wide and 1/\epsilon tall, and then let \epsilon -> 0, so that

    \int_{-\infty}^{\infty} \delta(y) \, dy = 1

The above density p(x) is a series of spikes. It is infinitely high only at those points for which x = x(i), and zero elsewhere. But "infinity" is scaled so that the local mass or area around each point x(i) is p_i.

Check out Mathematica's functions DiracDelta and KroneckerDelta. What is the relationship of KroneckerDelta to IdentityMatrix?

Joint probabilities

    Prob(X AND Y) = p(X, Y)
    Joint density: p(x, y)

Two events, X and Y, are said to be independent if the probability of their occurring together (i.e. their "joint probability") is equal to the product of their probabilities:

    p(X, Y) = p(X) p(Y)

If a and b are independent, what is the conditional probability of a given b? The intuition is that knowledge of b provides no help in making statistical decisions about a.

Three basic rules of probability

Suppose we know everything there is to know about a set of variables (A, B, C, D, E). What does this mean in terms of probability? It means that we know the joint distribution, p(A, B, C, D, E). In other words, for any particular combination of values (A=a, B=b, C=c, D=d, E=e), we can calculate, look up in a table, or determine in some way or another the number p(A=a, B=b, C=c, D=d, E=e). Deterministic relationships are special cases.

Rule 1: Conditional probabilities from joints: the product rule

The probability of an event changes when new information is gained.

    Prob(X given Y) = p(X | Y)
    p(X | Y) = p(X, Y) / p(Y)
    p(X, Y) = p(X | Y) p(Y)

The form of the product rule is the same for densities as for probabilities.

Rule 2: Lower-dimensional probabilities from joints: the sum rule (marginalization)

    p(X) = \sum_{i=1}^{N} p(X, Y(i))
    p(x) = \int_{-\infty}^{\infty} p(x, y) \, dy

Rule 3: Bayes' rule

From the product rule, and since p(X, Y) = p(Y, X), we have

    p(Y | X) = p(X | Y) p(Y) / p(X)

and, using the sum rule,

    p(Y | X) = p(X | Y) p(Y) / \sum_{Y} p(X, Y)
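As a quick check of the three rules (an added illustration, not from the original notebook), here is a small Mathematica sketch using a made-up 2 x 2 joint distribution over binary X and Y. Exact rationals are used so the identities come out exactly.

    (* Hypothetical joint distribution p(X,Y): rows index X = 1,2; columns index Y = 1,2 *)
    pXY = {{3, 1}, {2, 4}}/10;                 (* entries sum to 1 *)
    pX = Total[pXY, {2}]                       (* sum rule: p(X) = Sum_j p(X, Y=j)  -> {2/5, 3/5} *)
    pY = Total[pXY, {1}]                       (* sum rule: p(Y) = Sum_i p(X=i, Y)  -> {1/2, 1/2} *)
    pYgivenX = pXY/pX                          (* product rule: p(Y|X) = p(X,Y)/p(X); each row sums to 1 *)
    pXgivenY = Transpose[Transpose[pXY]/pY];   (* p(X|Y): pXgivenY[[i, j]] = p(X=i | Y=j) *)

    (* Bayes' rule: p(Y=1 | X=1) = p(X=1 | Y=1) p(Y=1) / p(X=1) *)
    pXgivenY[[1, 1]]*pY[[1]]/pX[[1]] == pYgivenX[[1, 1]]   (* -> True *)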
Preview

1. Inference: a process that makes guesses given data. Optimal inference makes the best guess according to some criterion. E.g. given data X = x, what value of Y produces the biggest p(Y | X = x)? I.e. what is the most probable value of Y, given X = x?
2. Learning: a process that discovers the parameters of a probability distribution. Supervised learning: given training pairs {Xi, Yi}, what is p(X, Y)? Unsupervised learning: given data {Xi}, what is p(X)?
3. Generative modeling: a process that generates data from a probability distribution. Given p(X), produce sample data {Xi}.

Below we introduce the Boltzmann machine as a historical example of how one can accomplish all three processes within one network architecture.

Probability and energy

Probabilities of hypotheses contingent on data, Bayes' rule

Conditional probabilities are

