Introduction to Neural Networks
U. Minn. Psy 5038

More probability

Initialize

Standard library files:

Off[General::spell1];
SetOptions[ContourPlot, ImageSize -> Small];
SetOptions[Plot, ImageSize -> Small];
SetOptions[ListPlot, ImageSize -> Small];

Goals

Review the basics of probability distributions and statistics
More on generative modeling: drawing samples
Graphical models for inference
Optimal inference and task dependence

Probability overview

Random variables, discrete probabilities, probability densities, cumulative distributions

Discrete: a random variable X can take on a finite set of discrete values, X = {x(1), ..., x(N)}, with

\sum_{i=1}^{N} p_i = \sum_{i=1}^{N} p(X = x(i)) = 1

Densities: X takes on continuous values, x, in some range.

Density: p(x)

Analogous to material mass, we can think of the probability over some small domain of the random variable as "probability mass":

prob(x < X < x + dx) = \int_{x}^{x + dx} p(X) dX

prob(x < X < x + dx) \approx p(x) dx

With the mass analogy, however, an object (the event space) always "weighs" 1:

\int_{-\infty}^{\infty} p(x) dx = 1

Cumulative distribution:

prob(X < x) = \int_{-\infty}^{x} p(X) dX

Densities of discrete random variables

The Dirac delta function, δ[•], allows us to use the mathematics of continuous distributions for discrete ones, by defining the density as:

p(x) = \sum_{i=1}^{N} p_i \delta(x - x(i)), where \delta(x - x(i)) = \infty for x = x(i) and 0 for x \neq x(i).

Think of the delta function as ε wide and 1/ε tall, and then let ε -> 0, so that:

\int_{-\infty}^{\infty} \delta(y) dy = 1

The density, p(x), is a series of spikes. It is infinitely high only at those points for which x = x(i), and zero elsewhere. But "infinity" is scaled so that the local mass, or area, around each point x(i) is p_i.

Joint probabilities

Prob(X AND Y) = p(X, Y)

Joint density: p(x, y)

Three basic rules of probability

Suppose we know everything there is to know about a set of variables (A, B, C, D, E). What does this mean in terms of probability? It means that we know the joint distribution, p(A, B, C, D, E). In other words, for any particular combination of values (A=a, B=b, C=c, D=d, E=e), we can calculate, look up in a table, or determine in some way or another the number p(A=a, B=b, C=c, D=d, E=e), for any particular instances a, b, c, d, e.

Rule 1: Conditional probabilities from joints: The product rule

The probability of an event changes when new information is gained.

Prob(X given Y) = p(X | Y)

p(X | Y) = p(X, Y) / p(Y)

p(X, Y) = p(X | Y) p(Y)

The form of the product rule is the same for densities as for probabilities.

Rule 2: Lower-dimensional probabilities from joints: The sum rule (marginalization)

p(X) = \sum_{i=1}^{N} p(X, Y(i))

p(x) = \int_{-\infty}^{\infty} p(x, y) dy

Rule 3: Bayes' rule

From the product rule, and since p(X, Y) = p(Y, X), we have:

p(Y | X) = p(X | Y) p(Y) / p(X), and, using the sum rule, p(Y | X) = p(X | Y) p(Y) / \sum_Y p(X, Y)
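To make the three rules concrete, here is a minimal Mathematica sketch of my own (the table jointP and the other symbol names are made up for this illustration, not part of the lecture). It builds a small discrete joint distribution and verifies the sum rule, the product rule, and Bayes' rule numerically:

(* A small numerical check of the sum rule, the product rule, and Bayes' rule. *)
(* jointP and the other symbol names are made up for this illustration.        *)
jointP = {{1/8, 1/8, 1/4},
          {1/4, 1/8, 1/8}};        (* rows: X = 1, 2; columns: Y = 1, 2, 3; entries sum to 1 *)

px = Total[jointP, {2}];           (* sum rule: p(X) = Sum_Y p(X, Y)  ->  {1/2, 1/2}      *)
py = Total[jointP, {1}];           (* sum rule: p(Y) = Sum_X p(X, Y)  ->  {3/8, 1/4, 3/8} *)

pYgivenX = jointP/px;              (* product rule: p(Y|X) = p(X, Y)/p(X), one row per X   *)
pXgivenY = Transpose[Transpose[jointP]/py];    (* p(X|Y) = p(X, Y)/p(Y), one column per Y  *)

(* Bayes' rule: p(Y|X) = p(X|Y) p(Y) / p(X) should reproduce pYgivenX *)
Table[pXgivenY[[x, y]] py[[y]]/px[[x]], {x, 2}, {y, 3}] == pYgivenX
(* -> True *)

Any non-negative table whose entries sum to 1 could be substituted for jointP; the three identities hold for every such joint.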
Bayes terminology in inference

Suppose we have some partial data (we see half of someone's face) and we want to recall or complete the whole. Or suppose that we hear a voice, and from that visualize the face. These are both problems of statistical inference. We've already studied how to complete a partial pattern using energy minimization, and how energy minimization can be viewed as probability maximization.

We typically think of the Y term as a random variable over the hypothesis space (a face), and X as data or a stimulus (a partial face, or a sound). So for recalling a pattern Y from an input stimulus X, we'd like to have a function that tells us:

p(Y | X), which is called the posterior probability of the hypothesis (e.g. the description of the full face as output) given the stimulus (the partial face as "data"), i.e. what you get when you divide the joint by the probability of the stimulus data. The posterior is often what we'd like to base our decisions on, because it can be proved that picking the hypothesis Y which maximizes the posterior (i.e. maximum a posteriori, or MAP, estimation) minimizes the average probability of error.

p(Y) is the prior probability of the hypothesis. Some hypotheses are "a priori" more likely than others, and even if it isn't made explicit, a model's prior implicitly assumes conditions. Given a context, such as your room, some faces are more likely than others: for me, an image patch stimulating my retina in my kitchen is much more likely to be my wife's face than my brother's (he lives in another state). Priors are contingent, i.e. conditional on context, p(Y | context), even if the context is not made explicit.

p(X | Y) is the likelihood of the hypothesis. Note that this is a probability of X, but not of Y. (The sum over X is one, but the sum over Y isn't necessarily one.)

Independence

Knowledge of one event doesn't change the probability of another event: p(X) = p(X | Y), which by the product rule gives

p(X, Y) = p(X) p(Y)

Deterministic relationships

Deterministic relationships can be treated as special cases, and they provide a useful way to build some intuitions about probabilities.

For example, suppose we know that Y = X^2 exactly, for integer values of X with 0 < X < 5, each equally likely. What is the probability of X = x, Y = y, over the space of possible x's?

p[y_, x_] := If[0 < x < 5, KroneckerDelta[y - x^2]/4, Null]

Table[p[y, x], {x, 1, 4}, {y, 1, 16}] // MatrixForm

The output is a 4 x 16 matrix whose x-th row has 1/4 in column y = x^2 (i.e. at y = 1, 4, 9, 16) and 0 everywhere else.

Given X = x, what is y?

Manipulate[
 ListPlot[Table[p[y, x], {y, 1, 16}], Filling -> Axis, ImageSize -> Small,
  Axes -> {True, True}, PlotRange -> {{-4, 20}, {0, .25}},
  AxesLabel -> {"y", "p"}],
 {x, 1, 16, 1}]

[Interactive output: a plot of p(y, X = x) against y, with a slider for x.]

Note that we've plotted p(y, X = 2). What is p(y | X = 2)? What is p(y)?

py[y_] := Sum[p[y, x], {x, 1, 4}]

py[4]
(* 1/4 *)

p[9, 2]
(* 0 *)

Density mapping theorem

Suppose we have a change of variables that maps a discrete set of x's uniquely to y's: X -> Y.

Discrete random variables

There is no change to the probability function. The mapping just corresponds to a change of labels, so the probabilities p(X) = p(Y).

Continuous random variables

The form of the probability density function does change, because we require the probability "mass" to be unchanged: p(x) dx = p(y) dy.

Suppose y = f(x). Then

p_Y(y) dy = p_X(x) dx

(In higher dimensions, the transformation is done by multiplying the density by the Jacobian, the determinant of the matrix of partial derivatives of the change of coordinates.)

One can express the density mapping theorem as

p_Y(y) = \int \delta(y - f(x)) p_X(x) dx

evaluated over each monotonic part of f.
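As a sanity check on the continuous case, and tying in the goal of drawing samples from a generative model, here is a short Mathematica sketch of my own (the symbols distX, distY, ySamples, and the bin width are illustrative, not from the notebook). It lets TransformedDistribution compute the density of Y = X^2 for X uniform on (0, 1), compares it with the hand-derived answer p_Y(y) = p_X(x) |dx/dy| = 1/(2 \sqrt{y}), and then checks the same density against a histogram of squared samples:

(* Change of variables y = x^2 with X ~ Uniform(0, 1); an illustrative check, *)
(* not code from the lecture.                                                 *)
distX = UniformDistribution[{0, 1}];
distY = TransformedDistribution[x^2, x \[Distributed] distX];

PDF[distY, y]                                         (* built-in density of Y                *)
Simplify[PDF[distY, y] == 1/(2 Sqrt[y]), 0 < y < 1]   (* matches the hand calculation -> True *)

(* Empirical check: draw samples of X, map them through f, and compare the *)
(* normalized histogram of Y with the analytic density.                    *)
SeedRandom[1];
ySamples = RandomVariate[distX, 10000]^2;
Show[
 Histogram[ySamples, {0.05}, "PDF"],
 Plot[1/(2 Sqrt[y]), {y, 0.01, 1}, PlotStyle -> Thick]
]

Because y = x^2 is monotonic on (0, 1), a single branch of f suffices here; for a non-monotonic f (say x^2 on (-1, 1)), the contributions from each monotonic piece are summed, which is exactly what the delta-function form of the theorem does.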

