U of M PSY 5038 - Gaussian generative models, learning, and inference

Introduction to Neural Networks
U. Minn. Psy 5038
Gaussian generative models, learning, and inference

Initialize standard library files:

Off[General::spell1];

Last time

Quick review of probability and statistics

Basic rules of probability

Suppose we know everything there is to know about a set of variables (A, B, C, D, E). What does this mean in terms of probability? It means that we know the joint distribution p(A, B, C, D, E). In other words, for any particular combination of values (A=a, B=b, C=c, D=d, E=e), we can calculate, look up in a table, or determine in some way the number p(A=a, B=b, C=c, D=d, E=e), for any particular instances a, b, c, d, e.

Rule 1: Conditional probabilities from joints: the product rule

The probability of an event changes when new information is gained.

Prob(X given Y) = p(X|Y)

p(X|Y) = p(X, Y) / p(Y)

p(X, Y) = p(X|Y) p(Y)

The form of the product rule is the same for densities as for probabilities.

Independence

Knowledge of one event doesn't change the probability of another event: p(X) = p(X|Y), which by the product rule gives

p(X, Y) = p(X) p(Y)

Rule 2: Lower-dimensional probabilities from joints: the sum rule (marginalization)

p(X) = Σ_{i=1}^{N} p(X, Y(i))

p(x) = ∫_{-∞}^{∞} p(x, y) dy

Rule 3: Bayes' rule

From the product rule, and since p(X, Y) = p(Y, X), we have:

p(Y|X) = p(X|Y) p(Y) / p(X)

and, using the sum rule,

p(Y|X) = p(X|Y) p(Y) / Σ_Y p(X, Y)

Bayes terminology in inference

Suppose we have some partial data (we see half of someone's face), and we want to recall or complete the whole. Or suppose that we hear a voice, and from that visualize the face. These are both problems of statistical inference. We've already studied how to complete a partial pattern using energy minimization, and how energy minimization can be viewed as probability maximization.

We typically think of the Y term as a random variable over the hypothesis space (a face), and X as data or a stimulus (partial face, or sound).
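The product rule, sum rule, and Bayes' rule above can all be exercised on a tiny discrete joint distribution. The following sketch (not from the lecture; the joint table and function names are illustrative) stores p(X, Y) as a dict keyed by (x, y):

```python
# A toy joint distribution p(X, Y) over X in {0, 1}, Y in {0, 1}.
# Entries sum to 1; the particular numbers are made up for illustration.
joint = {
    (0, 0): 0.30, (0, 1): 0.10,
    (1, 0): 0.20, (1, 1): 0.40,
}

def marginal_x(x):
    """Sum rule: p(X=x) = sum over y of p(X=x, Y=y)."""
    return sum(p for (xi, yi), p in joint.items() if xi == x)

def marginal_y(y):
    """Sum rule: p(Y=y) = sum over x of p(X=x, Y=y)."""
    return sum(p for (xi, yi), p in joint.items() if yi == y)

def conditional_x_given_y(x, y):
    """Product rule rearranged: p(X=x | Y=y) = p(X=x, Y=y) / p(Y=y)."""
    return joint[(x, y)] / marginal_y(y)

def posterior_y_given_x(y, x):
    """Bayes' rule: p(Y=y | X=x) = p(X=x | Y=y) p(Y=y) / p(X=x)."""
    likelihood = joint[(x, y)] / marginal_y(y)   # p(X=x | Y=y)
    return likelihood * marginal_y(y) / marginal_x(x)

print(marginal_x(0))                # 0.30 + 0.10 = 0.4
print(conditional_x_given_y(1, 1))  # 0.40 / 0.50 = 0.8
print(posterior_y_given_x(1, 0))    # 0.10 / 0.40 = 0.25
```

Note that the Bayes posterior p(Y=1 | X=0) comes out identical to conditioning the joint directly, joint[(0, 1)] / marginal_x(0), as the product rule guarantees.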
So for recalling a pattern Y from an input stimulus X, we'd like to have a function that tells us:

p(Y|X), the posterior probability of the hypothesis (e.g., a description of the full face as output) given the stimulus (the partial face as "data") -- i.e., what you get when you condition the joint by the probability of the stimulus data. The posterior is often what we'd like to base our decisions on, because it can be proved that picking the hypothesis Y which maximizes the posterior (i.e., maximum a posteriori or MAP estimation) minimizes the average probability of error.

p(Y) is the prior probability of the hypothesis. Some hypotheses are "a priori" more likely than others. But even if it isn't made explicit, a model's prior implicitly assumes conditions. Given a context, such as your room, some faces are more likely than others. For me, an image patch stimulating my retina in my kitchen is much more likely to be my wife's than my brother's (who lives in another state). Priors are contingent, i.e., conditional on context, p(Y | context), even if the context is not made explicit.

p(X|Y) is the likelihood of the hypothesis. Note that this is a probability of X, but not of Y. (The sum over X is one, but the sum over Y isn't necessarily one.)

Applications to random sampling

If we know p(x), and are given a function y = f(x), what is p(y)?

p_Y(y) dy = p_X(x) dx

This principle is used to make random number generators for general probability densities from the uniform distribution. The result is that one can make a random draw from a uniform distribution p(x) on the interval (0, 1) and go to the inverse CDF to read off the value of the random sample from p(y).
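The inverse-CDF trick described above can be sketched concretely (an illustrative example, not from the lecture): take the exponential density p(y) = λ e^(-λy), whose CDF F(y) = 1 - e^(-λy) inverts in closed form, draw u uniformly on (0, 1), and read off y = F⁻¹(u):

```python
import math
import random

def sample_exponential(lam, rng):
    """Draw from p(y) = lam * exp(-lam * y) via the inverse CDF."""
    u = rng.random()                 # uniform draw on [0, 1)
    return -math.log(1.0 - u) / lam  # y = F^{-1}(u)

rng = random.Random(0)
lam = 2.0
draws = [sample_exponential(lam, rng) for _ in range(100_000)]

# Sanity check: the exponential has mean 1/lam = 0.5.
print(sum(draws) / len(draws))  # should be close to 0.5
```

The same recipe works for any density whose CDF can be inverted, numerically if not in closed form.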
Today

Statistics review
Examples of computations on continuous probabilities
Examples of computations on discrete probabilities
Introduction to Bayes learning

Probability overview continued

Bayes terminology in visual perception

p(S|I) = p(I|S) p(S) / p(I)

Usually, we will be thinking of the Y term as a random variable over the hypothesis space, and X as data. So for visual inference, Y = S (the scene) and X = I (the image data), where I = f(S). We'd like to have:

p(S|I), the posterior probability of the scene given the image -- i.e., what you get when you condition the joint by the image data. The posterior is often what we'd like to base our decisions on, because as we discuss below, picking the hypothesis S which maximizes the posterior (i.e., maximum a posteriori or MAP estimation) minimizes the average probability of error.

p(S) is the prior probability of the scene.

p(I|S) is the likelihood of the scene. Note that this is a probability of I, but not of S.

Statistics

Expectation & variance

Analogous to center of mass, the definition of expectation or average:

Average[X] = <X> = E[X] = Σ_i x(i) p(x(i)) ~ (1/N) Σ_{i=1}^{N} x_i

μ = E[X] = ∫ x p(x) dx

Some rules:

E[X + Y] = E[X] + E[Y]
E[aX] = a E[X]
E[X + a] = a + E[X]

Definition of variance:

σ² = Var[X] = E[(X - μ)²] = Σ_{j=1}^{N} p_j (x_j - μ)² ~ (1/N) Σ_{i=1}^{N} (x_i - μ)²

Var[X] = ∫ (x - μ)² p(x) dx

Standard deviation:

σ = sqrt(Var[X])

Some rules:

Var[X] = E[X²] - E[X]²
Var[aX] = a² Var[X]

Covariance & correlation

Covariance:

Cov[X, Y] = E[(X - μ_X)(Y - μ_Y)]

Correlation coefficient:

ρ[X, Y] = Cov[X, Y] / (σ_X σ_Y)

Covariance matrix

Suppose now that X is a vector: {X1, X2, ...}. Then we can describe the covariance between pairs of elements of X:

Σ_ij = Cov[X_i, X_j] = E[(X_i - μ_{X_i})(X_j - μ_{X_j})] ~ (1/N) Σ_{n=1}^{N} (x_i^(n) - μ_{X_i})(x_j^(n) - μ_{X_j})

In matrix form, the covariance can be written:

Σ = Cov[X] = E[(X - E[X])(X - E[X])^T]

In other words, the covariance matrix can be approximated by the average outer product.
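The "average outer product" estimate of the covariance matrix can be sketched as follows (the four 2-d samples are made up for illustration; real use would of course involve many more samples):

```python
# Estimate Sigma ~ (1/N) * sum_n (x_n - m)(x_n - m)^T from samples,
# using plain Python lists so the outer-product structure is explicit.
samples = [          # N = 4 samples of a 2-d random vector X
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, 6.0],
    [4.0, 8.0],
]
N = len(samples)
d = len(samples[0])

# Sample mean m, one component at a time.
mean = [sum(x[i] for x in samples) / N for i in range(d)]

# Accumulate the average outer product of the mean-centered samples.
cov = [[0.0] * d for _ in range(d)]
for x in samples:
    centered = [x[i] - mean[i] for i in range(d)]
    for i in range(d):
        for j in range(d):
            cov[i][j] += centered[i] * centered[j] / N

print(cov)  # [[1.25, 2.5], [2.5, 5.0]]
```

Here the second coordinate is exactly twice the first, so every entry reflects that linear dependence: cov[0][1] = 2 * cov[0][0] and cov[1][1] = 4 * cov[0][0]. The accumulation loop is also the Hebbian flavor mentioned next: each sample adds its own outer product to the running matrix.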
In the language of neural networks, it is a Hebbian matrix memory of pair-wise relationships.

Independent random variables

If p(X, Y) = p(X) p(Y), then:

E[X Y] = E[X] E[Y]   (uncorrelated)
Cov[X, Y] = ρ[X, Y] = 0
Var[X + Y] = Var[X] + Var[Y]

If two random variables are uncorrelated, they are not necessarily independent. Two random variables are said to be orthogonal if their correlation is zero.

Degree of belief vs. relative frequency

What is the probability that it will rain tomorrow? Assigning a number between 0 and 1 is assigning a degree of belief. These probabilities are also called subjective probabilities.

What is the probability that a coin will come up heads? In this case, we can do an experiment. Flip the coin n times, count the number of heads, say h(n), and then set the probability p = h(n)/n -- the relative frequency. Of course, if we did it again, we might not get the same estimate of p. One solution often given is:

p = lim_{n→∞} h(n)/n

A problem with this is that in
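The run-to-run variability of the relative-frequency estimate h(n)/n can be seen in a quick simulation (illustrative sketch; a simulated fair coin stands in for the physical experiment):

```python
import random

def relative_frequency(n, p_true, rng):
    """Flip a simulated coin n times and return h(n) / n."""
    heads = sum(1 for _ in range(n) if rng.random() < p_true)
    return heads / n

rng = random.Random(1)
for n in (10, 1_000, 100_000):
    # Each line is one experiment; small n gives noisy estimates of 0.5,
    # large n gives estimates that settle near the true value.
    print(n, relative_frequency(n, 0.5, rng))
```

Repeating the n = 10 experiment with different seeds gives noticeably different estimates, while the n = 100,000 estimates cluster tightly around 0.5, which is the empirical content of the limit definition above.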

