
Introduction to Neural Networks
U. Minn. Psy 5038
Daniel Kersten

Belief Propagation

Initialize

In[1]:= Off[General::spell1];
        Needs["ErrorBarPlots`"]

Last time

Neural population codes representing probability distributions
Generative modeling: multivariate gaussian, mixtures
  - Drawing samples
  - Mixtures of gaussians
  - We will use mixture distributions in the next lecture, on the EM application to segmentation
Introduction to optimal inference and task
Introduction to Bayesian learning

Interpolation using smoothness: Gradient descent

For simplicity, we'll assume 1-D, as in the lecture on sculpting the energy function. In anticipation of formulating the problem in terms of a graph that represents conditional probability dependence, we represent the observable depth cues by y* and the true ("hidden") depth estimates by y.

(Figure from Weiss, 1999.)

First-order smoothness

Earlier we saw that, under specific assumptions, biologically plausible neural updating can be seen to decrease the value of an energy (or "cost") function. One can also start off with an assumed cost function, determined say by a set of constraints, and use gradient descent to derive an update rule that minimizes the cost (see the supplementary material in Lecture 20). However, such an update rule will not necessarily resemble a biologically plausible neural mechanism. For the interpolation problem, we can write the energy or cost function as

E(y) = Σk wk (yk - yk*)^2 + l Σk (yk+1 - yk)^2,

where wk = xs[[k]] is an "indicator function" and the yk* = d are the data values. The indicator function is 1 if data are available at point k, and zero otherwise. Gradient descent gives a local update rule of the form

yk ← yk + h[k] (l (yk+1 + yk-1 - 2 yk) + wk (yk* - yk)),

where l is a free parameter that controls the degree of smoothness, i.e. smoothness at the expense of fidelity to the data, and h[k] sets the step size, e.g.:

Gauss-Seidel: h[k_] := 1/(l + xs[[k]])
Successive over-relaxation (SOR): h2[k_] := 1.9/(l + xs[[k]]);

A simulation: Straight line with random missing data points

Make the data

Consider the problem of interpolating a set of points with missing data, marked by an indicator function, with the following notation: wk = xs[[k]], y* = data, y = f. We'll assume the true model is f = y = j, where j = 1 to size; data is a function of the sampling process on f = j.

In[3]:= size = 32; xs = Table[0, {i, 1, size}]; xs[[1]] = 1; xs[[size]] = 1;
        data = Table[N[j] xs[[j]], {j, 1, size}];
        g3 = ListPlot[Table[N[j], {j, 1, size}], Joined -> True,
          PlotStyle -> {RGBColor[0, 0.5, 0]}];
        g2 = ListPlot[data, Joined -> False,
          PlotStyle -> {RGBColor[0.75, 0., 0], PointSize[Large]}];

The green line shows a straight line connecting the data points. The red dots on the abscissa mark the points where data are missing.

In[4]:= Show[{g2, g3}, ImageSize -> Medium]

(Output: plot of the sampled data points against the underlying straight line.)
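As a quick check of the cost function and update rule above, the following snippet (not part of the original notebook) lets Mathematica differentiate the energy for a small, hypothetical problem of n = 4 units; the names en, ff, dd, and ww are introduced here only for this check. The gradient at an interior unit has exactly the smoothness-plus-data-fidelity form that the matrices Tm and Sm encode next.

Clear[en, ff, dd, ww, l];
n = 4;  (* a small, hypothetical problem size, just for this symbolic check *)
en = l*Sum[(ff[k + 1] - ff[k])^2, {k, 1, n - 1}] +
     Sum[ww[k]*(ff[k] - dd[k])^2, {k, 1, n}];
(* gradient with respect to an interior unit, e.g. ff[2] *)
Simplify[D[en, ff[2]]]
(* equivalent to 2 (l (2 ff[2] - ff[1] - ff[3]) + ww[2] (ff[2] - dd[2])):
   a smoothness pull toward the two neighbors plus a data pull toward dd[2] *)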
Let's set up two matrices, Tm and Sm, such that the gradient of the energy is proportional to l Tm.f - Sm.(data - f). Sm will be our filter to exclude non-data points; Tm will express the "smoothness" constraint.

In[5]:= Sm = DiagonalMatrix[xs];
        Tm = Table[0, {i, 1, size}, {j, 1, size}];
        For[i = 1, i <= size, i++, Tm[[i, i]] = 2];
        Tm[[1, 1]] = 1; Tm[[size, size]] = 1; (*Adjust for the boundaries*)
        For[i = 1, i < size, i++, Tm[[i + 1, i]] = -1];
        For[i = 1, i < size, i++, Tm[[i, i + 1]] = -1];

Check the update rule code for a small size, e.g. size = 10:

In[11]:= Clear[f, d, l];
         (l*Tm.Array[f, size] - Sm.(Array[d, size] - Array[f, size])) // MatrixForm;

Run gradient descent

In[13]:= Clear[Tf, f1];
         dt = 1; l = 2;
         Tf[f1_] := f1 - dt*(1/(l + xs))*(Tm.f1 - l*Sm.(data - f1));

We will initialize the state vector to zero, and then run the network for iter iterations:

In[46]:= f0 = Table[0, {i, 1, size}];
         result = f0; f = f0; iter = 25;

Now plot the interpolated function.

Animate[
  result = Nest[Tf, f, iter];
  f = result;
  g1 = ListPlot[result, Joined -> False, AspectRatio -> Automatic,
    PlotRange -> {{0, size}, {-1, size + 1}}, ImageSize -> Medium];
  Show[{g1, g2, g3,
    Graphics[{Text["Iteration=" <> ToString[iter], {size/2, size/2}]}]},
   PlotRange -> {-1, size + 1}],
  {j, 1, 10}]

(Output: animation of the interpolated values converging toward the straight line.)

Try starting with f = random values. Try various numbers of iterations. Try different sampling functions xs[[i]].

Belief Propagation

The same interpolation problem, but now using belief propagation. The example is taken from Yair Weiss (Weiss, 1999).

Probabilistic generative model

(1)  data[[i]] = y*[i] = xs[[i]] y[[i]] + dnoise,   dnoise ~ N[0, sD]
(2)  y[[i+1]] = y[[i]] + znoise,   znoise ~ N[0, sR]

The first equation is the "data formation" model, i.e. how the data are directly influenced by the interaction of the underlying influences or causes y with the sampling and noise. The second equation reflects our prior assumptions about the smoothness of y: nearby y's are correlated, and in fact identical except for some added noise. So with no noise the prior reflects the assumption that lines are horizontal--all y's are the same.

Some theory

We'd like to know the distribution of the random variable at each node i, conditioned on all the data, i.e. we want the posterior

p(Yi = u | all the data).

If we could find this, we'd be able to: 1) say what the most probable value of y is, and 2) give a measure of confidence.

Let p(Yi = u | all the data) be normally distributed: NormalDistribution[mi, si].

Consider the ith unit. The posterior is

(3)  p(Yi = u | all the data) ∝ p(Yi = u | data before i) p(data at i | Yi = u) p(Yi = u | data after i)

Suppose that p(Yi = u | data before i) is also gaussian:

p(Yi = u | data before i) = a[u] ~ NormalDistribution[ma, sa]

and so is the probability conditioned on the data after i:

p(Yi = u | data after i) = b[u] ~ NormalDistribution[mb, sb]

And the noise model for the data:

p(data at i | Yi = u) = L[u] ~ NormalDistribution[yp, sD], where yp = data[[i]].

So in terms of these functions, the posterior probability of the ith unit taking on the value u can be expressed as proportional to a product of the three factors:

(4)  p(Yi = u | all the data) ∝ a[u] L[u] b[u]

audist = NormalDistribution[ma, sa];
a[u] = PDF[audist, u];
Ddist = NormalDistribution[yp, sD];
L[u] = PDF[Ddist, u];
budist = NormalDistribution[mb, sb];
b[u] = PDF[budist, u];
a[u]*L[u]*b[u]

E^(-(u - ma)^2/(2 sa^2) - (u - mb)^2/(2 sb^2) - (u - yp)^2/(2 sD^2)) / (2 Sqrt[2] Pi^(3/2) sa sb sD)
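The product above is, up to normalization, again a Gaussian: completing the square shows that its precision is the sum of the three precisions and its mean is the precision-weighted average of ma, mb, and yp. This is the property that lets belief propagation summarize each message by just a mean and a variance. Here is a minimal sketch of that check (not part of the original notebook preview); mnew and snew are names introduced here for illustration.

(* A quick check that a[u] L[u] b[u] is proportional to a single Gaussian. *)
post = a[u]*L[u]*b[u];                        (* uses the definitions above *)
snew = 1/Sqrt[1/sa^2 + 1/sb^2 + 1/sD^2];      (* s.d. from the combined precision *)
mnew = snew^2*(ma/sa^2 + mb/sb^2 + yp/sD^2);  (* precision-weighted mean *)
(* The ratio of the product to a Gaussian with mean mnew and s.d. snew is
   constant in u, so its derivative with respect to u simplifies to zero: *)
Simplify[D[post/PDF[NormalDistribution[mnew, snew], u], u]]

In the notation above, mnew and snew would be the posterior parameters mi and si for node i.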

