
TODAY

A bit more variational approximation.
Kozlov and Koller: Dynamic Discretization.


Variational Approximation

[Figure: plot of $\log(1 - e^{-x})$ for $x \in [0, 1]$, together with the fixed-point condition $\sum_{j \in \mathrm{adj}(i)} (\cdots) = 0$ for the variational parameters $\theta$.]


Kozlov and Koller: Dynamic Discretization

• Discrete approximations to continuous distributions
• Dynamic discretization:
  – Straw-man approach
  – Revised approach using a weighted KL distance


Discretization

Approximate a continuous probability density function by a piecewise-constant function.


Discretization

Assume that the continuous variables lie in [0, 1]. The full state space for $n$ variables is $\Omega = [0, 1]^n$.

A discretization is a piecewise-constant function $D : \Omega \to \{1, \dots, m\}$. It defines a mutually exclusive and collectively exhaustive set of subregions $\{\omega_1, \dots, \omega_m\}$ of $\Omega$.

[Figure: a partition of $\Omega$ into subregions labeled 1-5.]


Digression: KL-distance

Why do we always use KL-distance? Basic information theory: say that we want to efficiently encode a bunch of values $y_1, \dots, y_n$ with probabilities $p_1, \dots, p_n$. Shannon's encoding theorem says that the best that I can do is to code each value $y_j$ using $\lg(1/p_j) = -\lg p_j$ bits. The expected total number of bits to encode $M$ values is $-M \sum_j p_j \lg p_j$, and the average number of bits per word is

$$H(P) = -\sum_j p_j \lg p_j.$$


Digression: Encoding

Say that $y_1, y_2, y_3, y_4$ each have probability 1/4: the best encoding is 00, 01, 10, 11.

Say that $P(y_1) = 1/2$, $P(y_2) = 1/4$, $P(y_3) = P(y_4) = 1/8$: the best encoding is 0, 10, 110, 111.


Digression: KL-distance

OK, so say that we are encoding samples from $P$, but our coding is based on another distribution $Q$. The expected number of bits to encode $M$ letters drawn from $P$ using a code based on $Q$ is $-M \sum_j p_j \lg q_j$. If we had access to the distribution $P$ before encoding, we would need only $-M \sum_j p_j \lg p_j$ bits.


Digression: KL-distance

The average number of "extra" bits needed to represent $P$ if we used the wrong encoding distribution $Q$ is:

$$KL(P \| Q) = -\sum_j p_j \lg q_j - \Big(-\sum_j p_j \lg p_j\Big) = \sum_j p_j \lg \frac{p_j}{q_j}.$$

For continuous densities:

$$KL(p \| q) = \int p(x) \ln \frac{p(x)}{q(x)}\, dx.$$

(END DIGRESSION)


Discretization

If a probability function is constant in each of the subregions of a discretization $D$, we will call it a discretized function on $D$:

$$f^D(x_1, \dots, x_n) = g(D(x_1, \dots, x_n)).$$

How should I choose $g$ to minimize $KL(f \| f^D)$?

[Figure: $f(x)$ overlaid with the piecewise-constant values $g(1)$, $g(2)$, $g(3)$.]


Optimal values for discretization

The $g$ that minimizes the KL distance is just

$$g(i) = \frac{1}{|\omega_i|} \int_{\omega_i} f(x)\, dx,$$

that is, $g$ is the average value of $f$ over each discretization region.

The KL distance to any other piecewise-constant function $h^D$ on $D$ is given by the sum of the KL distances:

$$KL(f \| h^D) = KL(f \| f^D) + KL(f^D \| h^D).$$


How do we generate discretizations?

Consider only discretizations given by binary split partition (BSP) trees.

[Figure: a BSP tree that splits on $x$ and then on $y$, with leaf regions $a, b, c, d, e$.]


Heuristic for generating good BSPs

Find the leaf $i$ with the maximum KL-distance and split it. Its contribution is

$$\int_{\omega_i} f(x) \ln \frac{f(x)}{g(i)}\, dx,$$

which is too hard to integrate exactly, so bound the $\int_{\omega_i} f \ln f$ term using the minimum and maximum of $f$ on the region:

$$\int_{\omega_i} f(x) \ln f(x)\, dx \;\le\; \left[ \frac{f_{\max} - \bar f}{f_{\max} - f_{\min}}\, f_{\min} \ln f_{\min} + \frac{\bar f - f_{\min}}{f_{\max} - f_{\min}}\, f_{\max} \ln f_{\max} \right] |\omega_i|.$$

Remember to show viewgraph with discretizations.


Operations on discretized functions

To add two discretized functions, first align the discretizations of the two trees; the values at the leaves of the aligned tree are the sums of the corresponding leaves of both trees.

[Figure: two BSP trees $A$ and $B$ with leaves $a$-$e$ added to give $C$; a leaf $a$ is refined into $a_1$, $a_2$ during alignment.]


One possible join tree algorithm

Cliques $C_1 = \{x, y\}$ and $C_2 = \{y, z\}$ with separator $S = \{y\}$.

1. Discretize $C_1$ and $C_2$.
2. Absorb from $C_1$ to $C_2$:
$$\phi'_s(y) = \int \phi_{c_1}(x, y)\, dx, \qquad \phi'_{c_2}(y, z) = \phi_{c_2}(y, z)\, \frac{\phi'_s(y)}{\phi_s(y)}.$$
3. Absorb from $C_2$ back to $C_1$:
$$\phi''_s(y) = \int \phi'_{c_2}(y, z)\, dz, \qquad \phi'_{c_1}(x, y) = \phi_{c_1}(x, y)\, \frac{\phi''_s(y)}{\phi'_s(y)}.$$

Absorption ensures that $C_1$ and $C_2$ are locally consistent:

$$\int \phi_{c_1}(x, y)\, dx = \int \phi_{c_2}(y, z)\, dz.$$

It is better to inform the discretization with the previous observations. [Call this "propagation with discretization".] For cliques $C_1$ and $C_2$ with separator $S$:

1. Discretize $C_1$.
2. Marginalize $C_1$ to $S$.
3. Discretize $C_2 \cdot S$ (the product of $C_2$'s potential and the message on $S$).
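Before the worked example, here is a minimal 1-D sketch of the discretization machinery above (the function names `region_stats` and `discretize`, the grid integration, and the Beta(8, 2) test density are illustrative assumptions, not Kozlov and Koller's implementation). It greedily splits the leaf region with the largest KL contribution, computing that contribution numerically rather than through the min/max bound on the slide, and sets each region's value $g(i)$ to the average of $f$ over the region.

```python
import numpy as np
from scipy.stats import beta

def region_stats(f, lo, hi, n=400):
    """Average value of f on [lo, hi] (the optimal piecewise-constant value g(i))
    and the region's KL contribution, the integral of f*ln(f/g), on a uniform grid."""
    x = np.linspace(lo, hi, n)
    fx = f(x)
    dx = (hi - lo) / n
    g = fx.mean()                              # g(i): average of f over the region
    pos = fx > 0
    kl = np.sum(fx[pos] * np.log(fx[pos] / g)) * dx
    return g, kl

def discretize(f, m):
    """Greedy binary-split discretization of a density f on [0, 1]:
    repeatedly split the leaf region whose KL contribution is largest."""
    leaves = [(0.0, 1.0)]
    while len(leaves) < m:
        kls = [region_stats(f, lo, hi)[1] for lo, hi in leaves]
        i = int(np.argmax(kls))                # leaf with the maximum KL-distance
        lo, hi = leaves.pop(i)
        mid = 0.5 * (lo + hi)
        leaves[i:i] = [(lo, mid), (mid, hi)]   # binary split of that leaf
    return [(lo, hi, region_stats(f, lo, hi)[0]) for lo, hi in leaves]

# Example: discretize a sharply peaked Beta(8, 2) density on [0, 1] into 8 regions.
for lo, hi, g in discretize(lambda x: beta.pdf(x, 8, 2), m=8):
    print(f"[{lo:.3f}, {hi:.3f}]  g = {g:.3f}")
```

On a peaked density the splits end up concentrated around the peak and its steep flanks, where a single constant value fits worst.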
Example

[Figure: a chain $X_1 \to X_2 \to X_3$ with an observation $o_i$ attached to each $X_i$.]

$$p(x_1) = 1$$
$$p(x_2 \mid x_1) \sim N[x_2 - x_1;\ 0,\ 0.01]$$
$$p(x_3 \mid x_2) \sim N[x_3 - x_2;\ 0,\ 0.01]$$
$$p(o_1 \mid x_1) \sim N[o_1 - x_1;\ 0,\ 0.01]$$
$$p(o_2 \mid x_2) \sim N[o_2 - x_2;\ 0,\ 0.01]$$
$$p(o_3 = \text{True} \mid x_3) = \left(1 + e^{40(x_3 - 0.5)}\right)^{-1}$$


The problem…

[Figure: cliques $C_1$ and $C_2$ joined by separator $S$.]

Say that we discretize $C_2$ and then propagate to $C_1$. The discretization for $C_1$ (the root) is pretty good, because it includes all of the "messages" from all of the other evidence. The discretization for $C_2$ (the leaf), however, does not include the message from $C_1$. If the prior for $C_2$ and the likelihood due to the message from $C_1$ are very different (low $P(E)$), the discretization is poor.

Remember example.


Dynamic Discretization

1. Use a weighted KL distance to select the discretization.
2. The goal is to assign weights to the cliques so that minimizing the WKL distance minimizes the error in the probability of the query node $q$ given evidence $e$.

$$WKL(p \| q; w) = \int w(x)\, p(x) \ln \frac{p(x)}{q(x)}\, dx, \qquad w(x) > 0.$$


Selecting the weights

Assume that the weight for a parent clique is already known. How do we select the weight for a child clique to minimize the WKL distance for the parent? The weight for the root clique is 1.

Proof: the weight functions for adjacent cliques must obey

$$\int w(x, y)\, \phi_{c_1}(x, y)\, dx = \int w(y, z)\, \phi_{c_2}(y, z)\, dz.$$


Weight Propagation

This last relationship looks like local consistency. So, update the weights going down the tree by using absorption to compute the product $w\,\phi_c$.


The algorithm

Iteratively determine the weights:

1. Assign a weight of 1 to all cliques.
2. Use standard propagation to update all of the messages toward the root (goal) clique.
3. Using the resulting clique potentials, determine new weights by propagating the product of the weight and the clique potential back down the tree.

Repeat steps 2 and 3.


If discretization is perfect...

[Figure: a chain junction tree over variables $a, b, c, d$ with cliques $\{a, b\}$, $\{b, c\}$, $\{c, d\}$ and separators $\{b\}$, $\{c\}$.]

$$\phi_1 = p(a, b, e_a, e_b, e_c, e_d), \qquad \phi_2 = p(c, e_c, e_d \mid b), \qquad \phi_3 = p(d, e_d \mid c)$$
$$w_2 = p(b, e_a, e_b), \qquad w_3 = p(c, e_a, e_b, e_c)$$
$$w_1 \phi_1 = p(a, b, e_a, e_b, e_c, e_d), \quad w_2 \phi_2 = p(b, c, e_a, e_b, e_c, e_d), \quad w_3 \phi_3 = p(c, d, e_a, e_b, e_c, e_d)$$


Next Time

Next week: skiing.

The two weeks after that:
• Normal distributions
• Junction tree algorithm
• Learning distributions (2 lectures)
• Conjugate distributions
• Plates
• Learning structure (4 lectures)
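As a closing illustration of the weighted KL distance defined above, here is a minimal numerical sketch (the function name `wkl`, the grid approximation, and the toy prior and message are illustrative assumptions, not Kozlov and Koller's code). The ordinary KL distance charges the approximation error everywhere equally; the weighted version charges it only where a downstream message, here a steep sigmoid like the $p(o_3 \mid x_3)$ likelihood in the example, actually puts weight.

```python
import numpy as np

def wkl(p, q, w, x):
    """Weighted KL distance WKL(p || q; w) = integral of w(x) p(x) ln(p(x)/q(x)) dx,
    approximated on a uniform grid x."""
    dx = x[1] - x[0]
    pos = p > 0
    return np.sum(w[pos] * p[pos] * np.log(p[pos] / q[pos])) * dx

# Toy illustration: weight a clique's discretization error by a downstream message.
x = np.linspace(0.0, 1.0, 2001)
p = np.exp(-0.5 * ((x - 0.3) / 0.1) ** 2)          # clique potential (unnormalized prior)
p /= np.sum(p) * (x[1] - x[0])                     # normalize to a density on [0, 1]
q = np.ones_like(x)                                # one-region discretization of p (its average value)
w_flat = np.ones_like(x)                           # weight 1 everywhere: ordinary KL
w_msg = 1.0 / (1.0 + np.exp(-40.0 * (x - 0.5)))    # steep sigmoid message from evidence deeper in the tree
print(wkl(p, q, w_flat, x), wkl(p, q, w_msg, x))   # the weighted error is the one that matters for the query
```

In the dynamic-discretization loop, the weight would be the product of the clique potential and the weights propagated down from the root, recomputed on each iteration of steps 2 and 3.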

