Lecture 18 • 1: 6.825 Techniques in Artificial Intelligence
Learning With Hidden Variables
• Why do we want hidden variables?
• Simple case of missing data
• EM algorithm
• Bayesian networks with hidden variables

Lecture 18 • 2: Hidden Variables
[Figure: a single Cause node with children E1, E2, …, En, beside a fully connected network over E1, …, E4 with the cause removed.]
With an explicit Cause node, the model needs only O(n) parameters. If Cause is unobservable and dropped from the model, all the evidence variables become dependent on each other, and the fully connected model needs O(2^n) parameters.

Lecture 18 • 3: Missing Data
• Given two variables, no independence relations
• Some data are missing
• Estimate parameters in the joint distribution
• Data must be missing at random

Case  A  B
 1    1  1
 2    1  1
 3    0  0
 4    0  0
 5    0  0
 6    0  H   (missing value)
 7    0  1
 8    1  0

Lecture 18 • 4: Ignore It
Estimate the parameters from the seven complete cases only, then score all eight:

Counts:          As probabilities:
       A    ¬A          A     ¬A
B     2/7  1/7    B    .285  .143
¬B    1/7  3/7    ¬B   .143  .429

log Pr(D | M) = log( Pr(D, H=0 | M) + Pr(D, H=1 | M) )
             = 3 log .429 + 2 log .143 + 2 log .285 + log(.429 + .143)
             = −9.498

Lecture 18 • 5: Recitation Problem
Show the remaining steps required to get from this expression

log Pr(D | M) = log( Pr(D, H=0 | M) + Pr(D, H=1 | M) )

to a number for the log likelihood of the observed data given the model. Explain any assumptions you might have had to make.

Lecture 18 • 6: Fill In With Best Value
Fill in the missing value with its most likely value, H = 0, and estimate from all eight cases:

Counts:          As probabilities:
       A    ¬A          A     ¬A
B     2/8  1/8    B    .25   .125
¬B    1/8  4/8    ¬B   .125  .5

log Pr(D | M) = log( Pr(D, H=0 | M) + Pr(D, H=1 | M) )
             = 3 log .5 + 2 log .125 + 2 log .25 + log(.5 + .125)
             = −9.481

Lecture 18 • 7: Fill In With Distribution
Guess a distribution over A, B and compute a distribution over H. Start with the uniform joint θ_0:

θ_0:   A    ¬A
B     .25  .25
¬B    .25  .25

Pr(H | D, θ_0) = Pr(H | D^6, θ_0)    (H appears only in case 6)
              = Pr(B | ¬A, θ_0)
              = Pr(¬A, B | θ_0) / Pr(¬A | θ_0)
              = .25 / 0.5
              = 0.5

Lecture 18 • 8: Fill In With Distribution
Use the distribution over H to compute a better distribution over A, B: maximum likelihood estimation using expected counts. Case 6 now counts as B = 1 with weight 0.5 and B = 0 with weight 0.5:

Expected counts:      θ_1:   A     ¬A
       A    ¬A        B     .25   .1875
B     2/8  1.5/8      ¬B    .125  .4375
¬B    1/8  3.5/8

Lecture 18 • 9: Fill In With Distribution
Use the new distribution over A, B to get a better distribution over H:

Pr(H | D, θ_1) = Pr(¬A, B | θ_1) / Pr(¬A | θ_1) = .1875 / .625 = 0.3

Lecture 18 • 10: Fill In With Distribution
Use the distribution over H to compute a better distribution over A, B. Case 6 now counts as B = 1 with weight 0.3 and B = 0 with weight 0.7:

Expected counts:      θ_2:   A     ¬A
       A    ¬A        B     .25   .1625
B     2/8  1.3/8      ¬B    .125  .4625
¬B    1/8  3.7/8

Lecture 18 • 11: Fill In With Distribution
Use the new distribution over A, B to get a better distribution over H:

Pr(H | D, θ_2) = Pr(¬A, B | θ_2) / Pr(¬A | θ_2) = .1625 / .625 = 0.26

Lecture 18 • 12: Fill In With Distribution
Use the distribution over H to compute a better distribution over A, B. Case 6 now counts as B = 1 with weight 0.26 and B = 0 with weight 0.74:

Expected counts:      θ_3:   A     ¬A
       A    ¬A        B     .25   .1575
B     2/8  1.26/8     ¬B    .125  .4675
¬B    1/8  3.74/8

Lecture 18 • 13: Increasing Log-Likelihood
log Pr(D | θ_0) = −10.3972
log Pr(D | θ_1) = −9.4760
log Pr(D | θ_2) = −9.4524
log Pr(D | θ_3) = −9.4514
For comparison: ignoring the incomplete case gives −9.498; filling in the best value gives −9.481.
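Slides 7 through 13 are one step repeated: use the current joint to compute a distribution over the missing value, then re-estimate the joint from expected counts. Here is a minimal Python sketch of that loop; it is my own code, not from the lecture, and names such as p_h and theta are assumptions. Run, it reproduces θ_1 through θ_3 and the log-likelihoods on slide 13 up to rounding:

```python
import math

# The eight cases from slide 3, as (A, B); None marks the missing value
# (case 6, where A = 0 and B is unobserved).
data = [(1, 1), (1, 1), (0, 0), (0, 0), (0, 0), (0, None), (0, 1), (1, 0)]

def log_likelihood(theta):
    """Log-likelihood of the observed data: the case with B missing
    contributes log Pr(A=0) = log(Pr(A=0,B=0) + Pr(A=0,B=1))."""
    ll = 0.0
    for a, b in data:
        if b is None:
            ll += math.log(theta[(a, 0)] + theta[(a, 1)])
        else:
            ll += math.log(theta[(a, b)])
    return ll

# theta_0: the uniform joint over (A, B), as on slide 7
theta = {(0, 0): .25, (0, 1): .25, (1, 0): .25, (1, 1): .25}

for t in range(3):
    # E step: distribution over the hidden value,
    # Pr(B=1 | A=0, theta) = Pr(A=0, B=1) / Pr(A=0)
    p_h = theta[(0, 1)] / (theta[(0, 0)] + theta[(0, 1)])
    # M step: maximum likelihood estimate from expected counts
    counts = {key: 0.0 for key in theta}
    for a, b in data:
        if b is None:
            counts[(a, 1)] += p_h        # case 6 split across B = 1 ...
            counts[(a, 0)] += 1 - p_h    # ... and B = 0
        else:
            counts[(a, b)] += 1
    theta = {key: c / len(data) for key, c in counts.items()}
    print(f"theta_{t+1}: {theta}  log-lik: {log_likelihood(theta):.4f}")
```

Note that only case 6 ever changes during the loop: the E step touches just the incomplete case, which is why the A-column entries of θ stay fixed at .25 and .125 across all the iterations above.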
"t +1= argmax"E˜ P t +1logPr(D,H |")Le cture 18 • 16EM for Bayesian Networks• D: observable variables• H: values of hidden variables in each case• Assume structure is known• Goal: maximum likelihood estimation of CPTs• Initialize CPTs to anything (with no 0’s)• Fill in the data set with distribution over values forhidden vars• Estimate CPTs using expected counts9Le cture 18 • 17Filling in the data• Distribution over H factors overthe M data cases• We really just need to compute a distribution overeach individual hidden variable• Each factor is a call to Bayes net inference! ˜ P t +1(H ) = Pr(H | D,"t)= Pr(Hm| Dm,"t)m#Le cture 18 • 18EM for BN: Simple CaseHD1D2Dn.2011.7000.3010.5111.2111.6101.1100.2010.9011Dn…D2D1! Pr(Hm| Dm,"t)Bayes netinference10Le cture 18 • 19EM for BN: Simple CaseHD1D2Dn.2011.7000.3010.5111.2111.6101.1100.2010.9011Dn…D2D1! Pr(Hm| Dm,"t)! E # (H " D2) = Pr(Hm| Dm,#t)I(D2m)m$= .9 + .2 + .2 + .5 + .3 + .2= 2.3! E # (H ) = Pr(Hm| Dm,"t)m#= 3.7! Pr(D2H) " 2.3/ 3. 7 = .6216Bayes netinferenceRe-estimate! "Le cture 18 • 20EM for BN: Worked ExampleHA B1010B41111060#A! "1= Pr(H )"2= Pr(A | H )"3= Pr(A |¬H )"4= Pr(B | H )"5= Pr(B |¬H )! Pr(Hm| Dm,"t)11Le cture 18 • 21EM for BN: Initial ModelHA B! Pr(H ) = 0.4Pr(A H ) = 0.55Pr(A¬H ) = 0.61Pr(B H ) = 0.43Pr(B¬H ) = 0.521010B41111060#A! Pr(Hm| Dm,"t)Le cture 18 • 22Iteration 1: Fill in dataHA B! Pr(H ) = 0.4Pr(A H ) = 0.55Pr(A¬H ) = 0.61Pr(B H ) = 0.43Pr(B¬H ) = 0.521010B.3341.4211.3910.4860#A! Pr(Hm| Dm,"t)12Le cture 18 • 23Iteration 1: Re-estimate ParamsHA B! Pr(H ) = 0.42Pr(A H ) = 0.35Pr(A¬H ) = 0.46Pr(B H ) = 0.34Pr(B¬H ) = 0.471010B.3341.4211.3910.4860#A! Pr(Hm| Dm,"t)Le cture 18 • 24Iteration 2: Fill in DataHA B! Pr(H ) = 0.42Pr(A H ) = 0.35Pr(A¬H ) = 0.46Pr(B H ) = 0.34Pr(B¬H ) = 0.471010B.2841.3911.3910.5260#A! Pr(Hm| Dm,"t)13Le cture 18 • 25Iteration 2: Re-estimate paramsHA B! Pr(H ) = 0.42Pr(A H ) = 0.31Pr(A¬H ) = 0.50Pr(B H ) = 0.30Pr(B¬H ) = 0.501010B.2841.2811.3910.5260#A! Pr(Hm| Dm,"t)Le cture 18 • 26Iteration 5HA B! Pr(H ) = 0.46Pr(A H ) = 0.09Pr(A¬H ) = 0.69Pr(B H ) = 0.09Pr(B¬H ) = 0.691010B.0541.3111.3110.7960#A! Pr(Hm| Dm,"t)14Le cture 18 • 27Iteration 10HA B! Pr(H ) = 0.52Pr(A H ) = 0.03Pr(A¬H ) = 0.83Pr(B H ) = 0.03Pr(B¬H ) = 0.831010B.00141.18311.18310.97160#A! Pr(Hm| Dm,"t)Le cture 18 • 28Increasing Log Likelihood-18-17-16-15-14-13-120 5 10 15 20iterationlog likelihoodnear .5all params .5near 015Le cture 18 • 29EM in BN issues• With multiple hidden nodes, take advantage ofconditional independencies• Lots of tricks to speed up computation of expectedcounts• If structure is unknown, add search operators toadd and delete hidden nodes• There are clever ways of search with unknownstructure and hidden

