
6.825 Techniques in Artificial Intelligence
Lecture 18: Learning With Hidden Variables

Lecture 18 • 1
• Why do we want hidden variables?

In this lecture, we'll think about how to learn Bayes nets with hidden variables. We'll start out by looking at why you'd want to have models with hidden variables.

Lecture 18 • 2
• Simple case of missing data

Then, because the technique we'll use for working with hidden variables is a bit complicated, we'll start by looking at a simpler problem: estimating probabilities when some of the data are missing.

Lecture 18 • 3
• EM algorithm

That will lead us to the EM algorithm in general.

Lecture 18 • 4
• Bayesian networks with hidden variables

And we'll finish by seeing how to apply it to Bayes nets with hidden nodes; we'll work a simple example of that in great detail.

Lecture 18 • 5: Hidden variables

Why would we ever want to learn a Bayesian network with hidden variables? One answer is that we might be able to learn lower-complexity networks that way. Another is that such networks sometimes reveal interesting structure in our data.

Lecture 18 • 6: Hidden variables
[Figure: observable evidence nodes E1, E2, E3, E4]

Consider a situation in which you can observe a whole bunch of different evidence variables, E1 through En. Maybe they're all the different symptoms that a patient might have. Or maybe they represent different movies and whether someone likes them.

Lecture 18 • 7: Hidden variables
Without the cause, all the evidence variables are dependent on each other: O(2^n) parameters.

If those variables are all dependent on one another, then we need a highly connected graph that's capable of representing the entire joint distribution over the variables. Because the last node has n-1 parents, it takes on the order of 2^n parameters to specify the conditional probability tables in this network.

Lecture 18 • 8: Hidden variables
[Figure: an unobservable Cause node with children E1, E2, ..., En]

But in some cases we can get a considerably simpler model by introducing an additional "cause" node. It might represent the underlying disease state that was causing the patients' symptoms, or some division of people into those who like westerns and those who like comedies.

Lecture 18 • 9: Hidden variables
With the hidden cause: O(n) parameters.

In the simpler model, the evidence variables are conditionally independent given the cause. That means it takes only on the order of n parameters to describe all the CPTs in the network: at each evidence node we just need a table of size 2 (if the cause is binary; or of size k, if the cause can take on k values), plus one parameter (or k-1 parameters) to specify the probability of the cause.
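To make the parameter-counting argument concrete, here is a small sketch (mine, not from the lecture; the function names are just for illustration) comparing the number of free CPT parameters in the fully connected model with those in the hidden-cause model, assuming binary evidence variables:

def fully_connected_params(n):
    # Full joint over n binary variables: node i has i-1 parents, so its
    # CPT needs 2^(i-1) free entries; the total is 2^n - 1.
    return sum(2 ** i for i in range(n))

def hidden_cause_params(n, k=2):
    # Hidden cause with k values: k-1 parameters for P(Cause), plus a
    # table of k free entries for each binary P(E_i | Cause).
    return (k - 1) + n * k

for n in (4, 10, 20):
    print(n, fully_connected_params(n), hidden_cause_params(n))
# n = 20 gives 1,048,575 parameters for the fully connected model versus 41.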
Lecture 18 • 10: Hidden variables

So, what if you think there's a hidden cause? How can you learn a network with unobservable variables?

Lecture 18 • 11: Missing Data
• Given two variables, no independence relations

   A  B
   1  0
   0  1
   0  H
   0  0
   0  0
   0  0
   1  1
   1  1

(H marks a missing value.)

We'll start out by looking at a very simple case. Imagine that you have two binary variables, A and B, and you know they're not independent, so you're just trying to estimate their joint distribution. Ordinarily, you'd just count up how many cases were true, true; how many were false, false; and so on, and divide by the total number of data cases to get your maximum likelihood probability estimates.

Lecture 18 • 12: Missing Data
• Some data are missing

But in our case, some of the data are missing. If a whole case were missing, there wouldn't be much we could do about it; there's no real way to guess what it might have been that would help us in our estimation process. But if some variables in a case are filled in and others are missing, then we'll see how to make use of the variables that are filled in, and how to get a probability distribution over the missing data.

Lecture 18 • 13: Missing Data
• Estimate parameters in joint distribution

Here, in our example, we have 8 data points, but one of them is missing a value for B.

Lecture 18 • 14: Missing Data
• Data must be missing at random

In order for the methods we'll talk about here to be of use, the data have to be missing at random. That means that the fact that a data item is missing is independent of the value it would have had. So, for instance, if you didn't take somebody's blood pressure because he was already dead, then that reading would not be missing at random! But if the blood-pressure instrument had random failures, unrelated to the actual blood pressure, then that data would be missing at random.

Lecture 18 • 15: Ignore it
Estimated parameters (from the 7 complete cases):

          A            ~A
   B      2/7 (.285)   1/7 (.143)
  ~B      1/7 (.143)   3/7 (.429)

The simplest strategy of all is to just ignore any cases that have missing values. In our example, we'd count the number of cases in each bin and divide by 7 (the number of complete cases).

Lecture 18 • 16: Ignore it

log Pr(D | M) = log( Pr(D, H=0 | M) + Pr(D, H=1 | M) )
              = 3 log .429 + 2 log .143 + 2 log .285 + log(.429 + .143)
              = -9.498

It's easy, and it gives us a log likelihood score of -9.498. Whether that's good or not remains to be seen; we'll have to see what results we get with other methods.

Lecture 18 • 17: Ignore it

Note that, in order to compute the log likelihood of the actual data (which includes the case whose value of B is missing), we have to sum over both possible values of the missing entry, H=0 and H=1, as in the expression above.
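To make the "ignore it" computation concrete, here is a minimal sketch (mine, not from the lecture; the variable names are illustrative) that estimates the joint from the seven complete cases and then scores all eight cases, marginalizing over the missing value of B:

import math

# The 8 cases from the slide as (A, B) pairs; None marks the missing B value.
data = [(1, 0), (0, 1), (0, None), (0, 0), (0, 0), (0, 0), (1, 1), (1, 1)]

# Maximum-likelihood estimate of the joint P(A, B) from the complete cases only.
complete = [(a, b) for a, b in data if b is not None]
counts = {}
for case in complete:
    counts[case] = counts.get(case, 0) + 1
p = {case: c / len(complete) for case, c in counts.items()}
# p[(1,1)] = 2/7, p[(0,1)] = 1/7, p[(1,0)] = 1/7, p[(0,0)] = 3/7

# Log likelihood of all 8 cases under these parameters; the incomplete case
# contributes log P(A=0) = log( P(A=0, B=0) + P(A=0, B=1) ).
ll = 0.0
for a, b in data:
    if b is None:
        ll += math.log(p[(a, 0)] + p[(a, 1)])
    else:
        ll += math.log(p[(a, b)])

print(round(ll, 3))  # -9.499; the slide reports -9.498 using rounded probabilities

The strategies discussed later in the lecture can then be compared against this baseline score.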

