Report for CS731 Term Project
Yue Pan

Learning Probabilities (CPTs) in the Presence of Missing Data Using Gibbs Sampling

1. Introduction

Belief networks (BNs), also known as Bayesian networks or directed probabilistic networks, are a graphical representation of probability distributions. These networks provide a compact and natural representation of uncertainty in artificial intelligence, and they have been applied successfully in expert systems, diagnostic engines, and optimal decision-making systems.

The most difficult and time-consuming part of building a Bayesian network model is coming up with the probabilities that quantify it. Probabilities can be derived from various sources: they can be obtained by interviewing domain experts to elicit their subjective probabilities, gathered from published statistical studies, or derived analytically from the combinatorics of some problems. Finally, they can be learned directly from raw data. Learning Bayesian networks has been one of the most active areas of research within the Uncertainty in AI community in recent years.

2. Fundamentals of Bayesian Networks

Bayesian networks are graphical models that encode probabilistic relationships among variables for problems of uncertain reasoning. They are composed of a structure and parameters. The structure is a directed acyclic graph that encodes a set of conditional independence relationships among the variables: the nodes of the graph correspond directly to the variables, and the directed arcs represent the dependence of variables on their parents. The absence of directed arcs between variables represents a conditional independence relationship. Take, for example, the network in Figure 0. The lack of arcs between the symptoms S1, S2, and S3 indicates that they are conditionally independent given C. In other words, knowledge of S1 is irrelevant to inferences about S2 once the value of C is known; if C is not known, then knowledge of S1 is relevant to inferences about the value of S2.

3. Experiment

3.1 Algorithm

We first initialize the states of all unobserved variables in the data set randomly, so that we have a complete sample data set. Second, we tally all the data points' values against their corresponding original B-prior distributions, update all the CPT values based on these new B-distributions, and save the updated CPT. Third, based on the updated CPT parameters, we use Gibbs sampling to resample all the missing data points in the data set, which again yields a complete data set. We iterate the second and third steps until the averages of all the saved CPTs are stable.
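To make this loop concrete, here is a minimal sketch in Python (not the code actually used for the project). It assumes binary nodes and a Beta prior on each CPT entry whose mean is the current, possibly wrong, CPT value; the small example network in `parents`, the prior strength, and names such as `learn_cpts_gibbs`, `update_cpts`, and `resample_missing` are illustrative assumptions, not part of the report.

import random
from collections import defaultdict

# Hypothetical network: node -> list of parents (all variables are binary, 0/1).
parents = {"C": [], "S1": ["C"], "S2": ["C"], "S3": ["C"]}
nodes = list(parents)

def all_configs(ps):
    """Every 0/1 assignment to a list of parent nodes (one empty tuple if there are no parents)."""
    cfgs = [()]
    for _ in ps:
        cfgs = [c + (v,) for c in cfgs for v in (0, 1)]
    return cfgs

def update_cpts(data, prior_cpts, prior_strength=2.0):
    """Re-estimate each CPT entry P(node=1 | parent config) from the completed data set,
    using the supplied (possibly deliberately wrong) CPT as a Beta prior."""
    counts = defaultdict(lambda: [0.0, 0.0])                 # [#value-0, #value-1] per (node, config)
    for row in data:
        for n in nodes:
            counts[(n, tuple(row[p] for p in parents[n]))][row[n]] += 1
    new = {}
    for n in nodes:
        new[n] = {}
        for cfg in all_configs(parents[n]):
            a = prior_cpts[n][cfg] * prior_strength          # Beta pseudo-count for value 1
            b = (1.0 - prior_cpts[n][cfg]) * prior_strength  # Beta pseudo-count for value 0
            zeros, ones = counts[(n, cfg)]
            new[n][cfg] = (ones + a) / (zeros + ones + a + b)
    return new

def resample_missing(data, missing, cpts):
    """One Gibbs sweep: redraw each missing value from its conditional given the rest of its row.
    For simplicity the full joint is evaluated; only the node's Markov blanket actually matters."""
    for row in data:
        for n in missing:
            weights = []
            for v in (0, 1):
                row[n] = v
                w = 1.0
                for m in nodes:
                    cfg = tuple(row[p] for p in parents[m])
                    p1 = cpts[m][cfg]
                    w *= p1 if row[m] == 1 else 1.0 - p1
                weights.append(w)
            row[n] = 1 if random.random() < weights[1] / (weights[0] + weights[1]) else 0

def learn_cpts_gibbs(data, missing, init_cpts, iters=200):
    """Alternate CPT re-estimation and Gibbs resampling; report the average of the saved CPTs."""
    for row in data:                                         # step 1: random initialization
        for n in missing:
            row[n] = random.randint(0, 1)
    cpts, saved = init_cpts, []
    for _ in range(iters):
        cpts = update_cpts(data, cpts)                       # step 2: tally and update CPT entries
        saved.append(cpts)
        resample_missing(data, missing, cpts)                # step 3: Gibbs-sample the missing values
    return {n: {cfg: sum(c[n][cfg] for c in saved) / len(saved)
                for cfg in all_configs(parents[n])} for n in nodes}

In practice one would monitor the running average of the saved CPTs and stop once it stabilizes, as described above, rather than using a fixed iteration count.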
3.2 Results and Analysis

In this section we report initial experimental results on a 10-node network (see the graph on the right). We assume all nodes have only two states (binomial distribution). The network has 33 CPT parameters (CPT entries) in total. The minimum number of CPT parameters for a node is one, which is the case when the node has no parents at all; the maximum is eight, which is the case when the node has three parents. We evaluate the performance of the learned networks by measuring how well they model the target distribution, i.e., how many CPT entries in the whole network fall within certain error ranges of the original (correct) CPT. The five error ranges we use are [0.00, 0.05), [0.05, 0.10), [0.10, 0.15), [0.15, 0.20), and 0.20 or more.

We generated all the original (correct) data points for the data set from the original (correct) CPT. When we say we set a deviation for the CPT in a test, we mean that every CPT entry is set off from its correct value by that deviation (either plus or minus). (Illustrative sketches of the data generation, the CPT perturbation, and the error bucketing appear at the end of this section.)

We performed the following experiments. In all cases, the blue bar represents the number of CPT entries in the network whose learned values are off from the correct ones by [0.00, 0.05). The red bar represents the number of entries with error in [0.05, 0.10), the yellow bar those in [0.10, 0.15), and the cyan bar those in [0.15, 0.20). The last bar in each group shows the number of entries whose errors are larger than 0.20.

Case 1:
We set all CPT entries off by 0.20 and made node 6 100% missing; all other nodes were observed. We input data sets of varying size. From the raw data in Table 1 (see the file size_on_node_6.doc in the hand-in directory; the same applies below), we generated Figure 1.

Figure 1. The data set size affects learning.

The results show that when the data set size is 100, out of 33 CPT entries, 13 are off from the correct ones by [0.00, 0.05), 8 by [0.05, 0.10), 8 by [0.10, 0.15), and 4 by [0.15, 0.20); none is off by more than 0.20. Since the deviation we set is 0.20, a data set of size 100 is not large enough for the "wrong CPT" to converge to the correct distribution. When the size of the data set was 500, the number of CPT entries with errors in [0.00, 0.05) increased, and the numbers in [0.05, 0.10) and [0.10, 0.15) decreased. This trend continues until the size of the data set reaches 1000. When the data set size is 3000, none of the CPT entries has an error of more than 0.10; beyond that, there is no significant change in the error distribution. We therefore conclude that the size of the data set affects how well the wrong CPTs converge to the correct CPTs, but the relationship is not linear.

As for the missing node itself, node 6, the error trend of its CPT values is the same as for the other nodes: when the data set size is 100 or 500, its error is more than 0.10; when the data set size is more than 1000, its error is below 0.05. Since a data set size of 3000 is large enough for all the wrong CPT values to "come back" to the correct ones, all the experiments below use 3000 as the fixed data set size.

Case 2:
We set node 6 100% missing and all other nodes observed, and set all CPT entries off by the deviations shown below. From the raw data in Table 2, we derived Figure 2.

Figure 2. Deviation affects learning.

When all the CPT entries were set 0.1 or 0.2 off their correct values, most entries came back to the correct values, and none of the learned results is off by more than 0.10. When they were set off by 0.3, only one CPT entry was off by 0.10 after learning. But when all the CPT entries were set off by 0.4, we observed one entry off by more than 0.20 after learning; in the worst case, there were 8 entries off by more than 0.20 after learning.
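To make the experimental setup concrete, here is a small companion sketch, under the same assumptions and hypothetical names as the listing in Section 3.1, of how a data set can be forward-sampled from the correct CPTs and how every CPT entry can be set off by a fixed deviation. The random choice of sign and the clipping to [0.01, 0.99] are our own assumptions; the report does not say how the sign is chosen or how boundary values are handled.

import random

def topological_order(parents):
    """Order the nodes so that every parent precedes its children."""
    order, placed = [], set()
    while len(order) < len(parents):
        for n in parents:
            if n not in placed and all(p in placed for p in parents[n]):
                order.append(n)
                placed.add(n)
    return order

def sample_data(true_cpts, parents, size):
    """Forward-sample `size` complete data points from the correct CPTs."""
    order = topological_order(parents)
    data = []
    for _ in range(size):
        row = {}
        for n in order:
            cfg = tuple(row[p] for p in parents[n])
            row[n] = 1 if random.random() < true_cpts[n][cfg] else 0
        data.append(row)
    return data

def perturb_cpts(true_cpts, deviation):
    """Set every CPT entry off by +/- `deviation`, clipped so it stays a valid probability."""
    wrong = {}
    for n, table in true_cpts.items():
        wrong[n] = {cfg: min(0.99, max(0.01, p + random.choice([-deviation, deviation])))
                    for cfg, p in table.items()}
    return wrong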

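Finally, here is a sketch of the error-range bucketing used to score the learned networks in Figures 1 and 2, followed by a hypothetical end-to-end run in the style of Case 1. The toy network and the concrete probabilities below are placeholders; the report's actual experiments use a 10-node network with node 6 fully missing.

def bucket_errors(learned_cpts, true_cpts):
    """Count how many CPT entries fall into each absolute-error range."""
    ranges = [(0.00, 0.05), (0.05, 0.10), (0.10, 0.15), (0.15, 0.20)]
    counts = [0] * 5                                         # last slot: error of 0.20 or more
    for n, table in true_cpts.items():
        for cfg, p_true in table.items():
            err = abs(learned_cpts[n][cfg] - p_true)
            for i, (lo, hi) in enumerate(ranges):
                if lo <= err < hi:
                    counts[i] += 1
                    break
            else:
                counts[4] += 1
    return counts

# Hypothetical correct CPTs for the toy network defined in the Section 3.1 sketch.
true_cpts = {"C":  {(): 0.3},
             "S1": {(0,): 0.10, (1,): 0.80},
             "S2": {(0,): 0.20, (1,): 0.70},
             "S3": {(0,): 0.15, (1,): 0.90}}

wrong_cpts = perturb_cpts(true_cpts, deviation=0.2)          # Case-1-style wrong starting CPT
data = sample_data(true_cpts, parents, size=3000)            # correct data from the correct CPTs
learned = learn_cpts_gibbs(data, missing=["S2"], init_cpts=wrong_cpts)  # "S2" plays the role of node 6; its observed values are discarded and re-sampled
print(bucket_errors(learned, true_cpts))                     # entry counts per error range, as in Figure 1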
