New version page

TEMPLE CIS 664 - A Bayesian Method for the Induction of Probabilistic Networks from Data

Pages: 25
Documents in this Course

This preview shows page 1-2-24-25 out of 25 pages.

View Full Document

End of preview. Want to read all 25 pages?

View Full Document
Unformatted text preview:

Slide 1Slide 2Slide 3Slide 4Slide 5Slide 6Slide 7Slide 8Slide 9Slide 10Slide 11Slide 12Slide 13Slide 14Slide 15Slide 16Slide 17Slide 18Slide 19Slide 20Slide 21Slide 22Slide 23Slide 24Slide 25CIS664 KD&DMG.F. Cooper, E. HerskovitsA Bayesian Method for the Induction of Probabilistic Networks from DataPresented by Uroš MidićMar 21, 2007IntroductionBayesian Belief NetworkLearning Network ParametersLearning Network StructureProbability of Network Structure Given a DatasetFinding the Most Probable Network StructureK2 algorithmExperimental resultPros and consIntroductionEvents A and B are independent ifP(A∩B)=P(A)P(B)or in the case that P(A)>0 and P(B)>0,P(A|B) = P(A) and P(B|A) = P(B)IntroductionDiscrete-valued random variables X and Y are independent iffor any a, b, the events X=a and Y=b are independent, i.e.P(X=a∩Y=b)=P(X=a)P(Y=b)OrP(X=a|Y=b) = P(X=a)IntroductionLet X, Y, and Z be three discrete-valued random variables. X is conditionally independent of Y given Z if for any a, b, c,P(X=a | Y=b,Z=c)=P(X=a|Z=c)This can be extended to sets of variables, e.g. we can say that X1…Xl is conditionally independent of Y1…Ym given Z1…Zn, ifP(X1…Xl | Y1…Ym, Z1…Zn) = P(X1…Xl | Z1…Zn)Introduction Let X = {X1, …, Xn} be a set of discrete-valued variables, and each variable Xi has a defined set of possible values V(Xi).The joint space of the set of variables X is defined as V(X1)xV(X2)x…xV(Xn).The probability distribution over the joint space specifies the probability for each of the possible variable bindings for the tuple (X1, …, Xn), and is called the joint probability distribution.Bayesian Belief Networks A Bayesian Belief Network represents the joint probability distribution for a set of variables.It specifies a set of conditional independence assumptions (in form of a Directed Acyclic Graph) and sets of local conditional probabilities (in form of tables).Each variable is represented by a node in the graph. A network arc represents the assertion that the variable is conditionally independent of its non-descendants, given its immediate predecessors.Bayesian Belief Networks Example:StormBusTourGroupLightning CampfireThunder ForestFireCampfire S,B S,¬B ¬S,B ¬S,¬B C 0.4 0.1 0.8 0.2¬C 0.6 0.9 0.2 0.8Bayesian Belief NetworksStormBusTourGroupLightning CampfireThunder ForestFireCampfire S,B S,¬B ¬S,B ¬S,¬B C 0.4 0.1 0.8 0.2¬C 0.6 0.9 0.2 0.8ExampleP(Campfire = True|Storm=True,BusTourGroup = True) = 0.4Bayesian Belief Networks For any assignment of values (x1,…,xn) to the variables (X1, …, Xn), we can compute the joint probabilityP(x1,…,xn) = ∏i=1..nP(xi|Parents(Xi))The values P(xi|Parents(Xi)) come directly from the tables associated with respective Xi.We can therefore recover any probability of the form P(X1|X2), where X1,X2 are subsets ofX = {X1, …, Xn}.Learning Network ParametersWe are given a network structure.We are given a dataset, such that for each training example all variables have assigned values, i.e. we always get a full assignment (x1,…,xn) to the variables (X1, …, Xn).Then learning the conditional probability tables is straightforward:1. Initialize then counter in all the tables to 0,2. For each training example increase all the appropriate counters in the tables3. Convert the counts into probabilities.Learning Network ParametersWe are given a network structure but the dataset has missing values.In this case, learning the conditional probability tables is not straightforward, and usually involves a gradient-ascent procedure to maximize P(D|h) – probability of the dataset given the model h.Note that D (dataset) is fixed and we search for the best h that maximizes P(D|h).Learning Network StructureThe assumption for learning the network parameters is that we know (or assume) a network structure.In simple cases, the network structure is constructed by a domain expert.In most cases a domain expert is not available, or the dataset is so complicated that even a domain expert is powerless.Example: Gene-expression microarray datasets have > 10K variables.Probability of N.S. Given a DatasetWe are given a dataset and two networks. Which of the networks is better fit for this dataset:Case x1x2x31 + - - S1:2 + + +3 - - +4 + + + S2:5 - - -6 - + +7 + + +8 - - -9 + + +10 - - -x1x2x3x1x2x3P(BS1|D) is much greaterthan P(BS2|D). S1 was used to generate the dataset.Probability of N.S. Given a DatasetWe are given a dataset and two networks.How to calculate P(BS,D)?        DBPDBPDPDBPDPDBPDBPDBPSSSSSS,,,,||212121Probability of N.S. Given a DatasetAssumptions made in the paper:1. The database variables are discrete2. Cases occur independently, given a belief-network model.3. There are no cases that have variables with missing values.4. Prior probabilities (before observing the data) for conditional probability assignments are uniform.Probability of N.S. Given a DatasetHow to calculate P(BS,D)?Xi has a set of parents πiXi has ri possible assignments (vi1, …, viri)There are qi unique instantiations (wi1, wi2, … , wiqi) for πi in D.Nijk is the number of cases in D in which variable xi has value vik and πi is instantiated as wij.Let Nij = ∑k=1..riNijkWith the four assumptions we get a formula:        niqjrkijkiijiSSi iNrNrBPDBP1 1 1!!1!1,Probability of N.S. Given a DatasetHow to calculate P(BS,D)?Surprisingly, after some indexing and constant bounding we get that the time complexity for this formula is O(mn) – linear in the number of variables and number of cases.        niqjrkijkiijiSSi iNrNrBPDBP1 1 1!!1!1,Probability of N.S. Given a DatasetWe have an efficient way to calculate P(BS,D)?We could calculate allP(BSi|D) = P(BSi,D)/(∑P(BS,D))and find the optimal one, but the set of possible BSi grows exponentially with n.However, if we had a situation where∑BS Y∊P(BS,D) ≈P(D),and Y is small enough, then we could efficiently approximate all P(BS|D) for BS Y.∊K2 algorithmWe start with:The time complexity is O(mn2r2n). However, if we assume that a node can have at most u parents, then the complexity is O(munrT(n,u)), where T(n,u) = ∑k=0..uchoose(n,k)        niqjrkijkiijiSSi iNrNrBPDBP1 1 1!!1!1,       

View Full Document Unlocking...