Bayesian Networks: Structure Learning
Machine Learning 10-701/15-781
Carlos Guestrin
Carnegie Mellon University
April 2nd, 2007
© 2005-2007 Carlos Guestrin

Review: Bayesian networks
(Example network: Flu, Allergy -> Sinus -> Headache, Nose)
- Compact representation for probability distributions
- Exponential reduction in number of parameters
- Fast probabilistic inference using variable elimination
  - Compute P(X | e)
  - Time exponential in tree-width, not in the number of variables
- Today: learn BN structure

Learning Bayes nets
- Known structure vs. unknown structure
- Fully observable data vs. missing data
- From data x^{(1)}, ..., x^{(m)} we learn the structure and the parameters (the CPTs P(X_i | Pa_{X_i}))

Learning the CPTs
- For each discrete variable X_i, the maximum likelihood estimate of the CPT is the empirical conditional frequency:
  \hat{P}(X_i = a \mid Pa_{X_i} = b) = \frac{Count(X_i = a, Pa_{X_i} = b)}{Count(Pa_{X_i} = b)}

Information-theoretic interpretation of maximum likelihood
(Example network: Flu, Allergy -> Sinus -> Headache, Nose)
- Given structure G, the log-likelihood of the data decomposes as
  \frac{1}{m} \log \hat{P}(D \mid \hat{\theta}, G) = \sum_i \hat{I}(X_i; Pa_{X_i}) - \sum_i \hat{H}(X_i)
  where \hat{I} is the empirical mutual information and \hat{H} the empirical entropy
- The entropy term does not depend on the structure, so maximizing likelihood means maximizing the mutual information between each node and its parents

Decomposable score
- Log data likelihood
- A decomposable score decomposes over families in the BN (a node and its parents):
  Score(G : D) = \sum_i FamScore(X_i, Pa_{X_i} : D)
- Will lead to significant computational efficiency

How many trees are there?
- n^{n-2} labeled trees on n nodes (Cayley's formula): super-exponentially many
- Nonetheless, an efficient optimal algorithm finds the best tree

Scoring a tree (1): equivalent trees (figure slide)
Scoring a tree (2): similar trees (figure slide)
- The likelihood score of a tree depends only on its undirected skeleton, so all orientations of the same spanning tree score the same

Chow-Liu tree learning algorithm (1)
- For each pair of variables X_i, X_j:
  - Compute the empirical distribution \hat{P}(x_i, x_j)
  - Compute the mutual information
    \hat{I}(X_i, X_j) = \sum_{x_i, x_j} \hat{P}(x_i, x_j) \log \frac{\hat{P}(x_i, x_j)}{\hat{P}(x_i)\,\hat{P}(x_j)}
- Define a graph: nodes X_1, ..., X_n; edge (i, j) gets weight \hat{I}(X_i, X_j)

Chow-Liu tree learning algorithm (2)
- Optimal tree BN: compute the maximum-weight spanning tree
- Directions in the BN: pick any node as root; breadth-first search defines
directions.

Can we extend Chow-Liu? (1)
- Tree-augmented naive Bayes (TAN) [Friedman et al. '97]
- The naive Bayes model overcounts, because correlation between features is not considered
- Same as Chow-Liu, but score edges with the conditional mutual information given the class, \hat{I}(X_i, X_j \mid C)

Can we extend Chow-Liu? (2)
- Approximately learning models with tree-width up to k [Narasimhan & Bilmes '04]
- But O(n^{k+1}), and more subtleties

What you need to know about learning BN structures so far
- Decomposable scores
  - Maximum likelihood
  - Information-theoretic interpretation
- Best tree: Chow-Liu
- Best TAN
- Nearly-best k-treewidth, in O(N^{k+1})

Scoring general graphical models: the model selection problem
- What's the best structure? (Flu, Allergy, Sinus, Headache, Nose)
- Data: (x_1^{(1)}, ..., x_n^{(1)}), ..., (x_1^{(m)}, ..., x_n^{(m)})
- The more edges, the fewer independence assumptions, the higher the likelihood of the data, but it will overfit

Maximum likelihood overfits!
- "Information never hurts": adding a parent always increases the score

Bayesian score avoids overfitting
- Given a structure, a distribution over parameters
- Difficult integral: use the Bayesian information criterion (BIC) approximation,
  \log P(D \mid \hat{\theta}, G) - \frac{\log m}{2}\, Dim(G),
  equivalent as m -> infinity
- Note: regularizes, as with an MDL score
- Finding the best BN under BIC is still NP-hard

How many graphs are there?

Structure learning for general graphs
- In a tree, a node has only one parent
- Theorem: the problem of learning a BN structure with at most d parents is NP-hard for any (fixed) d >= 2
- Most structure learning approaches use heuristics
  - Exploit score decomposition
  - (Quickly) describe two heuristics that exploit decomposition in different ways

Learn BN structure using local search
- Starting from the Chow-Liu tree
- Local search, possible moves: add edge, delete edge, invert edge
- Score using BIC

What you need to know about learning BNs
- Maximum likelihood or MAP learns the parameters
- Decomposable score
- Best tree: Chow-Liu
- Best
TAN
- Other BNs: usually local search with a BIC score

Unsupervised Learning, or Clustering: K-means and Gaussian mixture models
Machine Learning 10-701/15-781
Carlos Guestrin
Carnegie Mellon University
April 2nd, 2007

Some Data (figure slide: scatter plot of unlabeled points)

K-means
1. Ask user how many clusters they'd like (e.g., k = 5)
2. Randomly guess k cluster center locations
3. Each datapoint finds out which center it's closest to (thus each center "owns" a set of datapoints)
4. Each center finds the centroid of the points it owns...
5. ...and jumps there
6. Repeat until terminated!

Unsupervised Learning
- You walk into a bar. A stranger approaches and tells you:
  "I've got data from k classes. Each class produces observations with a normal distribution and variance \sigma^2 \cdot I. Standard simple multivariate Gaussian assumptions. I can tell you all the P(w_i)'s."
- So far, looks straightforward.
  "I need a maximum likelihood estimate of the \mu_i's."
- No problem.
  "There's just one thing. None of the data are labeled. I have datapoints, but I don't know what class they're from (any of them!)."
- Uh oh!!

Gaussian Bayes Classifier Reminder
  P(y = i \mid x) = \frac{p(x \mid y = i)\, P(y = i)}{p(x)}
                  = \frac{\frac{1}{(2\pi)^{m/2} \|\Sigma_i\|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)\right) p_i}{p(x)}
- How do we deal with that?

Predicting wealth from age (figure slides)
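The K-means loop listed above (guess centers, assign points to the nearest center, move each center to the centroid it owns, repeat) can be sketched as follows; the function name and termination test are mine, not from the slides:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain K-means on data X (n points x d dims), illustrative sketch."""
    rng = np.random.default_rng(seed)
    # Step 2: randomly guess k cluster center locations (here: random datapoints)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: each datapoint finds out which center it is closest to
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        owner = dists.argmin(axis=1)
        # Steps 4-5: each center computes the centroid of the points it owns
        # and jumps there (an empty center stays put)
        new_centers = np.array([X[owner == j].mean(axis=0) if np.any(owner == j)
                                else centers[j] for j in range(k)])
        # Step 6: terminate once the centers stop moving
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, owner
```

On two well-separated blobs with k = 2, each blob ends up owned by one center regardless of the random initialization.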
Learning: modelyear, mpg, maker (figure slide)

General: fully general covariance
  \Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1m} \\ \sigma_{12} & \sigma_2^2 & \cdots & \sigma_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{1m} & \sigma_{2m} & \cdots & \sigma_m^2 \end{pmatrix}
  O(m^2) parameters

Aligned: axis-aligned (diagonal) covariance
  \Sigma = \begin{pmatrix} \sigma_1^2 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_m^2 \end{pmatrix}
  O(m) parameters

Spherical: \Sigma = \sigma^2 I, with O(1) covariance parameters
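The O(m^2) / O(m) / O(1) counts above can be checked directly; a small sketch (the function name is mine) counting the free covariance parameters of each model for m features:

```python
def cov_params(m, model):
    """Free parameters in an m-dimensional Gaussian's covariance matrix."""
    if model == "general":
        # full symmetric matrix: m variances + m(m-1)/2 off-diagonal covariances
        return m * (m + 1) // 2
    if model == "aligned":
        # diagonal matrix: one variance per feature
        return m
    if model == "spherical":
        # a single shared variance sigma^2
        return 1
    raise ValueError(f"unknown model: {model}")
```

For m = 10 features this gives 55, 10, and 1 parameters respectively, which is why the aligned and spherical models are attractive when data is scarce.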
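Putting the classifier reminder together with the spherical case: in the stranger-in-the-bar setting (known means \mu_i, known priors P(w_i), shared covariance \sigma^2 I), the posterior P(y = i | x) can be computed as below; all names are illustrative, and the shared exp normalization constant cancels in Bayes' rule:

```python
import numpy as np

def posterior(x, mus, priors, sigma2):
    """P(y = i | x) for k Gaussian classes sharing covariance sigma^2 * I."""
    x = np.asarray(x, dtype=float)
    mus = np.asarray(mus, dtype=float)
    # log p(x | y=i), up to an additive constant common to all classes:
    # the quadratic (x - mu_i)^T Sigma^{-1} (x - mu_i) reduces to
    # ||x - mu_i||^2 / sigma^2 for spherical covariance
    log_lik = -np.sum((x - mus) ** 2, axis=1) / (2.0 * sigma2)
    log_post = log_lik + np.log(np.asarray(priors, dtype=float))
    log_post -= log_post.max()        # subtract max for numerical stability
    post = np.exp(log_post)
    return post / post.sum()          # normalizing implements division by p(x)
```

A point near \mu_0 gets nearly all its posterior mass on class 0; the unsupervised-learning problem of the lecture is precisely that the \mu_i's here are unknown.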