Bayesian Networks: Structure Learning
Machine Learning 10-701/15-781, Carlos Guestrin, Carnegie Mellon University, November 16th, 2009 (slides © 2005-2009 Carlos Guestrin)

Review: Bayesian networks
- Example network: Flu and Allergy are parents of Sinus; Sinus is a parent of Headache and Nose.
- Compact representation for probability distributions: exponential reduction in the number of parameters.
- Fast probabilistic inference using variable elimination: computing P(X | e) takes time exponential in the tree-width, not in the number of variables.
- Today: learning the BN structure from data.

Learning Bayes nets
- Given data x^(1), ..., x^(m), learn both the structure (the graph) and the parameters (the CPTs P(Xi | Pa_Xi)).

Learning the CPTs
- For each discrete variable Xi, the maximum-likelihood CPT is obtained by counting:
  P(Xi = a | Pa_Xi = b) = Count(Xi = a, Pa_Xi = b) / Count(Pa_Xi = b).

Information-theoretic interpretation of maximum likelihood
- Given a structure G, the log-likelihood of the data decomposes over variables and samples:
  log P(D | θ, G) = Σ_{j=1..m} Σ_i log P(x_i^(j) | pa_Xi^(j)).
- At the maximum-likelihood parameters, the per-sample log-likelihood can be rewritten in terms of empirical mutual informations and entropies:
  (1/m) log-likelihood = Σ_i I(Xi; Pa_Xi) - Σ_i H(Xi).
- The entropy term does not depend on the graph, so maximizing likelihood means choosing parent sets with high mutual information with each variable.

Decomposable score
- The log data likelihood is a decomposable score: it decomposes over the families of the BN (each node and its parents), which leads to significant computational efficiency:
  Score(G : D) = Σ_i FamScore(Xi | Pa_Xi : D).

How many trees are there?
- Exponentially many: there are n^(n-2) labeled trees on n nodes (Cayley's formula).
- Nonetheless, an efficient algorithm finds the optimal tree.

Scoring a tree
- Equivalent trees: directed trees with the same undirected skeleton have the same likelihood score, so it suffices to score undirected trees.
- Similar trees: for a tree, the per-sample score is Σ_{edges (i,j)} I(Xi; Xj) - Σ_i H(Xi); the entropy term is the same for every structure, so the best tree is the one whose edges carry the largest total mutual information.

Chow-Liu tree learning algorithm
- For each pair of variables Xi, Xj: compute the empirical distribution P(xi, xj) and the mutual information
  I(Xi; Xj) = Σ_{xi, xj} P(xi, xj) log [ P(xi, xj) / (P(xi) P(xj)) ].
- Define a complete graph with nodes X1, ..., Xn, where edge (i, j) gets weight I(Xi; Xj).
- Optimal tree BN: compute the maximum-weight spanning tree of this graph.
- Directions in the BN: pick any node as the root; a breadth-first search from the root defines the edge directions.
- (A code sketch of this procedure appears at the end of these structure-learning notes.)

Can we extend Chow-Liu?
- Tree-augmented naïve Bayes (TAN) [Friedman et al. '97]: the naïve Bayes model over-counts evidence because correlations between features are not considered; TAN fixes this by learning a tree of dependencies among the features. (Your homework!)
- Approximately learning models with tree-width up to k [Chechetka & Guestrin '07], but at a cost of O(n^(2k+6)).

What you need to know about learning BN structures so far
- Decomposable scores: maximum likelihood and its information-theoretic interpretation.
- Best tree: Chow-Liu. Best TAN. Nearly-best bounded-treewidth models in O(n^(2k+6)).

Scoring general graphical models: model selection
- What is the best structure for data (x1^(1), ..., xn^(1)), ..., (x1^(m), ..., xn^(m))?
- The more edges, the fewer independence assumptions and the higher the likelihood of the data, but the more we overfit.

Maximum likelihood overfits!
- "Information never hurts": adding a parent never decreases the empirical mutual information, so it always increases (or leaves unchanged) the maximum-likelihood score.

Bayesian score avoids overfitting
- Given a structure, place a distribution over the parameters and score with the marginal likelihood; this integral is difficult, so use the Bayesian information criterion (BIC) approximation, which is equivalent as m → ∞:
  log P(D | G) ≈ (maximized log-likelihood) - (log m / 2) times the number of parameters of G.
- Note: the BIC penalty amounts to regularizing with an MDL-style score.
- Finding the best BN under BIC is still NP-hard.

Structure learning for general graphs
- In a tree, each node has at most one parent; for general graphs the problem is much harder.
- Theorem: learning the optimal BN structure with at most d parents per node is NP-hard for any fixed d > 1.
- Most structure-learning approaches therefore use heuristics that exploit score decomposition.
- Quickly: describe the simplest heuristic that exploits decomposition.
- (A sketch of BIC scoring, which such heuristics rely on, also appears below.)
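The Chow-Liu procedure can be written down compactly. The sketch below is illustrative rather than the lecture's reference code: it assumes the data arrive as an (m x n) NumPy array of discrete values, computes pairwise empirical mutual information, builds the maximum-weight spanning tree with Prim's algorithm (any spanning-tree routine would do), and orients edges away from an arbitrary root by breadth-first search. The function and variable names are this sketch's own.

```python
import numpy as np
from collections import defaultdict, deque

def empirical_mutual_information(xi, xj):
    """Empirical mutual information I(Xi; Xj) between two discrete columns."""
    mi = 0.0
    for a in np.unique(xi):
        for b in np.unique(xj):
            p_ab = np.mean((xi == a) & (xj == b))
            p_a, p_b = np.mean(xi == a), np.mean(xj == b)
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

def chow_liu_tree(data, root=0):
    """Return the directed edges (parent, child) of a Chow-Liu tree.

    data: (m, n) array of discrete values, one row per sample.
    """
    _, n = data.shape
    # Edge (i, j) gets weight I(Xi; Xj), the empirical mutual information.
    weight = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            weight[i, j] = weight[j, i] = empirical_mutual_information(data[:, i], data[:, j])
    # Maximum-weight spanning tree: Prim's algorithm on the complete graph.
    in_tree, neighbours = {root}, defaultdict(list)
    while len(in_tree) < n:
        i, j = max(((u, v) for u in in_tree for v in range(n) if v not in in_tree),
                   key=lambda e: weight[e])
        neighbours[i].append(j)
        neighbours[j].append(i)
        in_tree.add(j)
    # Pick any node as root; breadth-first search defines the edge directions.
    edges, visited, queue = [], {root}, deque([root])
    while queue:
        u = queue.popleft()
        for v in neighbours[u]:
            if v not in visited:
                edges.append((u, v))  # u becomes the parent of v
                visited.add(v)
                queue.append(v)
    return edges
```

For general structures, the BIC score described above decomposes over families, which is what local-search heuristics exploit: after an edge change, only the families whose parent sets changed need to be rescored. The sketch below, under the same data assumptions, computes a per-family BIC score (maximized log-likelihood minus (log m / 2) times the number of free parameters) and the total decomposable score; the {node: list of parents} dictionary representation and the function names are assumptions of this sketch.

```python
import numpy as np

def family_bic_score(data, child, parents):
    """BIC family score for one node: maximized log-likelihood of the child's
    CPT given its parents, minus (log m / 2) * (number of free parameters)."""
    m = len(data)
    child_vals = np.unique(data[:, child])
    # Group the samples by their joint parent configuration.
    configs = [tuple(row) for row in data[:, parents]] if parents else [()] * m
    groups = {}
    for row_index, config in enumerate(configs):
        groups.setdefault(config, []).append(row_index)
    loglik = 0.0
    for rows in groups.values():
        column = data[rows, child]
        for v in child_vals:
            count = int(np.sum(column == v))
            if count > 0:
                # Maximum-likelihood CPT entry: count / (samples with this configuration).
                loglik += count * np.log(count / len(column))
    n_params = len(groups) * (len(child_vals) - 1)
    return loglik - 0.5 * np.log(m) * n_params

def bic_score(data, parents_of):
    """Decomposable structure score: a sum of family scores.

    parents_of: {node index: list of parent indices} describing the graph.
    """
    return sum(family_bic_score(data, x, ps) for x, ps in parents_of.items())
```

A greedy hill-climbing search would start from the Chow-Liu tree, propose add/delete/reverse-edge moves, keep only acyclic candidates, and accept the move that most improves bic_score.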
Learn BN structure using local search
- Start from the Chow-Liu tree.
- Local search over possible moves: add an edge, delete an edge, invert (reverse) an edge.
- Score candidate structures using BIC.

What you need to know about learning BNs
- Maximum likelihood (or MAP) learns the parameters.
- Decomposable scores.
- Best tree: Chow-Liu. Best TAN.
- Other BNs: usually local search with the BIC score.

Unsupervised Learning: Clustering (K-means)
Machine Learning 10-701/15-781, Carlos Guestrin, Carnegie Mellon University, November 16th, 2009

Some data
- A scatter plot of unlabeled points (figure).

K-means
1. Ask the user how many clusters they'd like (e.g., k = 5).
2. Randomly guess k cluster-center locations.
3. Each datapoint finds out which center it is closest to (thus each center "owns" a set of datapoints).
4. Each center finds the centroid of the points it owns...
5. ...and jumps there.
6. Repeat until terminated.

More precisely: randomly initialize k centers μ_1^(0), ..., μ_k^(0), then iterate:
- Classify: assign each point j in {1, ..., m} to the nearest center, C(j) <- argmin_i ||μ_i - x_j||^2.
- Recenter: each μ_i becomes the centroid of its points, μ_i <- argmin_μ Σ_{j : C(j) = i} ||μ - x_j||^2, which is equivalent to setting μ_i to the average of the points assigned to cluster i.
(A minimal code sketch of this loop appears at the end of these notes.)

What is K-means optimizing?
- A potential function F(μ, C) of the centers μ and the point allocations C:
  F(μ, C) = Σ_{j=1..m} ||μ_{C(j)} - x_j||^2.
- Optimal K-means: min_μ min_C F(μ, C).

Does K-means converge? Part 1
- Optimize the potential with μ fixed: choosing C(j) to be the nearest center minimizes F over the allocations C, so the classify step never increases F.

Does K-means converge? Part 2
- Optimize the potential with C fixed: setting each μ_i to the mean of the points assigned to it minimizes F over the centers, so the recenter step never increases F either.

Coordinate descent algorithms
- Want min_a min_b F(a, b).
- Coordinate descent: fix a, minimize over b; fix b, minimize over a; repeat.
- Converges (if F is bounded) to an often-good local optimum, as we saw in the applet (play with it!).
- K-means is a coordinate descent algorithm.
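The K-means loop is short enough to write out directly. The following is a minimal sketch, not the course's reference implementation: it assumes X is an (m x d) NumPy array, initializes the centers at k randomly chosen data points (the slides only say to randomly guess k center locations, so this choice is an assumption), and alternates the two coordinate-descent steps until the centers stop moving.

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Minimal k-means: coordinate descent on F(mu, C) = sum_j ||x_j - mu_{C(j)}||^2."""
    rng = np.random.default_rng(seed)
    # Randomly guess k cluster-center locations (here: k distinct data points).
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iters):
        # Classify: each datapoint finds out which center it is closest to.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        # Recenter: each center jumps to the centroid of the points it owns
        # (empty clusters simply keep their previous center in this sketch).
        new_centers = np.array([X[assign == i].mean(axis=0) if np.any(assign == i)
                                else centers[i] for i in range(k)])
        if np.allclose(new_centers, centers):  # repeat until terminated
            break
        centers = new_centers
    potential = ((X - centers[assign]) ** 2).sum()  # value of F at termination
    return centers, assign, potential
```

Each iteration first minimizes F over the allocations C with the centers fixed, then over the centers with the allocations fixed; F never increases, so, as with any bounded coordinate descent, the procedure converges to an (often good) local optimum.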