EM for Bayes Nets
Machine Learning – 10701/15781
Carlos Guestrin, Carnegie Mellon University
April 16th, 2007
©2005-2007 Carlos Guestrin

Learning HMMs from fully observable data is easy

[Figure: an HMM with hidden states X1, …, X5, each taking values in {a, …, z}, and observations O1, …, O5.]

- Learn 3 distributions: the initial-state distribution P(X1), the transition model P(X_t+1 | X_t), and the observation model P(O_t | X_t)
- What if O is observed, but X is hidden?

Log likelihood for HMMs when X is hidden

- Marginal likelihood – O is observed, X is missing
- For simplicity of notation, training data consists of only one sequence:

    l(θ) = log P(o | θ) = log Σ_x P(x, o | θ)

- If there were m sequences:

    l(θ) = Σ_{j=1..m} log Σ_x P(x, o^(j) | θ)

Computing the log likelihood for HMMs when X is hidden

- The sum over hidden sequences x decomposes along the chain, so P(o | θ) is computed with the forward algorithm – standard HMM inference

The M-step

- Maximization step:

    θ^(t+1) = argmax_θ Σ_x Q^(t+1)(x | o) log P(x, o | θ)

- Use expected counts instead of counts:
  - If learning requires Count(x, o)
  - Use E_{Q^(t+1)}[Count(x, o)]

E-step revisited

- E-step computes the probability of the hidden vars x given o
- Must compute:
  - Q(x_t = a | o) – marginal probability of each position – just forwards-backwards!
  - Q(x_t+1 = a, x_t = b | o) – joint distribution over adjacent pairs of positions

Exploiting unlabeled data in clustering

- A few data points are labeled <x, o>
- Most points are unlabeled <?, o>
- In the E-step of EM:
  - If the i'th point is unlabeled: compute Q(X | o_i) as usual
  - If the i'th point is labeled: set Q(X = x | o_i) = 1 and Q(X ≠ x | o_i) = 0
- M-step as usual

20 Newsgroups data – advantage of adding unlabeled data

[Figure: experimental results on the 20 Newsgroups data showing the benefit of adding unlabeled data.]

Data likelihood for BNs

[Figure: a Bayes net with nodes Flu, Allergy, Sinus, Headache, Nose – the running example below.]

- Given structure, the log likelihood of fully observed data:

    log P(D | θ) = Σ_j Σ_i log P(x_i^(j) | pa_{x_i}^(j), θ)

Marginal likelihood

- What if S (Sinus) is hidden?

Log likelihood for BNs with hidden data

- Marginal likelihood – O is observed, H is hidden:

    l(θ) = Σ_j log Σ_h P(h, o^(j) | θ)

E-step for BNs

- E-step computes the probability of the hidden vars h given o:

    Q^(t+1)(h | o) = P(h | o, θ^(t))

- Corresponds to inference in the BN

The M-step for BNs

- Maximization step: use expected counts instead of counts:
  - If learning requires Count(h, o)
  - Use E_{Q^(t+1)}[Count(h, o)]

M-step for each CPT

- The M-step decomposes per CPT
- Standard MLE:

    P(x_i | pa_i) = Count(x_i, pa_i) / Count(pa_i)

- M-step uses expected counts instead:

    P(x_i | pa_i) = ExCount(x_i, pa_i) / ExCount(pa_i)

Computing expected counts

- The M-step requires expected counts:
  - For a set of vars A, must compute ExCount(A = a)
  - Some of A in example j will be observed: denote by A_O = a_O(j)
  - Some of A will be hidden: denote by A_H
- Use inference (the E-step computes the expected counts), accumulating over examples j:

    ExCount^(t+1)(A_O = a_O(j), A_H = a_H) += P(A_H = a_H | A_O = a_O(j), θ^(t))

Data need not be hidden in the same way

- When data is fully observed, a data point is an assignment to all the variables
- When data is partially observed, a data point is an assignment to the observed variables only
- But the unobserved variables can differ across data points
  - e.g., Sinus may be hidden in one record while Headache is missing in another
- Same framework, just change the definition of expected counts:

    ExCount^(t+1)(A_O = a_O(j), A_H = a_H) += P(A_H = a_H | A_O = a_O(j), θ^(t))

What you need to know

- EM for Bayes nets
- E-step: inference computes expected counts
  - Only need expected counts over X_i and Pa_{X_i}
- M-step: expected counts used to estimate parameters
- Hidden variables can change per datapoint
- Use labeled and unlabeled data → some data points are complete, some include hidden variables
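The E-step and M-step formulas above are enough to implement the whole algorithm. Below is a minimal NumPy sketch of EM for a discrete HMM, including the clamped-Q trick from the "exploiting unlabeled data" slide; the function names, the scaled forwards-backwards recursion, and the random Dirichlet initialization are my own choices, not anything prescribed by the lecture.

```python
import numpy as np

def forward_backward(pi, A, B, obs):
    """E-step for one sequence: Q(x_t | o), Q(x_t, x_{t+1} | o), log P(o).

    pi : (S,)   initial distribution P(X_1)
    A  : (S,S)  transitions, A[i,j] = P(X_{t+1}=j | X_t=i)
    B  : (S,V)  emissions,   B[i,k] = P(O_t=k | X_t=i)
    obs: (T,)   integer-coded observations
    """
    T, S = len(obs), len(pi)
    alpha, beta, c = np.zeros((T, S)), np.ones((T, S)), np.zeros(T)
    alpha[0] = pi * B[:, obs[0]]
    c[0] = alpha[0].sum(); alpha[0] /= c[0]              # scale to avoid underflow
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        c[t] = alpha[t].sum(); alpha[t] /= c[t]
    for t in range(T - 2, -1, -1):                        # backwards pass, same scaling
        beta[t] = (A @ (B[:, obs[t + 1]] * beta[t + 1])) / c[t + 1]
    gamma = alpha * beta                                  # Q(x_t = i | o)
    xi = (alpha[:-1, :, None] * A[None, :, :]             # Q(x_t = i, x_{t+1} = j | o)
          * (B[:, obs[1:]].T * beta[1:])[:, None, :] / c[1:, None, None])
    return gamma, xi, np.log(c).sum()                     # log-likelihood log P(o | theta)

def em_hmm(seqs, S, V, n_iter=50, labels=None, seed=0):
    """EM for a discrete HMM. `labels` optionally maps a sequence index to its
    known state sequence; those points get Q clamped to a point mass."""
    rng = np.random.default_rng(seed)
    pi = rng.dirichlet(np.ones(S))                        # random initialization
    A = rng.dirichlet(np.ones(S), size=S)
    B = rng.dirichlet(np.ones(V), size=S)
    for _ in range(n_iter):
        pi_c, A_c, B_c = np.zeros(S), np.zeros((S, S)), np.zeros((S, V))
        for j, obs in enumerate(seqs):
            if labels and j in labels:                    # labeled point: Q is an indicator
                gamma = np.eye(S)[labels[j]]
                xi = gamma[:-1, :, None] * gamma[1:, None, :]
            else:                                         # unlabeled: run forwards-backwards
                gamma, xi, _ = forward_backward(pi, A, B, obs)
            pi_c += gamma[0]                              # expected counts, not counts
            A_c += xi.sum(axis=0)
            np.add.at(B_c.T, obs, gamma)                  # B_c[:, obs[t]] += gamma[t], all t
        # M-step: renormalize expected counts into CPTs (no smoothing, for clarity)
        pi = pi_c / pi_c.sum()
        A = A_c / A_c.sum(axis=1, keepdims=True)
        B = B_c / B_c.sum(axis=1, keepdims=True)
    return pi, A, B
```

With a few sequences in `labels` and the rest unlabeled, this is exactly the semi-supervised scheme from the clustering slide: point-mass Q for labeled points, forwards-backwards posteriors for everything else.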
Announcements

- No recitation this week
- On Wednesday: special lecture on learning with text data by Prof. Noah Smith (LTI)

Co-Training for Semi-supervised Learning
Machine Learning – 10701/15781
Carlos Guestrin, Carnegie Mellon University
April 16th, 2007
©2005-2007 Carlos Guestrin

Redundant information

[Figure slides: the same web page described two ways – "Redundant information – webpage text" and "Redundant information – anchor text for hyperlinks" – motivating two independent views of each example.]

Exploiting redundant information in semi-supervised learning

- Want to predict Y from features X: f(X) → Y
- Have some labeled data L
- Lots of unlabeled data U
- Co-training assumption: X is very expressive
  - X = (X1, X2)
  - Can learn g1(X1) → Y and g2(X2) → Y

Co-Training

- Key idea: Classifier_1 and Classifier_2 must:
  - Correctly classify the labeled data
  - Agree on the unlabeled data

Co-Training Algorithm [Blum & Mitchell ’99]

[The algorithm is a figure in the deck: train one classifier per view on L, let each classifier label the unlabeled examples it is most confident about, add those to L, and repeat. A code sketch follows at the end of these notes.]

Co-Training experimental results

[Figure: experimental results for co-training.]

Co-Training theory

- Want to predict Y from features X: f(X) → Y
- Co-training assumption: X is very expressive
  - X = (X1, X2)
  - Want to learn g1(X1) → Y and g2(X2) → Y
- Assumption: ∃ g1, g2 such that ∀ x: g1(x1) = f(x) and g2(x2) = f(x)
- Questions:
  - Does unlabeled data always help?
  - How many labeled examples do I need?
  - How many unlabeled examples do I need?

Understanding Co-Training: a simple setting

- Suppose X1 and X2 are discrete
  - |X1| = |X2| = N
- No label noise
- Without unlabeled data, how hard is it to learn g1 (or g2)?

Co-Training in the simple setting

[Figure slides: the L+U graph at iteration 0, at iteration 1, and after convergence.]

Co-Training in the simple setting – connected components

- Suppose infinite unlabeled data
- Co-training must have at least one labeled example in each connected component of the L+U graph
- What's the probability of making an error?
- For k connected components, how much labeled data? (a worked version of these questions appears at the end of these notes)

How much unlabeled data?

[The deck closes with this question, followed by a recap of the co-training theory slide.]
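Since the algorithm slide is only a figure in the deck, here is a sketch in the spirit of Blum & Mitchell '99: each round, each view's classifier promotes its most confident pool examples to labeled data. The defaults (p = 1, n = 3, a pool of 75, 30 rounds) follow the paper's setup; the `co_train` name, the GaussianNB stand-in for the paper's naive-Bayes text classifiers, the "-1 means unlabeled" convention, and the assumption that the initial labeled set contains both classes are all mine.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB   # stand-in for the paper's naive-Bayes text model

def co_train(X1, X2, y, n_iter=30, p=1, n=3, pool_size=75, seed=0):
    """Co-training in the style of Blum & Mitchell '99.

    X1, X2 : the two redundant views of the data, row-aligned
    y      : labels in {0, 1}; -1 marks unlabeled rows
    Each round, each view's classifier promotes its p most confident positives
    and n most confident negatives from a small pool of unlabeled examples.
    """
    rng = np.random.default_rng(seed)
    y = y.copy()
    unlabeled = list(rng.permutation(np.flatnonzero(y == -1)))
    pool = [unlabeled.pop() for _ in range(min(pool_size, len(unlabeled)))]
    for _ in range(n_iter):
        idx = np.flatnonzero(y != -1)
        h1 = GaussianNB().fit(X1[idx], y[idx])            # each hypothesis must fit L ...
        h2 = GaussianNB().fit(X2[idx], y[idx])
        if not pool:
            break
        for h, X in ((h1, X1), (h2, X2)):                 # ... and label U for the other view
            proba = h.predict_proba(X[pool])
            for cls, count in ((1, p), (0, n)):
                col = list(h.classes_).index(cls)
                for i in np.argsort(-proba[:, col])[:count]:
                    if y[pool[i]] == -1:                  # most confident guesses become labels
                        y[pool[i]] = cls
        pool = [i for i in pool if y[i] == -1]
        while unlabeled and len(pool) < pool_size:        # replenish the pool from U
            pool.append(unlabeled.pop())
    return h1, h2, y
```

The two classifiers never see each other's view; they influence each other only through the shared pseudo-labels, which is how the algorithm enforces "agree on the unlabeled data" from the key-idea slide.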
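As a worked version of the connected-components questions (my own derivation of the standard argument, not reproduced from the deck): suppose the L+U graph has k connected components with probability masses w_1, …, w_k. Because g1 and g2 are consistent within a component, one labeled example determines f on its entire component, so after m i.i.d. labeled examples the error probability is at most the chance a test point falls in a still-unseen component:

```latex
% error after m labeled examples, components with masses w_1, ..., w_k
P(\text{error}) \;\le\; \sum_{i=1}^{k} w_i \,(1 - w_i)^m
% with equal masses w_i = 1/k this is
\Big(1 - \tfrac{1}{k}\Big)^m \approx e^{-m/k} \;\le\; \epsilon
\quad\Longrightarrow\quad m \;\ge\; k \ln \tfrac{1}{\epsilon}
```

So the labeled-sample cost scales with the number of components k, not with N, and covering every component at least once is a coupon-collector problem: about k ln k labeled examples in expectation. That also answers the last slide's question about unlabeled data: its job is to merge the up-to-N isolated values into those k components, which is why more unlabeled data reduces the labeled data needed.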