EM for Bayes Nets
Machine Learning – 10701/15781
Carlos Guestrin
Carnegie Mellon University
April 16th, 2007
©2005-2007 Carlos Guestrin

Learning HMMs from fully observable data is easy
[HMM diagram: hidden states X1, ..., X5, each over {a, ..., z}, with Xt emitting observation Ot]
- Learn 3 distributions: the initial distribution P(X1), the transition model P(Xt+1 | Xt), and the observation model P(Ot | Xt)
- What if O is observed, but X is hidden?

Log likelihood for HMMs when X is hidden
- Marginal likelihood – O is observed, X is missing:
  log P(o | θ) = log Σ_x P(x, o | θ)
- For simplicity of notation, training data consists of only one sequence
- If there were m sequences: ℓ(θ) = Σ_j log Σ_x P(x, o(j) | θ)

Computing log likelihood for HMMs when X is hidden
- The joint factors over the chain: P(x, o | θ) = P(x1) Π_t P(xt+1 | xt) Π_t P(ot | xt), and the marginal likelihood sums this product over all hidden state sequences x

The M-step
- Maximization step: maximize the expected complete-data log likelihood, Σ_x Q(t+1)(x | o) log P(x, o | θ)
- Use expected counts instead of counts: if learning requires Count(x, o), use E_Q(t+1)[Count(x, o)]

E-step revisited
- E-step computes probability of hidden vars x given o
- Must compute Q(xt = a | o) – marginal probability of each position – just forwards-backwards!
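The "just forwards-backwards" step can be sketched as follows. This is a minimal NumPy version for a discrete HMM; the function name and the tabular (pi, A, B) parameterization are illustrative assumptions, not notation from the slides:

```python
import numpy as np

def forward_backward(pi, A, B, obs):
    """Compute the E-step marginals Q(x_t = a | o) for a discrete HMM.

    pi:  (S,)  initial distribution P(X_1)
    A:   (S,S) transitions, A[i, j] = P(X_{t+1} = j | X_t = i)
    B:   (S,V) emissions,   B[i, k] = P(O_t = k | X_t = i)
    obs: length-T sequence of observation indices
    """
    T, S = len(obs), len(pi)
    alpha = np.zeros((T, S))  # alpha[t, i] = P(o_1..o_t, X_t = i)
    beta = np.zeros((T, S))   # beta[t, i]  = P(o_{t+1}..o_T | X_t = i)
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):                      # forwards pass
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):             # backwards pass
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    gamma = alpha * beta                       # proportional to P(X_t = i, o)
    return gamma / gamma.sum(axis=1, keepdims=True)
```

The pairwise marginals Q(xt+1 = a, xt = b | o) needed for the transition counts come from the same alpha and beta quantities.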
- Must also compute Q(xt+1 = a, xt = b | o) – the joint distribution between pairs of adjacent positions

Exploiting unlabeled data in clustering
- A few data points are labeled <x, o>; most points are unlabeled <?, o>
- In the E-step of EM:
  - if the i'th point is unlabeled, compute Q(X | oi) as usual
  - if the i'th point is labeled, set Q(X = x | oi) = 1 and Q(X ≠ x | oi) = 0
- M-step as usual

20 Newsgroups data – advantage of adding unlabeled data
[Figure: classification accuracy on 20 Newsgroups as unlabeled data is added]

Data likelihood for BNs
[BN diagram: Flu and Allergy are parents of Sinus; Sinus is a parent of Headache and Nose]
- Given structure, log likelihood of fully observed data:
  log P(D | θ) = Σ_j Σ_i log P(xi(j) | pa_xi(j), θ)

Marginal likelihood
- What if S (Sinus) is hidden?

Log likelihood for BNs with hidden data
- Marginal likelihood – O is observed, H is hidden:
  ℓ(θ; D) = Σ_j log Σ_h P(h, o(j) | θ)

E-step for BNs
- E-step computes probability of hidden vars h given o: Q(t+1)(h) = P(h | o, θ(t))
- Corresponds to inference in the BN

The M-step for BNs
- Maximization step: use expected counts instead of counts
- If learning requires Count(h, o), use E_Q(t+1)[Count(h, o)]

M-step for each CPT
- M-step decomposes per CPT
- Standard MLE: P(xi | pa_xi) = Count(xi, pa_xi) / Count(pa_xi)
- M-step uses expected counts: P(xi | pa_xi) = ExCount(xi, pa_xi) / ExCount(pa_xi)

Computing expected counts
- M-step requires expected counts: for a set of vars A, must compute ExCount(A = a)
- Some of A in example j will be observed: denote them by AO = aO(j)
- Some of A will be hidden: denote them by AH
- Use inference (the E-step computes expected counts):
  ExCount(t+1)(AO = aO(j), AH = aH) ← P(AH = aH, AO = aO(j) | θ(t))

Data need not be hidden in the same way
- When data is fully observed, a data point gives a value for every variable
- When data is partially observed, a data point gives values only for its observed variables
- But unobserved variables can be different for different data points (e.g., a different subset of symptoms is recorded for each patient)
- Same framework, just change
the definition of expected counts:
  ExCount(t+1)(AO = aO(j), AH = aH) ← P(AH = aH, AO = aO(j) | θ(t))

What you need to know
- EM for Bayes nets
- E-step: inference computes expected counts; only need expected counts over Xi and Pa_Xi
- M-step: expected counts used to estimate parameters
- Hidden variables can change per datapoint
- Use labeled and unlabeled data – some datapoints are complete, some include hidden variables

Co-Training for Semi-supervised Learning
Machine Learning – 10701/15781
Carlos Guestrin
Carnegie Mellon University
April 16th, 2007

Redundant information
[Figure]

Redundant information – webpage text
[Figure]

Redundant information – anchor text for hyperlinks
[Figure]

Exploiting redundant information in semi-supervised learning
- Want to predict Y from features X: f(X) → Y
- Have some labeled data L and lots of unlabeled data U
- Co-training assumption: X is very expressive, X = (X1, X2), and we can learn g1(X1) → Y and g2(X2) → Y

Co-Training
- Key idea: Classifier1 and Classifier2 must
  - correctly classify the labeled data
  - agree on the unlabeled data

Co-Training Algorithm [Blum & Mitchell '99]

Co-Training experimental results
[Figure]

Co-Training theory
- Want to predict Y from features X: f(X) → Y
- Co-training assumption: X is very expressive, X = (X1, X2); want to learn g1(X1) → Y and g2(X2) → Y
- Assumption: ∃ g1, g2 such that ∀x: g1(x1) = f(x) and g2(x2) = f(x)
- Questions:
  - Does unlabeled data always help?
  - How many labeled examples do I need?
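The Blum & Mitchell-style loop can be sketched as a toy in the fully discrete setting, where each view's "classifier" is a lookup table and it is confident exactly on feature values it has already seen. The function name and this confidence rule are simplifying assumptions for illustration, not the published algorithm's confidence measure:

```python
def co_train(labeled, unlabeled, rounds=10):
    """A minimal co-training loop over two discrete views.

    labeled:   list of ((x1, x2), y) pairs
    unlabeled: list of (x1, x2) pairs
    Returns a dict mapping every example reached by label propagation to its label.
    """
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(rounds):
        # Train g1 and g2 on the current labeled pool (one lookup table per view)
        g1 = {x1: y for (x1, _), y in labeled}
        g2 = {x2: y for (_, x2), y in labeled}
        # Each classifier labels the unlabeled examples it is confident about
        newly, rest = [], []
        for x1, x2 in unlabeled:
            if x1 in g1:
                newly.append(((x1, x2), g1[x1]))
            elif x2 in g2:
                newly.append(((x1, x2), g2[x2]))
            else:
                rest.append((x1, x2))
        if not newly:          # neither classifier is confident on anything new
            break
        labeled += newly       # pseudo-labels feed the next round's training
        unlabeled = rest
    return dict(((x1, x2), y) for (x1, x2), y in labeled)
```

Starting from a single labeled example, labels propagate through unlabeled examples that share a view value with something already labeled, which is exactly the agreement idea above.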
  - How many unlabeled examples do I need?

Understanding Co-Training: a simple setting
- Suppose X1 and X2 are discrete, with |X1| = |X2| = N
- No label noise
- Without unlabeled data, how hard is it to learn g1 (or g2)?

Co-Training in simple setting – iteration 0
[Figure]

Co-Training in simple setting – iteration 1
[Figure]

Co-Training in simple setting – after convergence
[Figure]

Co-Training in simple setting – connected components
- Suppose infinite unlabeled data
- Co-training must have at least one labeled example in each connected component of the L+U graph
- What's the probability of making an error?
- For k connected components, how much labeled data?

How much unlabeled data?

Co-Training theory
- Want to predict Y from features X: f(X) → Y
- Co-training assumption: X is very expressive, X = (X1, X2); want to learn g1(X1) → Y and g2(X2) → Y
- Assumption: ∃ g1, g2 such that ∀x: g1(x1) = f(x) and g2(x2) = f(x)
- One co-training result [Blum & Mitchell '99]: if (X1 ⊥ X2 | Y), and g1 & g2 are PAC learnable from noisy data, then f is PAC learnable as well
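The connected-components argument can be made concrete with a short union-find sketch. The graph construction here (linking the two view values of each example) is my rendering of the L+U graph, and the function name is hypothetical:

```python
def num_components(examples):
    """Count connected components of the co-occurrence graph over the two views.

    examples: list of (x1, x2) pairs (labeled + unlabeled; labels are irrelevant).
    Two examples are connected if they share an x1 value or an x2 value, so with
    enough unlabeled data co-training needs at least one labeled example per
    component: k components => at least k labeled examples.
    """
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v

    def union(a, b):
        parent[find(a)] = find(b)

    for x1, x2 in examples:
        # An example links its view-1 value to its view-2 value
        union(("v1", x1), ("v2", x2))
    return len({find(v) for v in parent})
```

For instance, examples {(a,p), (a,q), (b,q)} all fall in one component (labels propagate among them), while (c,r) forms a second component that needs its own labeled example.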