Learning from Labeled and Unlabeled Data
Machine Learning 10-701, November 7, 2006
Tom M. Mitchell, Machine Learning Department, Carnegie Mellon University

When can Unlabeled Data improve supervised learning?

Important question! In many cases unlabeled data is plentiful while labeled data is expensive:
- Medical outcomes (x = symptoms, treatment; y = outcome)
- Text classification (x = document; y = relevance)
- Customer modeling (x = user actions; y = user intent)
- Sensor interpretation (x = video, audio; y = who's there)

When can Unlabeled Data help supervised learning?

Problem setting:
- Set X of instances drawn from unknown distribution P(X)
- Wish to learn target function f: X → Y (or P(Y|X))
- Given a set H of possible hypotheses for f
- Given i.i.d. labeled examples L = {⟨x_i, y_i⟩} and i.i.d. unlabeled examples U = {x_j}
- Wish to determine: the hypothesis in H that best approximates f

Idea 1: Use Labeled and Unlabeled Data to Train a Bayes Net for P(X, Y), then infer P(Y|X)
- What CPDs are needed?
- How do we estimate them from fully observed data?
- How do we estimate them from partly observed data?
[Figure: naive Bayes network with class variable Y and features X1, X2, X3, X4, plus a training-data table in which the Y row is filled in for only some examples]

Supervised Naïve Bayes Learner

Train: for each class y_j of documents,
1. Estimate P(Y = y_j)
2. For each word w_i, estimate P(X = w_i | Y = y_j)

Classify (doc): assign doc to the most probable class, assuming the words w_i are conditionally independent given the class.

What if we have labels for only some documents? [Nigam et al., 2000]

Learn P(Y|X) with the same naive Bayes network, treating Y as unobserved for the unlabeled documents.

EM: repeat until convergence:
1. Use the probabilistic labels to train classifier h
2. Apply h to assign probabilistic labels to the unlabeled data

E-step: use the current model to compute P(y_j | d_i) for each unlabeled document d_i. M-step: re-estimate P(Y) and P(w_t | Y) from the labeled data plus these expected counts, where w_t is the t-th word in the vocabulary. (A code sketch of this loop follows below.)

[Figure: using one labeled example per class — words sorted by P(w | Y = "course"), and classification accuracy on 20 Newsgroups as unlabeled documents are added]

Why? When will this work? What is the best case? The worst case? How can we test which one we have?

Summary: Semi-supervised Learning with EM and the Naïve Bayes Model
- If all data is labeled, this corresponds to supervised training of a Naïve Bayes classifier.
- If all data is unlabeled, it corresponds to unsupervised mixture-of-multinomials clustering.
- If we have both labeled and unlabeled data, the unlabeled data helps if the Bayes net model is correct (e.g., P(X) is a mixture of class-conditional multinomials with conditionally independent X_i).
- Of course, we could use Bayes net models other than Naïve Bayes.
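The sketch below makes the EM loop concrete for a multinomial Naïve Bayes model, in the spirit of Nigam et al. (2000). It is illustrative rather than the lecture's code: the function names, the Laplace smoothing constant alpha, and the fixed iteration count are choices made here.

```python
# Minimal sketch of semi-supervised multinomial Naive Bayes with EM
# (Nigam et al., 2000 style). Illustrative assumptions: bag-of-words count
# matrices, Laplace smoothing, fixed number of EM iterations.
import numpy as np

def m_step(counts, class_probs, alpha=1.0):
    """Estimate P(Y) and P(w|Y) from (possibly fractional) class memberships.
    counts:      (n_docs, n_words) bag-of-words count matrix
    class_probs: (n_docs, n_classes) label distributions, rows sum to 1
    """
    prior = class_probs.sum(axis=0) + alpha
    prior /= prior.sum()                                         # P(Y = y_j)
    word_counts = class_probs.T @ counts + alpha                 # expected count of w_t in class y_j
    cond = word_counts / word_counts.sum(axis=1, keepdims=True)  # P(w_t | Y = y_j)
    return prior, cond

def e_step(counts, prior, cond):
    """Compute probabilistic labels P(Y | d_i) under the current model."""
    log_post = np.log(prior) + counts @ np.log(cond).T           # log P(y) + sum_t n_t log P(w_t|y)
    log_post -= log_post.max(axis=1, keepdims=True)              # numerical stability
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)

def semisupervised_nb(counts_l, labels_l, counts_u, n_classes, n_iters=20):
    """EM loop: labeled docs keep their labels; unlabeled docs get probabilistic labels."""
    hard = np.eye(n_classes)[labels_l]               # one-hot labels for L
    prior, cond = m_step(counts_l, hard)             # initialize from labeled data only
    for _ in range(n_iters):
        soft = e_step(counts_u, prior, cond)         # 2. h assigns probabilistic labels to U
        prior, cond = m_step(np.vstack([counts_l, counts_u]),
                             np.vstack([hard, soft]))  # 1. retrain h on L plus soft-labeled U
    return prior, cond
```

To classify a new document, call e_step on its count vector (as a 1 × n_words matrix) and take the most probable class.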
Idea 2: Use U to reweight labeled examples
- Most learning algorithms minimize errors over the labeled examples.
- But we really want to minimize error over future examples drawn from the same underlying distribution, i.e., the true error of the hypothesis.
- If we knew the underlying distribution P(X), we could weight each labeled training example ⟨x, y⟩ by its probability P(X = x).
- Unlabeled data allows us to estimate P(X).

Idea 2: Use U to reweight the labeled examples L, i.e., to alter the loss function.

Wish to minimize the true error
  error_true(h) = Σ_x P(x) δ(h(x) ≠ f(x)),
where δ(h(x) ≠ f(x)) is 1 if hypothesis h disagrees with the true function f on x, else 0.

We usually approximate this as
  (1/|L|) Σ_{⟨x,y⟩ ∈ L} δ(h(x) ≠ y),
which equals
  Σ_x (n(x, L)/|L|) δ(h(x) ≠ f(x)),
where n(x, L) is the number of times x occurs in L. We can produce a better approximation by incorporating U, replacing n(x, L)/|L| with n(x, L ∪ U)/|L ∪ U|.

Reweighting Labeled Examples
- Wish to find argmin_{h ∈ H} Σ_x P(x) δ(h(x) ≠ f(x)).
- We already have an algorithm (e.g., a decision tree learner) to find argmin_{h ∈ H} Σ_{⟨x,y⟩ ∈ L} δ(h(x) ≠ y).
- Just reweight the examples in L and have the algorithm minimize Σ_{⟨x,y⟩ ∈ L} (n(x, L ∪ U)/|L ∪ U|) δ(h(x) ≠ y).
- Or, if X is continuous, use L ∪ U to estimate the density p̂(X) and minimize Σ_{⟨x,y⟩ ∈ L} p̂(x) δ(h(x) ≠ y). (A short code sketch of this appears at the end of these notes.)

Idea 3: Co-Training

In some settings the available features are redundant, and we can train two classifiers based on disjoint features. In this case the two classifiers should agree on the classification of each unlabeled example. Therefore, we can use the unlabeled data to constrain joint training of both classifiers.

Redundantly Sufficient Features: for example, a faculty home page can be recognized either from the words on the page itself ("Professor Faloutsos ...") or from the words on hyperlinks pointing to it ("my advisor"); either view is sufficient to make the classification.

Co-Training Algorithm #1 [Blum & Mitchell, 1998]

Given: labeled data L, unlabeled data U.
Loop:
  Train g1 (hyperlink classifier) using L
  Train g2 (page classifier) using L
  Allow g1 to label p positive and n negative examples from U
  Allow g2 to label p positive and n negative examples from U
  Add these self-labeled examples to L

Co-Training Experimental Results
- Begin with 12 labeled web pages (academic course pages).
- Provide 1,000 additional unlabeled web pages.
- Average error learning from the labeled data alone: 11.1%.
- Average error with co-training: 5.0%.
[Figure: typical run]

Co-Training setting:
- Wish to learn f: X → Y, given L and U drawn from P(X).
- The features describing X can be partitioned X = X1 × X2 such that f can be computed from either X1 or X2.

One result [Blum & Mitchell, 1998]: if
- X1 and X2 are conditionally independent given Y, and
- f is PAC learnable from noisy labeled data,
then f is PAC learnable from a weak initial classifier plus unlabeled data.

Co-Training Rote Learner
[Figure: rote co-training — bipartite graph linking page values and hyperlink values ("my advisor"); labels propagate within connected components]

Expected Rote Co-Training error given m examples

Co-training setting: learn f: X → Y, where X = X1 × X2, x is drawn from an unknown distribution, and g1(x1) = g2(x2) = f(x). Then

  E[error] = Σ_j P(x ∈ g_j) (1 − P(x ∈ g_j))^m,

where g_j is the j-th connected component of the graph over L ∪ U and m is the number of labeled examples.

How many unlabeled examples suffice?
- Want to assure that connected components in the underlying distribution G_D are also connected components in the observed sample G_S.
- O(log N) examples (with a constant depending on the minimum cut over all connected components of G_D) assure that, with high probability, G_S has the same connected components as G_D [Karger, 94], where N is the size of G_D.

PAC Generalization Bounds on Co-Training [Dasgupta et al., NIPS 2001]
This theorem assumes X1 and X2 are conditionally independent given Y.

What if the Co-Training assumption is not perfectly satisfied?
- Idea: want classifiers that produce a maximally consistent labeling of the data.
- If learning is an optimization problem, what function should we optimize?

Example 2: learning to extract named entities.
  "I arrived in Beijing on Saturday."
  If "I arrived in X on Saturday", then Location(X).

Co-Training for Named Entity Extraction (i.e., classifying which strings refer to people, places, dates, etc.) [Riloff & Jones 98; Collins et al. 98; Jones 05]
[Figure: Classifier1 examines the string itself ("Beijing"), Classifier2 examines its context ("I arrived in ___ saturday"); Answer1 and Answer2 should agree on "I arrived in Beijing saturday"]

Bootstrap learning to extract named entities [Riloff and Jones, 1999; Collins and Singer, 1999]
Initialization (seed locations): Australia, Canada, China, England, France, Germany
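As a concrete illustration of the Blum & Mitchell loop above (and of the bootstrapping style used for named-entity extraction), here is a minimal sketch. It assumes two precomputed count-vector views X1 and X2 (e.g., hyperlink words and page words) and binary labels {0, 1}; the choice of Naïve Bayes as the base classifier, the growth sizes p and n, and the number of rounds are illustrative, not prescribed by the lecture.

```python
# Minimal sketch of Blum & Mitchell (1998) co-training with two feature views.
# Illustrative assumptions: count-vector views, binary labels, MultinomialNB.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def cotrain(X1_l, X2_l, y_l, X1_u, X2_u, p=1, n=3, n_rounds=30):
    """Grow L by letting each view's classifier self-label its most confident
    p positives and n negatives from U each round."""
    X1_l, X2_l, y_l = np.asarray(X1_l), np.asarray(X2_l), np.asarray(y_l)
    X1_u, X2_u = np.asarray(X1_u), np.asarray(X2_u)
    unlabeled = np.arange(len(X1_u))                # indices of U not yet self-labeled
    for _ in range(n_rounds):
        if len(unlabeled) == 0:
            break
        g1 = MultinomialNB().fit(X1_l, y_l)         # view-1 (e.g., hyperlink) classifier
        g2 = MultinomialNB().fit(X2_l, y_l)         # view-2 (e.g., page) classifier
        new_idx, new_y = [], []
        for g, X_view in ((g1, X1_u), (g2, X2_u)):
            probs = g.predict_proba(X_view[unlabeled])
            for cls, k in ((1, p), (0, n)):         # p positives and n negatives per view
                col = list(g.classes_).index(cls)
                top = np.argsort(-probs[:, col])[:k]    # most confident picks for this class
                new_idx.extend(unlabeled[top])
                new_y.extend([cls] * len(top))
        # add self-labeled examples to L (in both views) and remove them from U;
        # an example picked by both views is simply added twice in this sketch
        X1_l = np.vstack([X1_l, X1_u[new_idx]])
        X2_l = np.vstack([X2_l, X2_u[new_idx]])
        y_l = np.concatenate([y_l, new_y])
        unlabeled = np.setdiff1d(unlabeled, new_idx)
    return MultinomialNB().fit(X1_l, y_l), MultinomialNB().fit(X2_l, y_l)
```

At prediction time the two view classifiers can be combined, for example by multiplying their class posteriors.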
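Finally, returning to Idea 2: a minimal sketch of reweighting the labeled examples using a density estimate built from L ∪ U, for continuous X. The kernel density estimator, its bandwidth, and the decision tree learner are illustrative choices; any learner that accepts per-example weights could be used.

```python
# Minimal sketch of Idea 2: estimate p(x) from L and U, then reweight the
# labeled examples so the learner approximately minimizes true error rather
# than raw training error. Estimator and learner choices are illustrative.
import numpy as np
from sklearn.neighbors import KernelDensity
from sklearn.tree import DecisionTreeClassifier

def reweighted_fit(X_l, y_l, X_u, bandwidth=0.5):
    # estimate p(x) from all available instances, labeled and unlabeled
    kde = KernelDensity(bandwidth=bandwidth).fit(np.vstack([X_l, X_u]))
    weights = np.exp(kde.score_samples(X_l))     # estimated p(x) for each labeled example
    weights /= weights.sum()                     # normalize so the weights sum to 1
    # any learner that accepts per-example weights can minimize the reweighted loss
    return DecisionTreeClassifier().fit(X_l, y_l, sample_weight=weights)
```

For discrete X, one would instead weight each labeled x by n(x, L ∪ U)/|L ∪ U|, as in the formula given under Idea 2.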