Learning from Labeled and Unlabeled Data
Machine Learning 10-701
November 7, 2006
Tom M. Mitchell
Machine Learning Department
Carnegie Mellon University

When can Unlabeled Data improve supervised learning?
Important question! In many cases, unlabeled data is plentiful while labeled data is expensive:
• Medical outcomes (x = <symptoms, treatment>, y = outcome)
• Text classification (x = document, y = relevance)
• Customer modeling (x = user actions, y = user intent)
• Sensor interpretation (x = <video, audio>, y = who's there)

When can Unlabeled Data help supervised learning?
Problem setting:
• A set X of instances drawn from an unknown distribution P(X)
• Wish to learn a target function f: X → Y (or, P(Y|X))
• Given a set H of possible hypotheses for f
Given:
• iid labeled examples
• iid unlabeled examples
Wish to determine: an estimate of f (equivalently, of P(Y|X))

Idea 1: Use Labeled and Unlabeled Data to Train a Bayes Net for P(X,Y), then Infer P(Y|X)
[Figure: Bayes net with class variable Y and features X1–X4, alongside a data table of X1–X4 values in which some Y labels are missing ("?").]
• What CPDs are needed?
• How do we estimate them from fully observed data?
• How do we estimate them from partly observed data?

Supervised: Naïve Bayes Learner
Train: For each class y_j of documents
1. Estimate P(Y = y_j)
2. For each word w_i, estimate P(X = w_i | Y = y_j)
Classify (doc): Assign doc to the most probable class*
* assuming the words w_i are conditionally independent, given the class

What if we have labels for only some documents?
[Figure: the same Bayes net and data table, with "?" in place of the missing Y labels.]
Goal: still learn P(Y|X).

EM: Repeat until convergence
1. Use probabilistic labels to train classifier h
2. Apply h to assign probabilistic labels to the unlabeled data
[Nigam et al., 2000]
E Step:  P(Y = y_j | d_i) ∝ P(Y = y_j) · Π_{w in d_i} P(w | Y = y_j)
M Step:  P(w_t | Y = y_j) = (1 + Σ_i n(w_t, d_i) P(Y = y_j | d_i)) / (|V| + Σ_s Σ_i n(w_s, d_i) P(Y = y_j | d_i))
where w_t is the t-th word in the vocabulary, n(w_t, d_i) is the number of occurrences of w_t in document d_i, and |V| is the vocabulary size.

[Figure: results from Nigam et al., 2000, on 20 Newsgroups using one labeled example per class; words sorted by P(w | course) / P(w | ¬course).]

Why/When will this work?
• What's the best case? The worst case?
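The EM loop above (train on probabilistic labels, then re-label the unlabeled documents) can be sketched in a few dozen lines. This is a minimal illustration, not the Nigam et al. implementation: the tiny corpus, class names, and vocabulary below are invented, labeled documents keep their hard labels, and add-one smoothing is used throughout.

```python
# Minimal EM sketch for semi-supervised multinomial Naive Bayes
# (in the spirit of Nigam et al., 2000). All data below is invented.
import math
from collections import Counter

def train_em(labeled, unlabeled, classes, vocab, iters=20):
    """labeled: list of (word_list, class); unlabeled: list of word_list.
    Returns (prior, cond): estimates of P(Y=c) and P(W=w | Y=c)."""
    # Probabilistic labels: hard for labeled docs, uniform for unlabeled.
    resp = [{c: (1.0 if c == y else 0.0) for c in classes} for _, y in labeled]
    resp += [{c: 1.0 / len(classes) for c in classes} for _ in unlabeled]
    docs = [d for d, _ in labeled] + list(unlabeled)
    for _ in range(iters):
        # M step: re-estimate P(Y) and P(W|Y) from the probabilistic labels,
        # with add-one (Laplace) smoothing.
        prior = {c: (1 + sum(r[c] for r in resp)) / (len(classes) + len(docs))
                 for c in classes}
        cond = {}
        for c in classes:
            counts = Counter()
            for d, r in zip(docs, resp):
                for w in d:
                    counts[w] += r[c]          # fractional word counts
            total = sum(counts.values())
            cond[c] = {w: (1 + counts[w]) / (len(vocab) + total) for w in vocab}
        # E step: recompute P(Y=c | d) for the unlabeled documents only.
        for i in range(len(labeled), len(docs)):
            logp = {c: math.log(prior[c]) +
                       sum(math.log(cond[c][w]) for w in docs[i])
                    for c in classes}
            m = max(logp.values())
            z = sum(math.exp(v - m) for v in logp.values())
            resp[i] = {c: math.exp(logp[c] - m) / z for c in classes}
    return prior, cond

vocab = {"homework", "lecture", "goal", "season"}
labeled = [(["homework", "lecture"], "course"),
           (["goal", "season"], "sports")]
unlabeled = [["lecture", "lecture"], ["goal", "goal"], ["season", "lecture"]]
prior, cond = train_em(labeled, unlabeled, ["course", "sports"], vocab)
print(cond["course"]["lecture"] > cond["sports"]["lecture"])   # True
```

Note how the unlabeled documents pull the word-probability estimates around: "lecture" ends up more probable under "course" even though it appeared only once in the labeled data.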
• How can we test which we have?

Summary: Semi-supervised Learning with EM and the Naïve Bayes Model
• If all the data is labeled, this corresponds to supervised training of a Naïve Bayes classifier
• If all the data is unlabeled, it corresponds to unsupervised, mixture-of-multinomials clustering
• If there is both labeled and unlabeled data, then the unlabeled data helps if the Bayes net model is correct (e.g., P(X) is a mixture of class-conditional multinomials with conditionally independent X_i)
• Of course we could use Bayes net models other than Naïve Bayes

Idea 2: Use U to Reweight Labeled Examples
• Most learning algorithms minimize errors over the labeled examples
• But we really want to minimize error over future examples drawn from the same underlying distribution (i.e., the true error of the hypothesis)
• If we knew the underlying distribution P(X), we could weight each labeled training example <x, y> by its probability P(X = x)
• Unlabeled data allows us to estimate P(X)

• Wish to minimize the true error:
  error_true(h) = Σ_x P(x) · δ(h(x) ≠ f(x))
  where δ(h(x) ≠ f(x)) = 1 if hypothesis h disagrees with the true function f on x, else 0
• Usually we approximate this by the error over L, which equals:
  error_L(h) = (1/|L|) Σ_{<x,y> in L} δ(h(x) ≠ y) = Σ_x (n(x, L)/|L|) · δ(h(x) ≠ f(x))
  where n(x, L) = the number of times x occurs in L
• We can produce a better approximation by incorporating U: estimate P̂(X) from the unlabeled data and use it to alter the loss function:
  error(h) ≈ Σ_x P̂(x) · δ(h(x) ≠ f(x))

Reweighting Labeled Examples
• Wish to find the h in H that minimizes the true error above
• We already have an algorithm (e.g., a decision tree learner) that finds the h minimizing error over L
• So just reweight the examples in L according to P̂(x), and have the algorithm minimize the resulting weighted error
• Or, if X is continuous, use L + U to estimate the density p(X), and minimize the correspondingly weighted error

Idea 3: CoTraining
• In some settings, the available features are redundant and we can train two classifiers based on disjoint feature sets
• In this case, the two classifiers should agree on the classification of each unlabeled example
• Therefore, we can use the unlabeled data to constrain joint training of both classifiers

Redundantly Sufficient Features
[Figure: a faculty web page mentioning "Professor Faloutsos", pointed to by a hyperlink reading "my advisor" — either the page text or the hyperlink text alone suffices to classify the page.]

CoTraining Algorithm #1 [Blum & Mitchell, 1998]
Given: labeled data L, unlabeled data U
Loop:
  Train g1 (hyperlink classifier) using L
  Train g2 (page classifier) using L
  Allow g1 to label p positive, n negative examples from U
  Allow g2 to label p positive, n negative examples from U
  Add these self-labeled examples to L

CoTraining: Experimental Results
• Begin with 12 labeled web pages (academic course)
• Provide 1,000 additional unlabeled web pages
• Average error learning from the labeled data alone: 11.1%
• Average error with cotraining: 5.0%
[Figure: error curves from a typical run.]

CoTraining setting:
• Wish to learn f: X → Y, given L and U drawn from P(X)
• The features describing X can be partitioned (X = X1 × X2) such that f can be computed from either X1 or X2
More formally: learn f: X → Y where X = X1 × X2, x is drawn from an unknown distribution, and there exist g1, g2 such that for all x, g1(x1) = g2(x2) = f(x).
One result [Blum & Mitchell, 1998]:
• If
  – X1 and X2 are conditionally independent given Y
  – f is PAC learnable from noisy labeled data
• Then
  – f is PAC learnable from a weak initial classifier plus unlabeled data

Co-Training Rote Learner
[Figure: bipartite graph linking hyperlink values (e.g., "my advisor") to page values; labels propagate within connected components of the graph of L + U.]

Expected rote cotraining error given m labeled examples:
  E[error] = Σ_j P(x ∈ g_j) · (1 − P(x ∈ g_j))^m
where g_j is the j-th connected component of the graph of L + U, and m is the number of labeled examples.

How many unlabeled examples suffice?
• Want to assure that the connected components in the underlying distribution, G_D, are also connected components in the observed sample, G_S
• O(log(N)/α) examples assure that, with high probability, G_S has the same connected components as G_D [Karger, 94], where N is the size of G_D and α is the minimum cut over all connected components of G_D

PAC Generalization Bounds on CoTraining
[Dasgupta et al., NIPS 2001]
[Figure: the bound itself.] This theorem assumes X1 and X2 are conditionally independent given Y.

What if the CoTraining Assumption Is Not Perfectly Satisfied?
• Idea: want classifiers that produce a maximally consistent labeling of the data
• If learning is an optimization problem, what function should we optimize?
[Figure: instances labeled + and − on which the two views disagree.]

Example 2: Learning to Extract Named Entities
"I arrived in Beijing on Saturday." — is Beijing a location?
If: "I arrived in <X> on Saturday."  Then: Location(X)

Co-Training for Named Entity Extraction
(i.e., classifying which strings refer to people, places, dates, etc.)
[Figure: for "I arrived in Beijing Saturday.", Classifier1 sees the spelling view ("Beijing") and produces Answer1, Classifier2 sees the context view ("I arrived in __") and produces Answer2, and cotraining requires the two answers to agree.]
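The rote version of cotraining described above can be sketched directly: each view's "classifier" is just a table mapping a view value to a memorized label, and labels spread through the connected components of the page/hyperlink graph. This is a minimal sketch; the page and hyperlink strings below are invented for illustration, and unlike Algorithm #1 it adds every confidently labeled example rather than p positives and n negatives per round.

```python
# Rote co-training sketch (cf. Blum & Mitchell, 1998).
# Each example has two redundant views (x1 = page, x2 = hyperlink);
# a "rote" classifier per view memorizes view-value -> label.

def cotrain_rote(labeled, unlabeled, rounds=10):
    """labeled: list of ((x1, x2), y); unlabeled: list of (x1, x2)."""
    g1, g2 = {}, {}                  # view value -> memorized label
    L, U = list(labeled), list(unlabeled)
    for _ in range(rounds):
        for (x1, x2), y in L:        # "train" both rote classifiers on L
            g1[x1] = y
            g2[x2] = y
        newly, rest = [], []
        for x1, x2 in U:             # let either view label examples from U
            if x1 in g1:
                newly.append(((x1, x2), g1[x1]))
            elif x2 in g2:
                newly.append(((x1, x2), g2[x2]))
            else:
                rest.append((x1, x2))
        if not newly:                # neither view can label anything new
            break
        L += newly                   # add self-labeled examples to L
        U = rest
    return g1, g2

# Toy run: the label propagates through overlapping views, exactly the
# connected-components picture from the rote-learner slide.
labeled = [(("faloutsos-page", "my-advisor-link"), "+")]
unlabeled = [("faloutsos-page", "prof-faloutsos-link"),
             ("other-page", "prof-faloutsos-link")]
g1, g2 = cotrain_rote(labeled, unlabeled)
print(g1["other-page"])   # "+" reaches a page never seen with a labeled link
```

One labeled example suffices here because all three examples fall in one connected component — matching the expected-error formula, which depends only on how labeled examples land in components.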