Semi supervised learning10601 Machine LearningCan Unlabeled Data improve supervised learning?Important question! In many cases, unlabeled data is plentiful, labeled data expensive• Medical outcomes (x=<patient,treatment>, y=outcome)• Text classification (x=document, y=relevance)• Customer modeling (x=user actions, y=user intent)• …When can Unlabeled Data help supervised learning?Consider setting:• Set X of instances drawn from unknown distribution P(X)• Wish to learn target function f: XY (or, P(Y|X))• Given a set H of possible hypotheses for fGiven:• iid labeled examples• iid unlabeled examples Determine:Four Ways to Use Unlabeled Data for Supervised Learning1. Use to re-weight labeled examples2. Use to help EM learn class-specific generative models3. If problem has redundantly sufficient features, use CoTraining. 4. Use to detect/preempt overfitting1. Use unlabeled data to reweight labeled examples• Most machine learning algorithms (neural nets, decision trees, SVMs) attempt to minimize errors over labeled examples• But our ultimate goal is to minimize error over future examples drawn from the same underlying distribution• If we know the underlying distribution, we should weight each training example by its probability according to this distribution• Unlabeled data allows us to estimate this distribution more accurately, and to reweight our labeled examples accordinglyExample1. reweight labeled examples1 if hypothesis hdisagrees with true function f, else 01. reweight labeled examples1 if hypothesis hdisagrees with true function f, else 0# of times we have x in the labeled set1. reweight labeled examples1 if hypothesis hdisagrees with true function f, else 0# of times we have x in the labeled set# of times we have x in the unlabeled setExample2. Improve EM clustering algorithms• Consider completely unsupervised clustering, where we assume data X is generated by a mixture of probability distributions, one for each cluster– For example, Gaussian mixtures• Some classifier learning algorithms such as Gaussian Bayes classifiers also assumes the data X is generated by a mixture of distributions, one for each class Y• Supervised learning: estimate P(X|Y) from labeled data• Opportunity: estimate P(X|Y) from labeled and unlabeled data, using EM as in clusteringBag of Words Text Classificationaardvark 0about 2all 2Africa 1apple 0anxious 0...gas 1...oil 1…Zaire 0Baseline: Naïve Bayes LearnerTrain:For each class cjof documents1. Estimate P(cj )2. For each word wiestimate P(wi| cj )Classify (doc):Assign doc to most probable classdocwjijjicwPcP )|()(maxargNaïve Bayes assumption: words are conditionally independent, given classExpectation Maximization (EM) Algorithm• Use labeled data L to learn initial classifier hLoop:• E Step:– Assign probabilistic labels to U, based on h• M Step:– Retrain classifier h using both L (with fixed membership) and assigned labels to U (soft membership)• Under certain conditions, guaranteed to converge to locally maximum likelihood h2. Generative Bayes modelYX1X4X3X2Y X1 X2 X3 X41 0 0 1 10 0 1 0 00 0 0 1 0? 0 1 1 0? 0 1 0 1Learn P(Y|X)E Step:M Step:wtis t-th word in vocabularyUsing one labeled example per classNewsgrop postings – 20 newsgroups, 1000/groupExperimental Evaluation3. If Problem Setting Provides Redundantly Sufficient Features, use CoTraining• In some settings, available data features are so redundant that we can train two classifiers using different features• In this case, the two classifiers should agree on the classification for each unlabeled example• Therefore, we can use the unlabeled data to constrain training of both classifiers, forcing them to agree3. CoTraining)()()()(,:22112121xfxgxgxggandondistributiunknownfromdrawnxwhereXXXwhereYXflearnRedundantly Sufficient FeaturesProfessor Faloutsosmy advisorCoTraining Algorithm [Blum&Mitchell, 1998]Given: labeled data L, unlabeled data ULoop:Train g1 (hyperlink classifier) using LTrain g2 (page classifier) using LAllow g1 to label p positive, n negative examps from UAllow g2 to label p positive, n negative examps from U Add the intersection of the self-labeled examples to LCoTraining: Experimental Results• begin with 12 labeled web pages (academic course)• provide 1,000 additional unlabeled web pages• average error: learning from labeled data 11.1%; • average error: cotraining 5.0% (when both agree)Typical run:Co-Training Rote LearnerMy advisor+--pageshyperlinksClassifying Jobs for FlipDogX1: job titleX2: job description4. Use U to Detect/Preempt Overfitting• Overfitting is a problem for many learning algorithms (e.g., decision trees, neural networks)• The symptom of overfitting: complex hypothesis h2 performs better on training data than simpler hypothesis h1, but worse on test data• Unlabeled data can help detect overfitting, by comparing predictions of h1 and h2 over the unlabeled examples – The rate at which h1 and h2 disagree on U should be the same as the rate on L, unless overfitting is occuring• Definition of distance metric– Non-negative d(f,g)≥0; – symmetric d(f,g)=d(g,f); – triangle inequality d(f,g) · d(f,h)+d(h,g)• Classification with zero-one loss:• Regression with squared loss:Defining a distance metricUsing the distance metricGenerated y values contain zero mean Gaussian noise eY=f(x)+eExperimental Evaluation of TRI[Schuurmans & Southey, MLJ 2002]• Use it to select degree of polynomial for regression• Compare to alternatives such as cross validation, structural risk minimization, …Cross validation (Ten-fold)Structural risk minimizationApproximation ratio: true error of selected hypothesistrue error of best hypothesis consideredResults using 200 unlabeled, t labeledperformance in top .50 of trialsSummarySeveral ways to use unlabeled data in supervised learningOngoing research area1. Use to reweight labeled examples2. Use to help EM learn class-specific generative models3. If problem has redundantly sufficient features, use CoTraining4. Use to detect/preempt overfittingFurther Reading• EM approach: K.Nigam, et al., 2000. "Text Classification from Labeled and Unlabeled Documents using EM", Machine Learning, 39, pp.103—134.• CoTraining: A. Blum and T. Mitchell, 1998. ―Combining Labeled and Unlabeled Data with Co-Training,‖ Proceedings of the 11th Annual Conference on Computational Learning Theory (COLT-98).• S. Dasgupta, et al., ―PAC
View Full Document