Expectation Maximization and Learning from Partly Unobserved Data (part 2)
Machine Learning 10-701, April 2005
Tom M. Mitchell, Carnegie Mellon University

Outline
- Clustering: K-means, EM, Mixture of Gaussians
- Training classifiers with partially unlabeled data:
  - Naive Bayes and EM
  - Reweighting the labeled examples using P(X)
  - Co-training
  - Regularization based on ...

1. Unsupervised Clustering: K-means and Mixtures of Gaussians

Clustering
- Given a set of data points, group them.
- Unsupervised learning.
- Which patients are similar? (or which earthquakes, customers, faces, ...)

K-means Clustering
Given data ⟨x1, ..., xn⟩ and K, assign each xi to one of K clusters C1 ... CK, minimizing

    J = Σj Σ_{xi ∈ Cj} ||xi − μj||²

where μj is the mean over all points in cluster Cj.

K-means Algorithm
Initialize the means μ1 ... μK randomly.
Repeat until convergence:
1. Assign each point xi to the cluster with the closest mean μj.
2. Calculate the new mean μj for each cluster.

K-means applet: run the applet; try 3 clusters, 15 points.

Mixtures of Gaussians
- K-means is EM'ish, but it makes hard assignments of xi to clusters.
- Let's derive a real EM algorithm for clustering.
- What objective function shall we optimize? Maximize the data likelihood.
- What form of P(X) should we assume? A mixture of Gaussians.

Mixture distribution
Assume P(x) is a mixture of K different Gaussians, and that each data point x is generated by a 2-step process:
1. z ← choose one of the K Gaussians, according to priors π1 ... πK
2. Generate x according to the Gaussian N(μz, Σz)

so P(x) = Σ_{k=1..K} πk N(x; μk, Σk).

EM for Mixture of Gaussians
Simplify to make this easier:
1. Assume the Xi are conditionally independent given Z.
2. Assume only 2 classes.
3. Assume σ known; π1 ... πK and μ1i ... μKi unknown.
Observed: X = ⟨X1, X2, X3, X4⟩; unobserved: Z.
[Graphical model: Z is the parent of X1, X2, X3, X4.]

EM
Given observed variables X and unobserved Z, define

    Q(θ′ | θ) = E_{Z|X,θ}[ log P(X, Z | θ′) ]

Iterate until convergence:
- E Step: Calculate P(Z(n) | X(n), θ) for each example X(n). Use this to construct Q(θ′ | θ).
- M Step: Replace the current θ by θ′ = argmax_{θ′} Q(θ′ | θ).

EM — E Step
Calculate P(Z(n) | X(n)) for each observed example X(n) = ⟨x1(n), x2(n), ..., xT(n)⟩:

    P(z(n)=1 | x(n)) = π1 Πi P(xi(n) | z=1) / Σk πk Πi P(xi(n) | z=k)

where P(xi | z=k) is the Gaussian N(xi; μki, σ).

EM — M Step: update for π
First consider the update for π (the terms involving μ and σ have no influence on this maximization):

    π ← (1/N) Σn P(z(n)=1 | x(n))

Compare to the MLE if Z were observable: π = Count(z(n)=1) / N.

EM — M Step: update for μji
Now consider the update for μji (the terms involving π have no influence):

    μji ← Σn P(z(n)=j | x(n)) xi(n) / Σn P(z(n)=j | x(n))

Compare the above to the MLE if Z were observable:

    μji = Σn δ(z(n)=j) xi(n) / Σn δ(z(n)=j)

Mixture of Gaussians applet: run the applet; try 2 clusters; see the different local minima reached from different random starts.

K-means vs Mixture of Gaussians
- Both are iterative algorithms that assign points to clusters.
- Objective function: K-means minimizes Σj Σ_{xi∈Cj} ||xi − μj||²; Mixture of Gaussians maximizes the data likelihood P(X).
- Mixture of Gaussians is the more general formulation: it becomes equivalent to K-means when Σk = σ²I and σ → 0.

Using Unlabeled Data to Help Train a Naive Bayes Classifier
Learn P(Y | X).
[Graphical model: Y is the parent of X1, X2, X3, X4. Training table over ⟨Y, X1, ..., X4⟩: Y is observed for some rows and missing for the unlabeled rows.]

From Nigam et al., 2000:
- E Step: estimate P(cj | di) for every document di, labeled and unlabeled, using the current parameters.
- M Step: re-estimate P(wt | cj) and P(cj) from the resulting expected counts, where wt is the t-th word in the vocabulary.

Elaboration 1: downweight the influence of the unlabeled examples by a factor λ.
New M step: the expected counts contributed by unlabeled documents are weighted by λ, with λ chosen by cross-validation.
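A minimal sketch of these E and M steps for a semi-supervised multinomial Naive Bayes text classifier, in the spirit of Nigam et al. (2000) and including the λ downweighting of Elaboration 1. The function names, the word-count representation of documents, and the use of Laplace smoothing are illustrative assumptions, not details from the slides:

```python
import numpy as np

def train_semisupervised_nb(X_lab, y_lab, X_unl, n_classes, lam=0.1, n_iters=20):
    """EM for semi-supervised multinomial Naive Bayes (sketch).
    X_lab: (n_l, V) word-count matrix of labeled docs; y_lab: (n_l,) class ids in 0..K-1.
    X_unl: (n_u, V) word-count matrix of unlabeled docs.
    lam: weight of the unlabeled expected counts in the M step."""
    V = X_lab.shape[1]
    R_lab = np.eye(n_classes)[y_lab]                      # hard 0/1 responsibilities for L

    def m_step(R_unl):
        # Labeled docs count fully; unlabeled expected counts are downweighted by lam.
        R = np.vstack([R_lab, lam * R_unl])               # (n_l + n_u, K)
        X = np.vstack([X_lab, X_unl])                     # (n_l + n_u, V)
        class_counts = R.sum(axis=0)                      # expected number of docs per class
        priors = (class_counts + 1e-12) / (class_counts.sum() + n_classes * 1e-12)
        word_counts = R.T @ X                              # expected word counts per class, (K, V)
        word_probs = (word_counts + 1) / (word_counts.sum(axis=1, keepdims=True) + V)  # Laplace
        return np.log(priors), np.log(word_probs)

    def e_step(log_priors, log_word_probs):
        # log P(c_j | d_i) ∝ log P(c_j) + Σ_t N(w_t, d_i) log P(w_t | c_j)
        log_post = X_unl @ log_word_probs.T + log_priors   # (n_u, K)
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        return post / post.sum(axis=1, keepdims=True)

    # Initialize from the labeled data alone, then iterate E and M steps.
    log_priors, log_word_probs = m_step(np.zeros((X_unl.shape[0], n_classes)))
    for _ in range(n_iters):
        R_unl = e_step(log_priors, log_word_probs)         # E step
        log_priors, log_word_probs = m_step(R_unl)         # M step
    return log_priors, log_word_probs

def predict(X, log_priors, log_word_probs):
    return np.argmax(X @ log_word_probs.T + log_priors, axis=1)
```

With λ = 0 this reduces to ordinary supervised Naive Bayes on the labeled documents; λ = 1 weighs unlabeled documents equally with labeled ones.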
Using one labeled example per class: [results figure from the original slides].

Experimental Evaluation
- Newsgroup postings: 20 newsgroups, 1000 per group.
- Web page classification: student, faculty, course, project; 4,199 web pages.
- Reuters newswire articles: 12,902 articles, 90 topic categories.

20 Newsgroups: [results figures from the original slides].

Combining Labeled and Unlabeled Data
How else can unlabeled data be useful for supervised learning / function approximation?

1. Use U to reweight the labeled examples. The true error of a hypothesis h is

    E_{x~P(X)}[ δ(h(x) ≠ f(x)) ]

where δ = 1 if hypothesis h disagrees with the true function f, else 0; the unlabeled data U gives a better estimate of P(X) with which to weight the labeled examples.

3. If the Problem Setting Provides Redundantly Sufficient Features, use CoTraining.

In some settings the available data features are redundant, and we can train two classifiers using different features. In this case the two classifiers should at least agree on the classification of each unlabeled example. Therefore, we can use the unlabeled data to constrain the training of both classifiers, forcing them to agree.

Redundantly Sufficient Features
[Example figure, repeated across several slides: a faculty web page ("Professor Faloutsos") reached by a hyperlink whose anchor text reads "my advisor". Either view alone — the words on the page itself, or the words on hyperlinks pointing to it — suffices to classify the page.]

CoTraining Algorithm #1 (Blum & Mitchell, 1998)
Given: labeled data L, unlabeled data U.
Loop:
- Train g1 (hyperlink classifier) using L.
- Train g2 (page classifier) using L.
- Allow g1 to label p positive and n negative examples from U.
- Allow g2 to label p positive and n negative examples from U.
- Add these self-labeled examples to L.

CoTraining: Experimental Results
- Begin with 12 labeled web pages (academic course).
- Provide 1,000 additional unlabeled web pages.
- Average error learning from the labeled data alone: 11.1%.
- Average error with cotraining: 5.0%.
[Typical run: figure from the original slides.]

CoTraining Setting
Learn f : X → Y, where X = X1 × X2, x is drawn from an unknown distribution, and there exist g1, g2 with g1(x1) = g2(x2) = f(x).
If
- x1 and x2 are conditionally independent given y, and
- f is PAC learnable from noisy labeled data,
then f is PAC learnable from a weak initial classifier plus unlabeled data.

Co-Training Rote Learner
[Figure, repeated across two slides: bipartite graph connecting pages and hyperlinks, e.g. the hyperlink "my advisor" connected to the pages it points to.]

Expected rote cotraining error given m examples
In the cotraining setting above (learn f : X → Y with X = X1 × X2, x drawn from an unknown distribution, g1(x1) = g2(x2) = f(x)):

    E[error] = Σj P(x ∈ gj) (1 − P(x ∈ gj))^m

where gj is the j-th connected component of the graph.

How many unlabeled examples suffice?
We want to assure that connected components in the underlying distribution, GD, are also connected components in the observed sample, GS. On the order of log N examples (up to a factor involving the min cut) assure that, with high probability, GS has the same connected components as GD [Karger, 94], where N is the size of GD and the min cut is taken over all connected components of GD.

PAC Generalization Bounds on CoTraining: Dasgupta et al., NIPS 2001.

What if the CoTraining Assumption Is Not Perfectly Satisfied?
- Idea: we want classifiers that produce a maximally consistent labeling of the data.
- If learning is an optimization problem, what function should we optimize?

What Objective Function?

    E = E1 + E2 + c3·E3 + c4·E4

- Error on labeled examples:
    E1 = Σ_{⟨x,y⟩∈L} (y − ĝ1(x1))²,   E2 = Σ_{⟨x,y⟩∈L} (y − ĝ2(x2))²
- Disagreement over unlabeled examples:
    E3 = Σ_{x∈U} (ĝ1(x1) − ĝ2(x2))²
- Misfit to estimated class priors:
    E4 compares the class prior estimated from the labeled data, (1/|L|) Σ_{⟨x,y⟩∈L} y, with the average prediction of ĝ1 and ĝ2 over L ∪ U.

What Function Approximators?

    ĝ1(x) = 1 / (1 + e^(−Σj w1j x1j)),   ĝ2(x) = 1 / (1 + e^(−Σj w2j x2j))

- Same functional form as Naive Bayes and Max Entropy.
- Use gradient descent to simultaneously learn g1 and g2, directly minimizing E = E1 + E2 + c3·E3 + c4·E4.
- No word independence assumption; uses both labeled and unlabeled data.

Classifying Jobs for FlipDog
X1 = job title, X2 = job description.

Gradient CoTraining: classifying FlipDog job descriptions (SysAdmin vs. WebProgrammer)
Final accuracy: labeled data alone 86%, cotraining 96%.

Gradient CoTraining: classifying upper-case sequences as person names
Error rates compared for: using labeled data only; cotraining; cotraining without fitting class priors (E4). Two conditions: 25 labeled + 5,000 unlabeled examples, and 2,300 labeled + 5,000 unlabeled examples. [Error-rate table from the original slides.]
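Two sketches close out these notes. First, a minimal version of the discrete CoTraining loop of Algorithm #1 (Blum & Mitchell, 1998) described above. It assumes binary labels in {0, 1} and scikit-learn-style classifiers with fit/predict_proba; the smaller working pool U' used in the original paper is omitted for brevity:

```python
import numpy as np

def cotrain(clf1, clf2, X1_lab, X2_lab, y_lab, X1_unl, X2_unl,
            p=1, n=3, n_rounds=30):
    """Co-training loop in the style of Blum & Mitchell (1998), sketched.
    clf1, clf2: classifiers with fit(X, y) / predict_proba(X), each trained on its own view.
    p, n: number of positive / negative examples each classifier self-labels per round."""
    X1_lab, X2_lab = np.asarray(X1_lab), np.asarray(X2_lab)
    X1_unl, X2_unl = np.asarray(X1_unl), np.asarray(X2_unl)
    y_lab = np.asarray(y_lab)
    unl = np.arange(len(X1_unl))                     # indices still in U

    for _ in range(n_rounds):
        if len(unl) == 0:
            break
        clf1.fit(X1_lab, y_lab)                      # train g1 (view 1) on L
        clf2.fit(X2_lab, y_lab)                      # train g2 (view 2) on L

        chosen = {}                                  # unlabeled index -> self-assigned label
        for clf, X_view in ((clf1, X1_unl), (clf2, X2_unl)):
            probs = clf.predict_proba(X_view[unl])[:, 1]
            for i in unl[np.argsort(-probs)[:p]]:    # most confident positives
                chosen.setdefault(int(i), 1)
            for i in unl[np.argsort(probs)[:n]]:     # most confident negatives
                chosen.setdefault(int(i), 0)

        idx = np.array(list(chosen.keys()))
        lab = np.array(list(chosen.values()))
        X1_lab = np.vstack([X1_lab, X1_unl[idx]])    # add self-labeled examples to L
        X2_lab = np.vstack([X2_lab, X2_unl[idx]])
        y_lab = np.concatenate([y_lab, lab])
        unl = np.setdiff1d(unl, idx)                 # remove them from U

    return clf1, clf2
```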
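Second, a minimal sketch of the gradient cotraining idea: two logistic classifiers, one per view, trained jointly by gradient descent on E = E1 + E2 + c3·E3 + c4·E4. The squared-error form of E1–E3 follows the objective above, but the concrete E4 expression, the coefficients, and the learning-rate schedule are illustrative assumptions rather than the authors' implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_cotrain(X1_lab, X2_lab, y_lab, X1_unl, X2_unl,
                     c3=1.0, c4=1.0, lr=0.05, n_iters=2000):
    """Jointly train logistic g1, g2 by gradient descent on E1 + E2 + c3*E3 + c4*E4.
    Inputs are numpy arrays; labels y_lab are in {0, 1}."""
    X1_all = np.vstack([X1_lab, X1_unl])
    X2_all = np.vstack([X2_lab, X2_unl])
    w1 = np.zeros(X1_lab.shape[1])
    w2 = np.zeros(X2_lab.shape[1])
    prior = y_lab.mean()                              # class prior estimated from L
    M = len(X1_all)

    for _ in range(n_iters):
        g1_l, g2_l = sigmoid(X1_lab @ w1), sigmoid(X2_lab @ w2)
        g1_u, g2_u = sigmoid(X1_unl @ w1), sigmoid(X2_unl @ w2)
        g1_a, g2_a = sigmoid(X1_all @ w1), sigmoid(X2_all @ w2)

        # E1, E2: squared error on labeled examples
        grad1 = X1_lab.T @ (2 * (g1_l - y_lab) * g1_l * (1 - g1_l))
        grad2 = X2_lab.T @ (2 * (g2_l - y_lab) * g2_l * (1 - g2_l))

        # E3: squared disagreement on unlabeled examples
        dis = g1_u - g2_u
        grad1 += c3 * X1_unl.T @ (2 * dis * g1_u * (1 - g1_u))
        grad2 -= c3 * X2_unl.T @ (2 * dis * g2_u * (1 - g2_u))

        # E4 (assumed form): (prior - mean prediction over L ∪ U)^2
        mean_pred = 0.5 * (g1_a.mean() + g2_a.mean())
        misfit = 2 * (mean_pred - prior) / (2 * M)
        grad1 += c4 * misfit * X1_all.T @ (g1_a * (1 - g1_a))
        grad2 += c4 * misfit * X2_all.T @ (g2_a * (1 - g2_a))

        w1 -= lr * grad1
        w2 -= lr * grad2

    predict = lambda X1, X2: 0.5 * (sigmoid(X1 @ w1) + sigmoid(X2 @ w2))
    return w1, w2, predict
```

Setting c4 = 0 corresponds to the "cotraining without fitting class priors (E4)" condition in the person-name experiment above.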